# SeVeN: Semantic Vector Networks
Welcome to the home page of SeVeN (Semantic Vector Networks). SeVeN is a resource that aims to bridge the gap between the discrete relation labels found in resources like WordNet and the attributional and relational information naturally encoded in word embeddings.
SeVeN is a semantic network in which each edge is itself a vector. The current version is derived by building a ~1M edge graph from the English Wikipedia, leveraging pairwise PPMI word associations. A relation vector is then learned for each pair of words. Finally, each relation vector is compressed and purified with an autoencoder architecture, reducing its size down to only 10 dimensions.
The current release is based on the Google News word2vec embeddings.
- Download the 1800d original vectors from here. Below is an output sample.
```
sevenlong.most_similar('roman_numerals')
>>> ('arabic_numerals', 0.9820329546928406)
sevenlong.most_similar('french_revolution')
>>> ('1789_revolution', 0.9796866178512573)
sevenlong.most_similar('netflix_streaming')
>>> ('hulu_streaming', 0.9768953323364258)
```
- Download the 10d purified vectors from here. Below is an output sample.
```
seven.most_similar('roman_numerals')
>>> ('arabic_alphabet', 0.9980242848396301)
seven.most_similar('french_revolution')
>>> ('cuban_revolution', 0.9987969398498535)
seven.most_similar('netflix_streaming')
>>> ('playstation_console', 0.9971717596054077)
```
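The `most_similar` calls above are standard cosine nearest-neighbour queries over the relation space. As a minimal illustration of what such a query computes (the vocabulary and 3-d vectors below are made up for the example; the released vectors are 1800-d or 10-d):

```python
import numpy as np

# hypothetical 3-d relation vectors for a toy vocabulary (values made up)
vecs = {
    "roman_numerals":    np.array([0.9, 0.1, 0.0]),
    "arabic_numerals":   np.array([0.8, 0.2, 0.1]),
    "french_revolution": np.array([0.0, 0.9, 0.4]),
}

def most_similar(query, topn=1):
    """Rank all other entries by cosine similarity to the query vector."""
    q = vecs[query] / np.linalg.norm(vecs[query])
    scores = {w: float(v @ q / np.linalg.norm(v))
              for w, v in vecs.items() if w != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]
```

With these toy vectors, `most_similar("roman_numerals")` ranks `arabic_numerals` first, mirroring the queries shown above.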
## A Working Example

### Get corpus and preprocess
In this tutorial we assume as initial corpus a large, already tokenized text file. As an example we will use the biomedical corpus provided in the SemEval 2018 task on Hypernym Discovery, a 130M-word corpus consisting of abstracts and full papers from PubMed. This corpus has 3,239,945 lines. After cloning the repo, download and save to
Then, keep the top 10k words by running this command:
```
python3 src/preprocess/get_vocab.py -c data/corpora/pubmed/2A_med_pubmed_tokenized.txt -n 10000 -sw data/resources/english_stopwords.txt -o data/corpora/pubmed/
```
- `-c` – corpus file
- `-n` – top *n* words to consider
- `-sw` – stopwords file
- `-o` – output folder, where a frequency file ending in `_frequencies.tsv` will be saved
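The vocabulary step boils down to counting tokens and keeping the most frequent ones; a rough sketch of the idea (the exact tokenization and stopword handling in `get_vocab.py` may differ):

```python
from collections import Counter

def top_n_vocab(lines, n, stopwords):
    """Count whitespace-separated tokens, drop stopwords, keep the n most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.lower().split() if tok not in stopwords)
    return counts.most_common(n)

# toy corpus lines (made up)
corpus = ["the cardiac arrest patients", "cardiac surgery patients recovered"]
```

Running `top_n_vocab(corpus, 2, {"the"})` keeps the two most frequent non-stopword tokens with their counts.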
Please choose the number of splits according to the number of threads you can allocate to each of the subsequent processes. In this example we will assume 25 threads are available and, given the size of the corpus, each chunk should therefore have about 130k lines.
Split the corpus with:
```
python src/preprocess/split_corpus.py -c data/corpora/pubmed/2A_med_pubmed_tokenized.txt -o data/corpora/pubmed/ -n 130000
```
This script produces *x* chunks of the corpus in the `-o` output folder.
With the vocabulary built and the corpus divided into roughly equal-sized chunks, the goal is to generate a set of *pmi pairs. The first step is to build a cooccurrence matrix for the words in the vocabulary. We generate a weighted cooccurrence matrix, where each cooccurrence is weighted by the distance (in tokens) between center and context word. We consider all words in the frequency vocabulary as valid center and context words. First, we iterate over the corpus and save triples of the form
`<center_word, context_word, cooc>`, where `cooc = 1/distance` between `center_word` and `context_word`.
Get them with:
```
python3 src/preprocess/get_triples_launcher.py -c data/corpora/pubmed/ -f data/corpora/pubmed/2A_med_pubmed_tokenized.txt_frequencies.tsv -o data/corpora/pubmed/triples/
```
If the frequency file you are using is external, you can set a vocabulary cutoff threshold with the `-v` flag. This script will generate two files per split (or chunk): one with raw cooccurrences and one with the weighted ones. In this example we only use the weighted ones, but feel free to experiment with other cooccurrence weighting schemata.
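The triple extraction described above can be sketched as follows (a simplified single-sentence version; the actual script works over corpus chunks in parallel and writes the triples to disk; the window size here is an assumption):

```python
def cooc_triples(tokens, vocab, window=10):
    """Emit (center_word, context_word, 1/distance) triples within a window."""
    triples = []
    for i, center in enumerate(tokens):
        if center not in vocab:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in vocab:
                triples.append((center, tokens[j], 1.0 / abs(i - j)))
    return triples

# toy sentence and vocabulary (made up)
sent = "cardiac arrest is a cardiac emergency".split()
vocab = {"cardiac", "arrest", "emergency"}
```

Adjacent words get weight 1.0, while words five tokens apart get 0.2, so closer cooccurrences count more.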
### From triples to cooc matrix and pmi rankings
The set of triples generated in the previous step can be iterated over to aggregate cooccurrences into a single cooccurrence matrix.
Issue this command:
```
python3 src/preprocess/get_dict_matrix_from_triples.py -c data/corpora/pubmed/triples -o data/corpora/pubmed/cooc
```
This will iterate over all triples and generate three files in the `-o` output folder:

- `N_vals.txt` – contains two lines: the sums of raw and weighted cooccurrences
- `W_raw` – raw cooccurrence matrix
- `W_weight` – weighted cooccurrence matrix
Then, in order to convert the (raw or weighted) cooccurrence matrix into a (P)PMI matrix, we call:
```
python3 -i src/preprocess/cooc2pmi.py -d data/corpora/pubmed/cooc/W_weight -n data/corpora/pubmed/cooc/N_vals.txt -o data/corpora/pubmed/pmi/ -t 500
```
- `-d` – path of the dumped raw cooccurrence dictionary
- `-t` – top K context words (by PPMI) to be extracted for each center word
This step will produce a text file of `|V|` lines with `K` columns, where `K` is the number of pairs to be obtained. The first column contains the center word; from the second column onwards, each context word is concatenated with its PPMI score via `_`. For example, for the center word *well*:
```
well versed_5.84 proportioned_4.88 muscled_4.83 mannered_4.83 faring_4.79 artesian_4.60 meshed_4.47 camouflaged_4.45 ...
```
It is advisable to set K to a large number so that we can later find and filter pairs by minimum frequency in the corpus (e.g., the top PPMI context word for *style* is *gangnam*, which is probably fairly infrequent, and we may not want such a specific association, but rather *neo-classic* or *renaissance*).
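The conversion from cooccurrence counts to PPMI rankings follows the standard formula `PPMI(w, c) = max(0, log(p(w,c) / (p(w) p(c))))`; a small sketch over an in-memory dictionary, assuming `cooc2pmi.py` implements roughly this computation (the toy counts are made up):

```python
import math
from collections import defaultdict

def ppmi_rows(cooc, topk):
    """cooc: {center: {context: count}} -> top-k contexts per center by PPMI."""
    N = sum(v for row in cooc.values() for v in row.values())
    w_tot = {w: sum(row.values()) for w, row in cooc.items()}
    c_tot = defaultdict(float)
    for row in cooc.values():
        for c, v in row.items():
            c_tot[c] += v
    out = {}
    for w, row in cooc.items():
        # PPMI(w, c) = max(0, log(cooc * N / (count(w) * count(c))))
        scored = [(c, max(0.0, math.log(v * N / (w_tot[w] * c_tot[c]))))
                  for c, v in row.items()]
        out[w] = sorted(scored, key=lambda kv: -kv[1])[:topk]
    return out

# toy counts (made up)
cooc = {"well": {"versed": 8, "the": 2}, "style": {"the": 10}}
```

Frequent-everywhere contexts like "the" get a low or zero PPMI, while distinctive contexts like "versed" rank highly.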
### Generate graph edges
With the ppmi pairs extracted, the next step is to produce a file with target word pairs for which contexts will be extracted. To do this, in addition to a minimum ppmi-score threshold (or taking the top `N` context words for each center word ranked by ppmi), we also consider overall word frequency and minimum raw cooccurrence (for an edge to be meaningful we want it to occur in at least `X` sentences, regardless of ppmi association strength).
Run the following command:
```
python3 -i src/preprocess/get_graph_nodes.py -pmi data/corpora/pubmed/pmi/ppmi_pairs_topk=500.tsv -f data/corpora/pubmed/2A_med_pubmed_tokenized.txt_frequencies.tsv -v 10000 -rawcooc data/corpora/pubmed/cooc/W_raw -minf 50 -mincooc 10 -npairs 20 -o data/corpora/pubmed/pmi
```
- `-pmi` – path of the ppmi pairs (output of the previous step)
- `-rawcooc` – raw cooccurrence matrix
- `-minf` – minimum word frequency
- `-mincooc` – minimum raw cooccurrences
- `-npairs` – select at most this many edges for each context word
This step will produce, in the `-o` folder, a text file whose name encodes the chosen thresholds (`selected_pairs_minfreq_X_maxpairs_Y_mincooc_Z.tsv`, where `X`, `Y` and `Z` refer to the corresponding arguments). The total number of pairs might be lower than the expected `center × context` count because there might not be enough word pairs satisfying the desired thresholds.
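The edge selection logic can be sketched as follows (a simplified in-memory version of what `get_graph_nodes.py` presumably does; the data structures here are assumptions made for illustration):

```python
def select_edges(ppmi_pairs, freq, raw_cooc, minf, mincooc, npairs):
    """Keep, per center word, at most npairs PPMI-ranked edges whose words
    are frequent enough and which cooccur at least mincooc times."""
    edges = {}
    for center, ranked in ppmi_pairs.items():
        if freq.get(center, 0) < minf:
            continue
        kept = [(c, s) for c, s in ranked
                if freq.get(c, 0) >= minf
                and raw_cooc.get((center, c), 0) >= mincooc]
        if kept:
            edges[center] = kept[:npairs]
    return edges

# toy inputs (made up): 'rare_tok' fails the frequency threshold
ppmi_pairs = {"cardiac": [("rare_tok", 6.0), ("arrest", 4.0)]}
freq = {"cardiac": 100, "arrest": 90, "rare_tok": 5}
raw_cooc = {("cardiac", "arrest"): 30, ("cardiac", "rare_tok"): 40}
```

Note how a high-PPMI but infrequent context ("rare_tok") is filtered out, which is exactly the motivation for the `-minf` threshold above.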
TO-DO: Change background execution to multiprocess queue
With the selected edges in place, the next step is to extract left, mid and right context (and reversed) for each word pair.
```
python3 -i src/preprocess/get_contexts_launcher.py -c data/corpora/pubmed/ -p data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -mw 10 -sw 10 -o data/corpora/pubmed/contexts/
```
- `-c` – input corpus batches
- `-o` – output folder
- `-p` – generated pairs file
- `-mw` – mid word context
- `-sw` – side (left and right) word context
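Conceptually, for each selected pair the script looks for sentences where both words cooccur and records the surrounding token windows; a minimal single-sentence sketch of this idea (the real script also extracts the reversed direction and processes corpus batches):

```python
def pair_contexts(tokens, w1, w2, mw=10, sw=10):
    """Left/mid/right token windows around the first w1 ... w2 occurrence,
    with the mid context at most mw tokens and side contexts sw tokens."""
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        for j in range(i + 1, min(len(tokens), i + mw + 2)):
            if tokens[j] == w2:
                return (tokens[max(0, i - sw):i],   # left context
                        tokens[i + 1:j],            # mid context
                        tokens[j + 1:j + 1 + sw])   # right context
    return None

# toy sentence (made up)
tokens = "patients with cardiac arrest were treated".split()
```

For the pair `(cardiac, arrest)` this yields an empty mid context and the words on either side, which later become the averaged context embeddings.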
As a preprocessing step before vectorization, run:
```
python3 -i src/preprocess/collect_contexts.py -d data/corpora/pubmed/contexts/ -r data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/contexts/contexts.tsv
```
TO-DO: Add weights
The idea is to build a vector of `6*d` dimensions, where `d` is the dimensionality of the pretrained word embeddings of choice. We do this by running:
```
python3 -i src/preprocess/vectorize.py -r data/corpora/pubmed/contexts/contexts.tsv -wv GoogleNews-vectors-negative300.bin -p data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/vectors
```
The script iterates over the large relation file twice: first to retrieve a relation-to-id mapping, and second to update the zero-initialized matrix with the corresponding values. The output is a large vector space model.
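A plausible sketch of how such a `6*d` vector can be assembled from averaged context embeddings (an illustration of the idea, not the exact code in `vectorize.py`; the toy 3-d embeddings are made up):

```python
import numpy as np

def relation_vector(contexts_ab, contexts_ba, emb, d):
    """Concatenate averaged left/mid/right context embeddings for a word pair
    in both directions, yielding a 6*d relation vector."""
    def avg(tokens):
        vs = [emb[t] for t in tokens if t in emb]
        return np.mean(vs, axis=0) if vs else np.zeros(d)
    windows = list(contexts_ab) + list(contexts_ba)   # 3 + 3 windows
    return np.concatenate([avg(w) for w in windows])

# toy 3-d embeddings and (left, mid, right) contexts per direction (made up)
emb = {"a": np.ones(3), "b": 2 * np.ones(3)}
contexts_ab = (["a"], ["b"], [])
contexts_ba = ([], ["a", "b"], ["a"])
```

With `d = 3` the result has `6 * 3 = 18` dimensions, one averaged slot per context window and direction.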
To speed up the process, you may want to binarize the text files. You can do this with convertvec, either manually or by running:
```
python3 src/preprocess/vectors2bin.py -d data/corpora/pubmed/vectors/
```
### Autoencode Relation Vectors
In order to 'purify' and reduce the dimensionality of the original `d*6` relation vectors, we run them through an autoencoder architecture. The script takes as input a relational vector space model and produces, for different dimensionalities, compressed representations obtained with different autoencoder architectures. The architecture used in the Coling paper produces the models ending in `_forget.vec`. An additional autoencoded model is also generated for each hidden dimension (ending in `_regular.vec`); this is a vanilla autoencoder where input and reconstructed output are the same.
You may run the autoencoder with:
```
python3 -i src/preprocess/autoencoder.py -rv data/corpora/pubmed/vectors/relation_vectors.vec.bin -wv ../resources/embeddings/GoogleNews-vectors-negative300.bin.bz -rf data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/vectors/output-folder
```
Refer to the script to change the hyperparameters. You may run the `vectors2bin.py` script again to binarize the resulting models.
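As a rough illustration of the compression idea only (the Coling 'forget' architecture differs, and `autoencoder.py` should be consulted for the real setup), here is a minimal vanilla linear autoencoder in plain numpy, trained by gradient descent on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 60, 10, 200              # input dim, code dim, number of relation vectors
X = rng.standard_normal((n, d))    # stand-in for the relation vector matrix

W_enc = 0.1 * rng.standard_normal((d, h))   # encoder weights (d -> h)
W_dec = 0.1 * rng.standard_normal((h, d))   # decoder weights (h -> d)

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error of the linear autoencoder."""
    return float(((X @ W_enc @ W_dec - X) ** 2).mean())

first = mse(X, W_enc, W_dec)
lr = 0.5
for _ in range(500):
    Z = X @ W_enc                          # compressed h-dim codes
    G = 2.0 * (Z @ W_dec - X) / X.size     # dLoss/dReconstruction
    g_dec = Z.T @ G                        # gradient wrt decoder weights
    g_enc = X.T @ (G @ W_dec.T)            # gradient wrt encoder weights
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
final = mse(X, W_enc, W_dec)
```

After training, `X @ W_enc` gives the compressed 10-d representations; in SeVeN these compressed codes are what the released 10d "purified" vectors correspond to conceptually.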
### Explore Relation Space

The original (1800d) space already exhibits interesting properties, such as:
```
>>> for i in model.most_similar('cardiac__arrest'): print(i)
...
('cardiac__tamponade', 0.9013814926147461)
('perioperative__complications', 0.8934778571128845)
('heart__failure', 0.885757327079773)
('cardiopulmonary__arrest', 0.8788020610809326)
('resuscitation__arrest', 0.8765194416046143)
('hypothermia__arrest', 0.873295247554779)
('arrhythmic__death', 0.8706731796264648)
('postoperative__complications', 0.8702256679534912)
('pacing__resynchronization', 0.8638449907302856)
('hyponatremia__tamponade', 0.8615702390670776)
```
## Run Similarity Experiment
TO-DO: Write Readme
## Run Classification Experiment
TO-DO: Write Readme
## Using SeVeN vectors with Keras
Preliminary interface available at
For further details about the construction of this resource and evaluation details, please refer to the following paper:
Espinosa-Anke, L. and Schockaert, S. SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors. Coling 2018. Santa Fe, New Mexico.