
SeVeN: Semantic Vector Networks


Welcome to the home page of SeVeN, Semantic Vector Networks. SeVeN is a resource that aims to bridge the gap between the discrete relation labels found in resources like WordNet and the attributional and relational information naturally encoded in word embeddings.

SeVeN is a semantic network, but one in which each edge is itself a vector. The current version is built from a ~1M-edge graph derived from the English Wikipedia by leveraging pairwise PPMI word associations. A relation vector is then learned for each pair of words. Finally, each relation vector is compressed and 'purified' with an autoencoder architecture, reducing its size to as few as 10 dimensions.

The current release is based on the Google News word2vec embeddings [1].

  • Download the 1800d original vectors from here. Below is an output sample.
sevenlong.most_similar('roman_numerals')[0]
>>> ('arabic_numerals', 0.9820329546928406)
sevenlong.most_similar('french_revolution')[0]
>>> ('1789_revolution', 0.9796866178512573)
sevenlong.most_similar('netflix_streaming')[0]
>>> ('hulu_streaming', 0.9768953323364258)
  • Download the 10d purified vectors from here. Below is an output sample.
seven.most_similar('roman_numerals')[0]
>>> ('arabic_alphabet', 0.9980242848396301)
seven.most_similar('french_revolution')[0]
>>> ('cuban_revolution', 0.9987969398498535)
seven.most_similar('netflix_streaming')[0]
>>> ('playstation_console', 0.9971717596054077) 
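
The samples above use a gensim-style API. A minimal loading sketch, assuming the downloaded files are in word2vec text format (the file names below are hypothetical; substitute the files you downloaded):

from gensim.models import KeyedVectors

# Hypothetical file names; use the vectors downloaded above.
sevenlong = KeyedVectors.load_word2vec_format('seven_wiki_1800d.vec', binary=False)
seven = KeyedVectors.load_word2vec_format('seven_wiki_10d.vec', binary=False)

print(sevenlong.most_similar('roman_numerals')[0])
print(seven.most_similar('french_revolution')[0])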

A Working Example

Get corpus and preprocess

In this tutorial we assume that the initial corpus is a large, already tokenized text file. As an example we will use the biomedical corpus provided in the SemEval 2018 task on Hypernym Discovery, a 130M-word corpus consisting of abstracts and full papers from PubMed. This corpus has 3,239,945 lines. After cloning the repo, download the corpus and save it to data/corpora/pubmed.

Then, keep the top 10k words by running this command:

python3 src/preprocess/get_vocab.py -c data/corpora/pubmed/2A_med_pubmed_tokenized.txt -n 10000 -sw data/resources/english_stopwords.txt -o data/corpora/pubmed/

where...

-   -c corpus file
-   -n top *n* words to consider
-   -sw stopwords file
-   -o output folder where a frequency file ending in _frequencies.tsv will be saved
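
For reference, the vocabulary step boils down to counting token frequencies, dropping stopwords and keeping the n most frequent words. A minimal sketch of that logic (illustrative only; get_vocab.py is the actual implementation):

from collections import Counter

def build_vocab(corpus_path, stopwords_path, n=10000):
    # Load the stopword list and count token frequencies over the tokenized corpus.
    with open(stopwords_path, encoding='utf-8') as f:
        stopwords = set(f.read().split())
    counts = Counter()
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            counts.update(w for w in line.split() if w not in stopwords)
    # Keep the top-n words together with their frequencies.
    return counts.most_common(n)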

Split corpus

Please choose the number of splits according to the number of threads you can allocate to each of the subsequent processes. In this example we will assume 25 threads are available; given the size of the corpus, each chunk should then have about 130k lines.

Split the corpus with:

python src/preprocess/split_corpus.py -c data/corpora/pubmed/2A_med_pubmed_tokenized.txt -o data/corpora/pubmed/ -n 130000

This script produces, in the -o output folder, X chunks of the corpus, each named split_Y.txt.

Get co-occurrences

With the vocabulary built and the corpus divided into roughly equal-sized chunks, the goal is to generate a set of (P)PMI-ranked word pairs. The first step is to build a cooccurrence matrix for the words in the vocabulary. We generate a weighted cooccurrence matrix, where each cooccurrence is weighted by the distance (in tokens) between center and context word. We consider all words in the frequency vocabulary as valid center and context words. First, we iterate over the corpus and save triples of the form < center_word , context_word , cooc >, where cooc = 1/distance between center_word and context_word (a minimal sketch of this weighting is given below).

Get them with:

python3 src/preprocess/get_triples_launcher.py -c data/corpora/pubmed/ -f data/corpora/pubmed/2A_med_pubmed_tokenized.txt_frequencies.tsv -o data/corpora/pubmed/triples/

If the frequency file you are using is external, it is possible to set a cutoff vocabulary threshold with the -v flag. This script will generate two files per split (or chunk), one with raw cooccurrences and one with the weighted ones. In this example we only use the weighted ones, but feel free to experiment with other cooccurrence weighting schemes.
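
The distance-based weighting described above can be sketched as follows (illustrative only, with an assumed window size; get_triples_launcher.py is the actual implementation):

from collections import defaultdict

def weighted_cooccurrences(tokens, vocab, window=10):
    # Accumulate cooc = 1/distance for every in-vocabulary center/context pair
    # within the window.
    cooc = defaultdict(float)
    for i, center in enumerate(tokens):
        if center not in vocab:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in vocab:
                cooc[(center, tokens[j])] += 1.0 / abs(i - j)
    return cooc

# Example:
# weighted_cooccurrences('the artesian well was well proportioned'.split(),
#                        {'well', 'artesian', 'proportioned'})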

From triples to cooc matrix and pmi rankings

The set of triples generated in the previous step can be iterated over to aggregate cooccurrences into a single cooccurrence matrix.

Issue this command:

python3 src/preprocess/get_dict_matrix_from_triples.py -c data/corpora/pubmed/triples -o data/corpora/pubmed/cooc

This will iterate over all triples and generate three files in the -o folder:

-   N_vals.txt – contains two lines: sum of raw and weighted cooccurrences
-   W_raw – raw cooccurrence matrix
-   W_weight – weighted cooccurrence matrix

Then, in order to convert the (raw or weighted) cooccurrence matrix into a (P)PMI matrix, we call:

python3 -i src/preprocess/cooc2pmi.py -d data/corpora/pubmed/cooc/W_weight -n data/corpora/pubmed/cooc/N_vals.txt -o data/corpora/pubmed/pmi/ -t 500

where...

-   -d – Path of the dumped cooccurrence dictionary (raw or weighted)
-   -n – Path of the N_vals.txt file with the cooccurrence sums
-   -t – Top K context words (by PPMI) to extract for each center word
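
The conversion itself follows the standard PPMI definition. A minimal sketch, assuming N is the sum of all (weighted) cooccurrences and the marginals are row/column sums (cooc2pmi.py may differ in details such as smoothing):

import math
from collections import defaultdict

def ppmi_topk(cooc, N, top_k=500):
    # cooc: dict mapping (center, context) -> weighted cooccurrence count.
    center_tot, context_tot = defaultdict(float), defaultdict(float)
    for (w, c), v in cooc.items():
        center_tot[w] += v
        context_tot[c] += v
    ranked = defaultdict(list)
    for (w, c), v in cooc.items():
        pmi = math.log((v * N) / (center_tot[w] * context_tot[c]))
        if pmi > 0:  # keep positive PMI only
            ranked[w].append((c, pmi))
    # Top-k contexts per center word, ranked by PPMI.
    return {w: sorted(pairs, key=lambda p: -p[1])[:top_k] for w, pairs in ranked.items()}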

This step will produce a text file of |V| lines, one per center word. The first column contains the center word; each subsequent column contains a context word concatenated with its PPMI score via _, with up to K context words per line (K being the -t value). For example, for the center word well:

well versed_5.84 proportioned_4.88 muscled_4.83 mannered_4.83 faring_4.79 artesian_4.60 meshed_4.47 camouflaged_4.45 ...

It is advisable to set K to a large number so that pairs can later be filtered by minimum frequency in the corpus (e.g., the top PPMI context word for style is gangnam, which is probably fairly infrequent; we may not want such a specific association, but rather one like neo-classic or renaissance).

Generate graph edges

With the ppmi pairs extracted, the next step is to produce a file with the target word pairs for which contexts will be extracted. In addition to a minimum ppmi score threshold (or the top N context words for each center word, ranked by ppmi), we also consider overall word frequency and minimum raw cooccurrence (for an edge to be meaningful we want it to occur in at least X sentences, regardless of ppmi association strength).

Run the following command:

python3 -i src/preprocess/get_graph_nodes.py -pmi data/corpora/pubmed/pmi/ppmi_pairs_topk=500.tsv -f data/corpora/pubmed/2A_med_pubmed_tokenized.txt_frequencies.tsv -v 10000 -rawcooc data/corpora/pubmed/cooc/W_raw -minf 50 -mincooc 10 -npairs 20 -o data/corpora/pubmed/pmi

where...

-   -pmi – Path of the ppmi pairs file (output of the previous step)
-   -f – Word frequency file produced in the vocabulary step
-   -v – Vocabulary size cutoff
-   -rawcooc – Raw cooccurrence matrix
-   -minf – Minimum word frequency
-   -mincooc – Minimum raw cooccurrence count
-   -npairs – Select at most this many edges for each context word

This step will produce a text file named:

selected_pairs_minfreq_X_maxpairs_Y_mincooc_Z.tsv

in the -o folder, where X, Y and Z refer to the corresponding arguments. The total number of pairs might be lower than the expected center × context count because there may not be enough word pairs satisfying the desired thresholds.
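
The filtering logic roughly amounts to keeping, for each center word, the highest-PPMI contexts that also pass the frequency and raw cooccurrence thresholds. A rough sketch (illustrative; get_graph_nodes.py is the reference implementation):

def select_pairs(ppmi_ranked, freq, raw_cooc, min_freq=50, min_cooc=10, n_pairs=20):
    # ppmi_ranked: {center: [(context, ppmi), ...]} sorted by PPMI, descending.
    # freq: {word: corpus frequency}; raw_cooc: {(center, context): raw count}.
    selected = []
    for center, contexts in ppmi_ranked.items():
        if freq.get(center, 0) < min_freq:
            continue
        kept = 0
        for context, score in contexts:
            if kept == n_pairs:
                break
            if freq.get(context, 0) >= min_freq and raw_cooc.get((center, context), 0) >= min_cooc:
                selected.append((center, context, score))
                kept += 1
    return selected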

Get contexts

TO-DO: Change background execution to multiprocess queue

With the selected edges in place, the next step is to extract the left, mid and right contexts (and their reversed counterparts) for each word pair; an illustrative sketch is given after the argument list below.

python3 -i src/preprocess/get_contexts_launcher.py -c data/corpora/pubmed/ -p data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -mw 10 -sw 10 -o data/corpora/pubmed/contexts/

where...

-   -c – Input corpus batches (folder containing the splits)
-   -o – Output folder
-   -p – Selected pairs file (output of the previous step)
-   -mw – Mid context window size (in words)
-   -sw – Side (left and right) context window size (in words)
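
To make the three windows concrete, here is a toy sketch of how the contexts of a pair could be carved out of a single sentence, assuming the pair occurs in that order (the real script also extracts the reversed direction and works over the corpus batches):

def pair_contexts(tokens, w1, w2, mid_w=10, side_w=10):
    # Returns the (left, mid, right) token windows around the first occurrence
    # of the pair w1 ... w2 in the sentence, or None if the pair is absent or
    # the words are too far apart.
    if w1 not in tokens or w2 not in tokens:
        return None
    i, j = tokens.index(w1), tokens.index(w2)
    if i > j or j - i - 1 > mid_w:
        return None
    left = tokens[max(0, i - side_w):i]
    mid = tokens[i + 1:j]
    right = tokens[j + 1:j + 1 + side_w]
    return left, mid, right

# pair_contexts('sudden cardiac arrest requires immediate resuscitation'.split(),
#               'cardiac', 'arrest')
# -> (['sudden'], [], ['requires', 'immediate', 'resuscitation'])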

As a preprocessing step before vectorization, run:

python3 -i src/preprocess/collect_contexts.py -d data/corpora/pubmed/contexts/ -r data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/contexts/contexts.tsv

Vectorize Relations

TO-DO: Add weights

The idea is to build a vector of 6*d dimensions, where d is the dimensionality of the pretrained word embeddings of choice. We do this by running:

python3 -i src/preprocess/vectorize.py -r data/corpora/pubmed/contexts/contexts.tsv -wv GoogleNews-vectors-negative300.bin -p data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/vectors

The script iterates over the large relation file twice: first to build a relation-to-id mapping, and second to fill an initially zero matrix with the corresponding values. The output is a large vector space model, which should be manageable with the gensim library.
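
One plausible reading of the 6*d construction, consistent with the left/mid/right (and reversed) contexts collected above although the exact averaging in vectorize.py may differ, is to average the word embeddings inside each of the six context bags and concatenate the results:

import numpy as np

def relation_vector(context_bags, wv, d=300):
    # context_bags: six lists of tokens (left, mid, right for the pair and for
    # the reversed pair). wv: pretrained word embeddings (e.g. gensim KeyedVectors).
    # Returns a single 6*d concatenated relation vector.
    parts = []
    for bag in context_bags:
        vecs = [wv[w] for w in bag if w in wv]
        parts.append(np.mean(vecs, axis=0) if vecs else np.zeros(d))
    return np.concatenate(parts)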

Binarize

To speed up loading, you may want to binarize the text vector files. You can do this with convertvec, either manually or by running:

python3 src/preprocess/vectors2bin.py -d data/corpora/pubmed/vectors/

Autoencode Relation Vectors

In order to 'purify' and reduce the dimensionality of the original 6*d relation vectors, we run them through an autoencoder architecture. The script takes as input a relational vector space model and produces, for different dimensionalities, compressed representations after running them through different autoencoder architectures. The architecture used in the COLING paper produces the models ending in _forget.vec. An additional autoencoded model is also generated for each hidden dimension (ending in _regular.vec), which is a vanilla autoencoder where input and reconstructed output are the same.

You may run the autoencoder with:

python3 -i src/preprocess/autoencoder.py -rv data/corpora/pubmed/vectors/relation_vectors.vec.bin -wv ../resources/embeddings/GoogleNews-vectors-negative300.bin.bz -rf data/corpora/pubmed/pmi/selected_pairs_minfreq_50_maxpairs_20_mincooc_10.tsv -o data/corpora/pubmed/vectors/output-folder

Refer to the script to change the hyperparameters. You may run the vectors2bin.py script again to binarize the resulting models.
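
For reference, the models ending in _regular.vec correspond to a vanilla autoencoder. A minimal Keras sketch of that variant (layer sizes, activations and training settings are assumptions, and the _forget architecture from the paper adds an extra objective not shown here):

import numpy as np
from tensorflow.keras import layers, models

def train_vanilla_autoencoder(relation_matrix, hidden_dim=10, epochs=10):
    # relation_matrix: (n_pairs, 6*d) array of relation vectors.
    input_dim = relation_matrix.shape[1]
    inp = layers.Input(shape=(input_dim,))
    code = layers.Dense(hidden_dim, activation='tanh')(inp)   # compressed relation vector
    out = layers.Dense(input_dim, activation='linear')(code)  # reconstruction
    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, code)
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(relation_matrix, relation_matrix, epochs=epochs, batch_size=128, verbose=0)
    # Return the compressed (e.g. 10d) representations.
    return encoder.predict(relation_matrix)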

Explore Relation Space

The original (1800d) space already exhibits interesting properties, such as:

>>> for i in model.most_similar('cardiac__arrest'): print(i)
... 
('cardiac__tamponade', 0.9013814926147461)
('perioperative__complications', 0.8934778571128845)
('heart__failure', 0.885757327079773)
('cardiopulmonary__arrest', 0.8788020610809326)
('resuscitation__arrest', 0.8765194416046143)
('hypothermia__arrest', 0.873295247554779)
('arrhythmic__death', 0.8706731796264648)
('postoperative__complications', 0.8702256679534912)
('pacing__resynchronization', 0.8638449907302856)
('hyponatremia__tamponade', 0.8615702390670776)

Run Similarity Experiment

TO-DO: Write Readme

Run Classification Experiment

TO-DO: Write Readme

Using SeVeN vectors with Keras

Preliminary interface available at seven.py and seven_main.py.
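
As a rough illustration of the general idea (not the actual seven.py API), the relation vectors can be loaded with gensim and frozen into a Keras Embedding layer keyed by pair id:

from gensim.models import KeyedVectors
from tensorflow.keras import layers

# Hypothetical file name; point this at the (auto)encoded relation vectors produced above.
rel = KeyedVectors.load_word2vec_format('relation_vectors_10d_forget.vec', binary=False)
weights = rel.vectors  # one row per word pair

# Frozen embedding layer mapping a pair id (row index in `weights`) to its relation vector.
rel_embedding = layers.Embedding(input_dim=weights.shape[0],
                                 output_dim=weights.shape[1],
                                 weights=[weights],
                                 trainable=False)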


For further details about the construction of this resource and evaluation details, please refer to the following paper:

Espinosa-Anke, L. and Schockaert, S. SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors. COLING 2018, Santa Fe, New Mexico.

[1] https://code.google.com/archive/p/word2vec/