
SeVeN: Semantic Vector Networks


Welcome to the home page of SeVeN, Semantic Vector Networks. SeVeN is a resource that aims to bridge the gap between the discrete relation labels found in resources like WordNet and the attributional and relational information naturally encoded in word embeddings.

SeVeN is a semantic network in which each edge is itself a vector. The current version is derived by building a ~1M-edge graph from the English Wikipedia, leveraging pairwise PPMI word associations. A relation vector is then learned for each pair of words. Finally, each relation vector is compressed and purified with an autoencoder architecture, reducing its size to as few as 10 dimensions.

The current release is based on the Google News word2vec embeddings [1].

  • Download the 1800d original vectors from here. Below is an output sample.
sevenlong.most_similar('roman_numerals')[0]
>>> ('arabic_numerals', 0.9820329546928406)
sevenlong.most_similar('french_revolution')[0]
>>> ('1789_revolution', 0.9796866178512573)
sevenlong.most_similar('netflix_streaming')[0]
>>> ('hulu_streaming', 0.9768953323364258)
  • Download the 10d purified vectors from here. Below is an output sample.
seven.most_similar('roman_numerals')[0]
>>> ('arabic_alphabet', 0.9980242848396301)
seven.most_similar('french_revolution')[0]
>>> ('cuban_revolution', 0.9987969398498535)
seven.most_similar('netflix_streaming')[0]
>>> ('playstation_console', 0.9971717596054077) 
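
The samples above use the gensim KeyedVectors API. A minimal loading sketch is shown below; it assumes the released files are in word2vec text format and uses placeholder file names for the downloads.

# Minimal loading sketch (assumes word2vec text format; file names are placeholders
# for the downloaded 1800d and 10d files).
from gensim.models import KeyedVectors

sevenlong = KeyedVectors.load_word2vec_format('seven_1800d.txt', binary=False)  # 1800d original vectors
seven = KeyedVectors.load_word2vec_format('seven_10d.txt', binary=False)        # 10d purified vectors

print(sevenlong.most_similar('roman_numerals')[0])
print(seven.most_similar('french_revolution')[0])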

A Working Example

In this tutorial we assume as initial corpus a large text file that has already been tokenized. As an example we will use the biomedical corpus provided in the SemEval 2018 task on Hypernym Discovery, a 130M-word corpus consisting of abstracts and full papers from PubMed. This corpus has 3,239,945 lines.

Get cooc matrix

python3 src/preprocess/_get_coocs.py -c corpus_file -b build_folder -v 10000 -win 10 -sw english_stopwords.txt 

where...

-   -c corpus file
-   -b build folder
-   -v vocabulary of the most frequent words to be considered
-   -win window size (left and right)
-   -sw optional argument of stopwords file (one per line)

This step generates a number of files in the build folder, e.g. the raw and weighted cooc matrices, "triples" files (center, context, cooc score), etc.
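
For reference, the windowed counting performed in this step can be sketched as follows. This is a simplified illustration, not the actual script: it assumes a whitespace-tokenized corpus and omits weighting, stopword filtering and the "triples" output.

# Simplified sketch of window-based co-occurrence counting (illustrative only).
from collections import Counter

def count_coocs(corpus_path, vocab, win=10):
    coocs = Counter()
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            tokens = line.split()
            for i, center in enumerate(tokens):
                if center not in vocab:
                    continue
                lo, hi = max(0, i - win), min(len(tokens), i + win + 1)
                for j in range(lo, hi):
                    if j != i and tokens[j] in vocab:
                        coocs[(center, tokens[j])] += 1
    return coocs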

Compute PMI

python3 src/preprocess/_cooc2pmi.py -d build_folder/weighted_cooc_matrix.pkl -rd build_folder/raw_cooc_matrix.pkl -n build_folder/N_vals.txt -b build_folder -t 100 -wid build_folder/words2ids.txt -mc 100

where...

-   -d pickled dictionary with weighted coocs
-   -rd pickled dictionary with raw coocs
-   -n text file containing the sum of the weighted and raw cooc matrices
-   -b build dir
-   -t top k context words (sorted by PMI score)
-   -wid words to id mapping file
-   -mc minimum co-occurrence (to filter out relations with high PMI but low corpus evidence)

This step produces a file ppmi_pairs_topk=n.tsv, where n is the -t argument. It also produces a _filtered.txt file, in which word pairs with a cooc count lower than the threshold set at -mc are discarded. Note that the current version applies a smoothing of 0.75 to the context word distribution.
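
As a reference for this step, below is a minimal PPMI sketch that includes the 0.75 smoothing of the context word distribution mentioned above. It is illustrative only: it operates on a plain {(center, context): count} dictionary and omits the weighted counts and the -mc filtering.

# Minimal PPMI sketch with context-distribution smoothing (alpha = 0.75).
import math
from collections import Counter

def ppmi(coocs, alpha=0.75):
    center_totals, context_totals = Counter(), Counter()
    for (c, ctx), n in coocs.items():
        center_totals[c] += n
        context_totals[ctx] += n
    total = sum(coocs.values())
    smoothed_total = sum(v ** alpha for v in context_totals.values())
    scores = {}
    for (c, ctx), n in coocs.items():
        p_joint = n / total
        p_center = center_totals[c] / total
        p_context = context_totals[ctx] ** alpha / smoothed_total
        scores[(c, ctx)] = max(0.0, math.log2(p_joint / (p_center * p_context)))
    return scores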

Get contexts

This step acquires contexts for six different positions (left, mid and right, plus the same three with the pair reversed) for each relation vector.

python3 src/preprocess/_get_contexts.py -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir -mw 5 -sw 5

where...

-   -p selected pairs file
-   -b build dir
-   -mw mid word window
-   -sw side (left and right) word window
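
For intuition, the kind of contexts collected for a pair can be illustrated with the hypothetical helper below (not the actual script): for a pair (w1, w2) found in a sentence, the words before w1, between w1 and w2, and after w2 are kept, and the same is done with the pair reversed, yielding the six positions.

# Illustrative context extraction for one ordered pair (w1, w2) in one sentence.
# Applying it to (w1, w2) and (w2, w1) gives the six context positions.
def extract_contexts(tokens, w1, w2, mid_window=5, side_window=5):
    if w1 not in tokens or w2 not in tokens:
        return None
    i, j = tokens.index(w1), tokens.index(w2)
    if i >= j:
        return None  # this ordering is handled by the reversed pair
    left = tokens[max(0, i - side_window):i]
    mid = tokens[i + 1:j][:mid_window]
    right = tokens[j + 1:j + 1 + side_window]
    return left, mid, right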

Vectorize

Vectorize all contexts into a vector space model of dimensionality 6*d, where d is the size of the pretrained embedding of choice.

python3 src/preprocess/_vectorize.py -wv word_vectors -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir

where...

-   -wv word vectors file
-   -p selected pairs file
-   -b build dir

This step produces a vector file named relation_vectors__pretrainedwv=[pretrainedfile].
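
A minimal sketch of this vectorization is given below. The exact composition used by the script is an assumption here: each of the six context bags is averaged over the pretrained word vectors and the six averages are concatenated into one 6*d relation vector.

# Illustrative vectorization: average the pretrained vectors of the words in each
# of the six context bags and concatenate the averages (6 * d dimensions).
import numpy as np

def relation_vector(context_bags, word_vectors, dim):
    parts = []
    for bag in context_bags:  # six bags: left/mid/right plus their reversed counterparts
        vecs = [word_vectors[w] for w in bag if w in word_vectors]
        parts.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.concatenate(parts)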

Autoencode Relation Vectors

In order to 'purify' and reduce the dimensionality of the original 6*d relation vectors, we run them through an autoencoder architecture. The script takes as input a relational vector space model and produces compressed representations for different dimensionalities, after running them through different autoencoder architectures. The architecture used in the COLING paper produces the models ending in _forget.vec. An additional autoencoded model is also generated for each hidden dimension (ending in _regular.vec); this is a vanilla autoencoder in which input and reconstructed output are the same.

python3 src/preprocess/_autoencoder.py -rv relation_vectors -wv word_vectors -b build_dir

where...

-   -rv relation vectors file
-   -wv word vectors file
-   -b build dir

This step will produce models of different dimensionalities for a vanilla autoencoder and for the (more relational) one described in the paper.
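
For intuition, the vanilla variant (the _regular.vec models) can be sketched as a standard autoencoder, as below. Layer sizes and training details are assumptions rather than the script's exact setup, and the relational _forget.vec architecture from the paper is not reproduced here.

# Simplified vanilla-autoencoder sketch: input and reconstructed output are the same
# relation vector; the hidden code z is the compressed representation that is kept.
import torch
import torch.nn as nn

class RelationAutoencoder(nn.Module):
    def __init__(self, input_dim=1800, hidden_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = RelationAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # train by minimizing reconstruction error, then keep z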

Explore Relation Space

The original 1800d space already exhibits interesting properties, such as:

>>> for i in model.most_similar('cardiac__arrest'): print(i)
... 
('cardiac__tamponade', 0.9013814926147461)
('perioperative__complications', 0.8934778571128845)
('heart__failure', 0.885757327079773)
('cardiopulmonary__arrest', 0.8788020610809326)
('resuscitation__arrest', 0.8765194416046143)
('hypothermia__arrest', 0.873295247554779)
('arrhythmic__death', 0.8706731796264648)
('postoperative__complications', 0.8702256679534912)
('pacing__resynchronization', 0.8638449907302856)
('hyponatremia__tamponade', 0.8615702390670776)
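
Pair keys join the two words with a double underscore (e.g. 'cardiac__arrest'), so any standard KeyedVectors query applies; for instance, assuming the space is loaded as a gensim model as above:

>>> model.similarity('cardiac__arrest', 'heart__failure')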

Run Similarity Experiment

TO-DO: Write Readme

Run Classification Experiment

TO-DO: Write Readme


For further details about the construction and evaluation of this resource, please refer to the following paper:

Espinosa-Anke, L. and Schockaert, S. SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors. COLING 2018. Santa Fe, New Mexico.

[1] https://code.google.com/archive/p/word2vec/


AI Wales Meetup - Download SeVeN's TensorBoard files from here: https://drive.google.com/drive/folders/1lvtydA54XItEL1OJ38e2wvpPUT-2XAO6?usp=sharing