SeVeN: Semantic Vector Networks
Welcome to the home page of SeVeN, Semantic Vector Networks. SeVeN is a resource that aims at bridging the gap between discrete relation labels that may be found in resources like WordNet and attributional and relational information naturally encoded in word embeddings.
SeVeN is a semantic network in which each edge is itself a vector. The current version is a ~1M-edge graph built from the English Wikipedia by leveraging pairwise PPMI word associations.
Then, a relation vector is learned for each pair of words. Finally, each relation vector is compressed and purified with an autoencoder architecture, reducing its size to as few as 10 dimensions.
The current release is based on the Google News word2vec embeddings.
- Download the 1800d original vectors from here. Below is an output sample.
sevenlong.most_similar('roman_numerals')
>>> ('arabic_numerals', 0.9820329546928406)
sevenlong.most_similar('french_revolution')
>>> ('1789_revolution', 0.9796866178512573)
sevenlong.most_similar('netflix_streaming')
>>> ('hulu_streaming', 0.9768953323364258)
- Download the 10d purified vectors from here. Below is an output sample.
seven.most_similar('roman_numerals')
>>> ('arabic_alphabet', 0.9980242848396301)
seven.most_similar('french_revolution')
>>> ('cuban_revolution', 0.9987969398498535)
seven.most_similar('netflix_streaming')
>>> ('playstation_console', 0.9971717596054077)
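The samples above are nearest-neighbour queries over relation vectors. As a rough illustration of what such a query computes, here is a minimal sketch that ranks entries by cosine similarity. The toy vectors and the `most_similar` helper are hypothetical and only mimic the shape of the output shown above; they are not the released vectors or the loading code used for them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, key, topn=3):
    """Return the topn (name, score) pairs closest to vectors[key]."""
    query = vectors[key]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Toy 3-d relation vectors (made up for illustration only).
toy_vectors = {
    "roman_numerals": [0.9, 0.1, 0.0],
    "arabic_numerals": [0.8, 0.2, 0.1],
    "netflix_streaming": [0.0, 0.9, 0.4],
    "hulu_streaming": [0.1, 0.8, 0.5],
}

neighbours = most_similar(toy_vectors, "roman_numerals", topn=2)
for name, score in neighbours:
    print(name, round(score, 4))
```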
A Working Example
In this tutorial we assume that the initial corpus is a large text file that is already tokenized. As an example, we use the biomedical corpus provided in the SemEval 2018 task on Hypernym Discovery, a 130M-word corpus consisting of abstracts and full papers from PubMed. This corpus has 3,239,945 lines.
Get cooc matrix
python3 src/preprocess/_get_coocs.py -c corpus_file -b build_folder -v 10000 -win 10 -sw english_stopwords.txt
- -c: corpus file
- -b: build folder
- -v: vocabulary of the most frequent words to be considered
- -win: window size (left and right)
- -sw: optional argument of stopwords file (one per line)
This step generates a number of files in the build folder, e.g. raw and weighted cooc matrix, "triples" files (center, context, cooc score), etc.
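To make the step above more concrete, the following is a minimal sketch of windowed co-occurrence counting. It is not the code of _get_coocs.py: the `count_coocs` function is hypothetical, and weighting each co-occurrence by 1/distance is an assumption about what "weighted" means here.

```python
from collections import Counter

def count_coocs(lines, vocab, window=10, stopwords=frozenset()):
    """Count raw and distance-weighted co-occurrences within a symmetric window."""
    raw = Counter()
    weighted = Counter()

    def keep(token):
        return token in vocab and token not in stopwords

    for line in lines:
        tokens = line.split()
        for i, center in enumerate(tokens):
            if not keep(center):
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i or not keep(tokens[j]):
                    continue
                pair = (center, tokens[j])
                raw[pair] += 1
                # Assumption: closer context words contribute more.
                weighted[pair] += 1.0 / abs(i - j)
    return raw, weighted

raw, weighted = count_coocs(
    ["cardiac arrest after cardiac surgery"],
    vocab={"cardiac", "arrest", "surgery"},
    window=2,
    stopwords={"after"},
)
print(raw[("cardiac", "arrest")], weighted[("cardiac", "arrest")])
```

Stopwords are skipped but still occupy a position in the window, so distances reflect the original token positions.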
python3 src/preprocess/_cooc2pmi.py -d build_folder/weighted_cooc_matrix.pkl -rd build_folder/raw_cooc_matrix.pkl -n build_folder/N_vals.txt -b build_folder -t 100 -wid build_folder/words2ids.txt -mc 100
- -d: pickled dictionary with weighted coocs
- -rd: pickled dictionary with raw coocs
- -n: text file containing the sum of the weighted and raw cooc matrices
- -b: build dir
- -t: top k context words (sorted by PMI score)
- -wid: words to id mapping file
- -mc: minimum co-occurrence (to filter out relations with high PMI but low corpus evidence)
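This step produces a pairs file whose name encodes n, where n is the -t argument. It also produces a _filtered.txt file, where word pairs with a lower cooc than the threshold set with -mc are discarded (for -t 100, the commands below refer to it as ppmi_pairs_topk=100.tsv_filtered.txt).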
Note that the current version adds a smoothing of 0.75 to the context word distribution.
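For readers unfamiliar with this kind of smoothing, here is a minimal sketch of PPMI with a context distribution raised to the power 0.75, followed by top-k selection and a minimum co-occurrence filter. It assumes the common formulation in which context counts are exponentiated before normalisation; the `ppmi_pairs` function and its parameter names are hypothetical and may differ from what _cooc2pmi.py actually does.

```python
import math
from collections import Counter

def ppmi_pairs(coocs, alpha=0.75, top_k=100, min_cooc=1):
    """Return, per word, its top_k contexts ranked by positive PMI."""
    total = sum(coocs.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), count in coocs.items():
        word_counts[word] += count
        context_counts[context] += count
    # Smoothed context distribution: count(c)^alpha / sum over c' of count(c')^alpha
    smoothed_total = sum(c ** alpha for c in context_counts.values())

    by_word = {}
    for (word, context), count in coocs.items():
        if count < min_cooc:
            continue  # high PMI but too little corpus evidence
        p_joint = count / total
        p_word = word_counts[word] / total
        p_context = context_counts[context] ** alpha / smoothed_total
        score = max(0.0, math.log(p_joint / (p_word * p_context)))
        if score > 0.0:
            by_word.setdefault(word, []).append((context, score))
    return {
        word: sorted(ctxs, key=lambda item: item[1], reverse=True)[:top_k]
        for word, ctxs in by_word.items()
    }

# Toy counts, made up for illustration.
toy_coocs = {
    ("cardiac", "arrest"): 8,
    ("cardiac", "the"): 2,
    ("heart", "the"): 8,
    ("heart", "failure"): 2,
}
pairs = ppmi_pairs(toy_coocs)
filtered = ppmi_pairs(toy_coocs, min_cooc=5)
print(pairs)
print(filtered)
```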
This step acquires contexts for six different positions (left, mid and right, plus the same three for the reversed pair) for each relation vector.
python3 src/preprocess/_get_contexts.py -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir -mw 5 -sw 5
- -p: selected pairs file
- -b: build dir
- -mw: mid word window
- -sw: side (left and right) word window
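The following is a minimal sketch of how left, mid and right contexts could be collected for a word pair. It rests on assumptions: the mid window is read as the maximum number of tokens allowed between the two words, and the reversed positions are obtained by swapping the order of the pair. The `pair_contexts` function is hypothetical and not taken from _get_contexts.py.

```python
def pair_contexts(tokens, first, second, mid_window=5, side_window=5):
    """Collect (left, mid, right) contexts where `first` precedes `second`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # `second` may be at most mid_window tokens away from `first`.
        for j in range(i + 1, min(len(tokens), i + mid_window + 2)):
            if tokens[j] == second:
                left = tokens[max(0, i - side_window):i]
                mid = tokens[i + 1:j]
                right = tokens[j + 1:j + 1 + side_window]
                contexts.append((left, mid, right))
                break
    return contexts

tokens = "patients with cardiac arrest require immediate resuscitation".split()
# Three positions for the pair in its given order ...
forward = pair_contexts(tokens, "cardiac", "resuscitation", side_window=2)
# ... and three more for the reversed order, giving six in total.
backward = pair_contexts(tokens, "resuscitation", "cardiac", side_window=2)
print(forward)
print(backward)
```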
Vectorize all contexts into a vector space model of dimensionality d*6, where d is the size of the pretrained embedding of choice.
python3 src/preprocess/_vectorize.py -wv word_vectors -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir
- -wv: word vectors file
- -p: selected pairs file
- -b: build dir
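As an illustration of where the d*6 dimensionality comes from, here is a minimal sketch that averages the word vectors found in each of the six context slots and concatenates the results. Averaging is an assumption made for this example, and the `relation_vector` helper is hypothetical; the actual script may combine the contexts differently.

```python
def average(vectors, dim):
    """Component-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def relation_vector(slots, word_vectors, dim):
    """Concatenate one averaged vector per context slot (six slots -> dim * 6)."""
    relation = []
    for slot_tokens in slots:
        known = [word_vectors[t] for t in slot_tokens if t in word_vectors]
        relation.extend(average(known, dim))
    return relation

# Toy 2-d embeddings (d = 2), made up for illustration.
toy_word_vectors = {
    "with": [1.0, 0.0],
    "arrest": [0.0, 1.0],
    "require": [1.0, 1.0],
}
# Six slots: left, mid, right, then the three reversed positions.
slots = [["patients", "with"], ["arrest", "require"], [], [], [], []]
relation = relation_vector(slots, toy_word_vectors, dim=2)
print(len(relation), relation[:4])
```

With 300-dimensional embeddings, the same concatenation gives the 1800 dimensions of the original vectors.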
This step produces a vector file named
Autoencode Relation Vectors
In order to 'purify' and reduce the dimensionality of the original d*6 relation vectors, we run them through an autoencoder architecture. The script takes as input a relational vector space model and produces, for different dimensionalities, compressed representations after running them through different autoencoder architectures. The architecture used in the Coling paper produces the models ending in _forget.vec. An additional autoencoded model is also generated for each hidden dimension (ending in _regular.vec), which is a vanilla autoencoder where input and reconstructed output are the same.
python3 src/preprocess/_autoencoder.py -rv relation_vectors -wv word_vectors -b build_dir
- -rv: relation vectors file
- -wv: word vectors file
- -b: build dir
This step will produce models of different dimensionalities for both a vanilla autoencoder and the (more relational) one described in the paper.
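To illustrate only the vanilla variant, where the target is the input itself, here is a minimal sketch of a linear autoencoder trained with plain stochastic gradient descent on toy data. It uses the standard library only and is not the architecture or training setup of _autoencoder.py; the `train_autoencoder` function, the toy data and all hyperparameters are assumptions. The more relational architecture from the paper is not reproduced here.

```python
import random

def train_autoencoder(data, hidden_dim, epochs=300, lr=0.1, seed=0):
    """Train a linear autoencoder; return the encoder and the loss per epoch."""
    rng = random.Random(seed)
    in_dim = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(in_dim)]

    def encode(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in enc]

    def decode(h):
        return [sum(w * hi for w, hi in zip(row, h)) for row in dec]

    def loss():
        total = 0.0
        for x in data:
            y = decode(encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

    history = [loss()]
    for _ in range(epochs):
        for x in data:
            h = encode(x)
            err = [yi - xi for yi, xi in zip(decode(h), x)]
            # Gradient w.r.t. the hidden code, taken before dec is updated.
            grad_h = [sum(err[i] * dec[i][j] for i in range(in_dim))
                      for j in range(hidden_dim)]
            for i in range(in_dim):
                for j in range(hidden_dim):
                    dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden_dim):
                for k in range(in_dim):
                    enc[j][k] -= lr * grad_h[j] * x[k]
        history.append(loss())
    return encode, history

# Toy 4-d "relation vectors" that lie in a 2-d subspace.
data = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
encode, history = train_autoencoder(data, hidden_dim=2)
code = encode(data[0])
print(len(code), history[0], history[-1])
```

The compressed representation of each vector is the hidden code returned by `encode`, which plays the role of the low-dimensional relation vectors in this toy setting.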
Explore Relation Space
The original (1800d) space already exhibits interesting properties, such as:
>>> for i in model.most_similar('cardiac__arrest'): print(i)
...
('cardiac__tamponade', 0.9013814926147461)
('perioperative__complications', 0.8934778571128845)
('heart__failure', 0.885757327079773)
('cardiopulmonary__arrest', 0.8788020610809326)
('resuscitation__arrest', 0.8765194416046143)
('hypothermia__arrest', 0.873295247554779)
('arrhythmic__death', 0.8706731796264648)
('postoperative__complications', 0.8702256679534912)
('pacing__resynchronization', 0.8638449907302856)
('hyponatremia__tamponade', 0.8615702390670776)
Run Similarity Experiment
TO-DO: Write Readme
Run Classification Experiment
TO-DO: Write Readme
For further details about the construction of this resource and evaluation details, please refer to the following paper:
Espinosa-Anke, L. and Schockaert, S. SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors. Coling 2018. Santa Fe. New Mexico.
AI Wales Meetup - Download SeVeN's tensorboard files from here: https://drive.google.com/drive/folders/1lvtydA54XItEL1OJ38e2wvpPUT-2XAO6?usp=sharing