
Source code for blog post "Skip-gram negative sampling as (unshifted) PMI matrix factorization".

Files

  • exp_word2vec.py: main file implementing the experiment
  • word2vec-XXX: word2vec implementations
  • results.ods: spreadsheet and chart of the results

Reproducibility notes

0. Dependencies

The experiment requires these libraries:

  • numpy
  • gensim
  • nltk (with WordNet 3.0)
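
Before starting a multi-day run, it can help to confirm that the dependencies above are importable. The following is a minimal sketch, assuming Python 3; the helper names missing_packages and wordnet_available are hypothetical and not part of this repository. It only checks that a WordNet corpus can be located by nltk, not that it is version 3.0.

```python
import importlib.util

# Packages listed under "Dependencies" above.
REQUIRED = ("numpy", "gensim", "nltk")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def wordnet_available():
    """Return True if nltk is installed and a WordNet corpus can be located.

    Assumption: the corpus lives under nltk's usual "corpora/wordnet" resource.
    The WordNet version is not checked.
    """
    if importlib.util.find_spec("nltk") is None:
        return False
    import nltk

    try:
        nltk.data.find("corpora/wordnet")
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    elif not wordnet_available():
        print("nltk is installed, but the WordNet corpus was not found")
    else:
        print("All dependencies found")
```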

1. Obtain data and update paths

Download, extract, and concatenate the 1 Billion Word benchmark:

wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar -xzvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
cat 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/* > training.txt
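
If cat is not available (for example on Windows), the concatenation step can be approximated in Python. This is a sketch under assumptions: the function name concat_files is hypothetical, and files are joined in sorted name order, which should match the shell glob order for the benchmark's shard names but is not guaranteed to in every locale.

```python
import glob
import os
import shutil


def concat_files(src_dir, dst_path):
    """Concatenate every regular file in src_dir into dst_path.

    Files are processed in sorted name order and copied as raw bytes,
    so no decoding or newline translation takes place.
    Returns the number of files that were concatenated.
    """
    paths = sorted(
        p for p in glob.glob(os.path.join(src_dir, "*")) if os.path.isfile(p)
    )
    with open(dst_path, "wb") as dst:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, dst)
    return len(paths)
```

For the benchmark above, the call would be concat_files("1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled", "training.txt"). The output file should live outside the source directory so that it is not picked up as an input.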

Open exp_word2vec.py and set the variable train_ds_path to the path of the newly created file.

2. Run the experiment

Run ./exp_all.sh. It may take 1-2 days to finish, depending on your machine. The first results should look like this:

Wed Oct 14 16:01:56 CEST 2015

***
Trying k=5, type=original

Starting training using file /home/minhle/ssd/training.txt
Vocab size: 552403
Words in train file: 796188544
Alpha: 0.000002  Progress: 100.00%  Words/thread/sec: 77.36k
Latest scores: SimLex=0.374, MEN=0.717, WordSim=0.659
Starting training using file /home/minhle/ssd/training.txt
Vocab size: 552403
Words in train file: 796188544
Alpha: 0.000002  Progress: 100.00%  Words/thread/sec: 80.68k
Latest scores: SimLex=0.372, MEN=0.715, WordSim=0.658
Starting training using file /home/minhle/ssd/training.txt
Vocab size: 552403
Words in train file: 796188544
Alpha: 0.000002  Progress: 100.00%  Words/thread/sec: 78.31k
Latest scores: SimLex=0.372, MEN=0.716, WordSim=0.658
Starting training using file /home/minhle/ssd/training.txt
Vocab size: 552403
Words in train file: 796188544
Alpha: 0.000002  Progress: 100.00%  Words/thread/sec: 81.21k
Latest scores: SimLex=0.370, MEN=0.713, WordSim=0.663
Starting training using file /home/minhle/ssd/training.txt
Vocab size: 552403
Words in train file: 796188544
Alpha: 0.000002  Progress: 100.00%  Words/thread/sec: 78.23k
Latest scores: SimLex=0.372, MEN=0.715, WordSim=0.660

Results for k=5, type=original:
SimLex  MEN WordSim
Raw scores:
0.374   0.717   0.659
0.372   0.715   0.658
0.372   0.716   0.658
0.370   0.713   0.663
0.372   0.715   0.660