
Requirements

pip install -r requirements.txt

Get the RCV1 dataset

wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt1.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt2.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt3.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_train.dat.gz

Get the term dictionary

wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a14-term-dictionary/stem.termid.idf.map.txt

Get the topic categories

wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a08-topic-qrels/rcv1-v2.topics.qrels.gz

Gunzip everything

gunzip *.gz

Generate dictionary.txt

./get_dictionary.sh

Parse data to vectors

./get_data.sh

Or write all parsed data to a single file

./get_data.sh outputfile
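
For orientation, the parsing step can be sketched in Python. This is not the code behind get_data.sh; it is a minimal illustration that assumes the token files use the layout ".I <doc_id>", then ".W", then lines of stemmed tokens. The function names read_token_docs and to_vector are hypothetical.

```python
from collections import Counter


def read_token_docs(lines):
    """Yield (doc_id, Counter of tokens) for each document.

    Assumed layout: a ".I <doc_id>" line starts a document, a ".W" line
    precedes its tokens, and token lines follow until the next ".I".
    """
    doc_id, counts = None, Counter()
    for line in lines:
        line = line.strip()
        if line.startswith(".I"):
            # A new document begins; emit the previous one first.
            if doc_id is not None:
                yield doc_id, counts
            doc_id, counts = line.split()[1], Counter()
        elif line and line != ".W":
            counts.update(line.split())
    # Emit the last document in the file.
    if doc_id is not None:
        yield doc_id, counts


def to_vector(counts, vocabulary):
    """Map token counts onto a fixed vocabulary order (hypothetical helper)."""
    return [counts.get(term, 0) for term in vocabulary]
```

Passing an open file handle as lines streams the documents one at a time, so the full corpus never has to fit in memory.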

Get 10000 random vectors for testing

sort -R file | head -n 10000 > output
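
Note that GNU sort -R orders lines by a random hash of their content, so identical lines end up next to each other. If that matters, a uniform sample can also be drawn in Python. The sketch below uses reservoir sampling; the function name sample_lines and the seed parameter are illustrative assumptions, not part of this repository.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k items chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Possible usage (file names as in the shell command above):
# with open("file") as src, open("output", "w") as dst:
#     dst.writelines(sample_lines(src, 10000))
```

The input is read once and only k lines are held in memory. The returned lines are a uniform sample but are not shuffled, which should not matter for a held-out test set.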

Train the RSM model on the dataset

python rsm.py train model dictionary
options: -H hiddens  number of hidden variables (default = 50)
         -N epochs   number of learning epochs (default = 1)
         -n iter     iterations of contrastive divergence (default = 1)
         -b batch    batch size (default = 1)
         -r rate     learning rate (default = 0.001)
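
To make the command-line shape concrete, here is one way the options listed above could be declared with argparse. This is a hedged sketch, not the actual parser in rsm.py: the dest names are invented, and it assumes that train, model, and dictionary are three positional file arguments.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="rsm.py")
    # Positional arguments, in the order shown in the usage line (assumed).
    parser.add_argument("train")
    parser.add_argument("model")
    parser.add_argument("dictionary")
    # Options with the defaults listed in the README.
    parser.add_argument("-H", dest="hiddens", type=int, default=50)
    parser.add_argument("-N", dest="epochs", type=int, default=1)
    parser.add_argument("-n", dest="iter", type=int, default=1)
    parser.add_argument("-b", dest="batch", type=int, default=1)
    parser.add_argument("-r", dest="rate", type=float, default=0.001)
    return parser
```

For example, parsing ["train.txt", "model.out", "dictionary.txt", "-H", "100"] with this parser would set hiddens to 100 and leave every other option at its default.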

Generate Precision/Recall chart

python stats.py rcv1-v2.topics.qrels model_parsed parsed.new.txt dictionary.txt
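
As a rough illustration of what a precision/recall chart measures, the sketch below computes a precision/recall curve for one query document. It is not the code in stats.py. It assumes the qrels file has lines of the form "topic doc_id 1", and it assumes a retrieved document counts as relevant when it shares at least one topic with the query. Both function names are hypothetical.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map doc_id -> set of topic codes, assuming 'topic doc_id 1' lines."""
    topics = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            topics[parts[1]].add(parts[0])
    return topics


def precision_recall(ranked_ids, query_id, topics):
    """Return [(precision, recall)] after each retrieved document."""
    query_topics = topics.get(query_id, set())
    # Every other document sharing a topic with the query is relevant.
    relevant_total = sum(
        1 for doc, labels in topics.items()
        if doc != query_id and labels & query_topics
    )
    curve, hits = [], 0
    for rank, doc in enumerate(ranked_ids, start=1):
        if topics.get(doc, set()) & query_topics:
            hits += 1
        recall = hits / relevant_total if relevant_total else 0.0
        curve.append((hits / rank, recall))
    return curve
```

Here ranked_ids would be the other documents ordered by similarity to the query in the model's hidden representation. Averaging the curves over many query documents gives a single chart.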