RoBiRank

RoBiRank: RObust BInary classification algorithm for RANKing

RoBiRank is a package for learning to rank and latent collaborative retrieval. Please refer to the paper [1] for a detailed description of the algorithm.

Table of Contents

  • License
  • Introduction
  • Getting Example Data
  • CROBIRANK
    • Installation
      • Prerequisites
      • Compilation
    • Input
    • Execution
    • Output
  • ROBIE
    • Installation
      • Prerequisites
      • Compilation
    • Input
    • Execution
    • Output

License

RoBiRank is licensed under the Apache License, Version 2.0; see the LICENSE file for details.

Introduction

The RoBiRank software package consists of two components. The first component, CROBIRANK, implements conventional learning to rank as described in Section 3 of the paper [1]. The second component, ROBIE, implements latent collaborative retrieval as described in Section 4 of the paper [1]. The two components have different requirements; please refer to the individual sections below for installation guides.

Getting Example Data

We provide a Python script which downloads datasets from the web and preprocesses them for use with RoBiRank. To execute the script, first change the current directory to ${ROBIRANK_ROOT}/scripts/,

$ cd ${ROBIRANK_ROOT}/scripts/

and then run the script:

$ python get_data.py

This will download and preprocess the datasets used in the examples below.

CROBIRANK

CROBIRANK is short for Conventional RObust BInary RANKing.

Installation

Prerequisites

CROBIRANK only supports UNIX-based systems and depends on the following software:

Compilation

To compile CROBIRANK, first move to the path where the source code of CROBIRANK is located:

$ cd ./crobirank

Run 'make' to compile:

$ make

Input

Input data for CROBIRANK has to be in LIBSVM format, i.e.:

label qid:xyz f1:v1 f2:v2 ...
  • label takes integer values in the range 1-5
  • xyz represents the query id
  • f1:v1 represents a <feature id>:<feature value> pair for a document
  • Note: please also remove any comments appearing in the data (in LIBSVM format, comments appear after '#' at the end of the feature list); a short script for stripping them is sketched after the example below

For example, the first few lines of letor-3.0/OHSUMED/train.txt look like:

1 qid:1 1:1.000000 2:1.000000 3:0.833333 4:0.871264 5:0 6:0 7:0 8:0.941842 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:1.000000 16:1.000000 17:1.000000 18:0.719697 19:0.729351 20:0 21:0 22:0 23:0.811565 24:1.000000 25:0.972730 26:1.000000 27:1.000000 28:0.922374 29:0.946654 30:0.938888 31:1.000000 32:1.000000 33:0.711276 34:0.722202 35:0 36:0 37:0 38:0.798002 39:1.000000 40:1.000000 41:1.000000 42:1.000000 43:0.959134 44:0.963919 45:0.971425
3 qid:1 1:0.600000 2:0.600000 3:1.000000 4:1.000000 5:0 6:0 7:0 8:1.000000 9:0.624834 10:0.767301 11:0.816099 12:0.934805 13:0.649685 14:0.680222 15:0.686762 16:0.421053 17:0.680904 18:1.000000 19:1.000000 20:0 21:0 22:0 23:1.000000 24:0.401391 25:0.938966 26:0.949446 27:0.984769 28:0.955266 29:1.000000 30:0.997786 31:0.441860 32:0.687033 33:1.000000 34:1.000000 35:0 36:0 37:0 38:1.000000 39:0.425450 40:0.975968 41:0.928785 42:0.978524 43:0.979553 44:1.000000 45:1.000000
1 qid:1 1:0.400000 2:0.400000 3:0.555555 4:0.563658 5:0 6:0 7:0 8:0.545844 9:0.380576 10:0.427356 11:0.468244 12:0.756579 13:0.366316 14:0.360838 15:0.373909 16:0.210526 17:0.479859 18:0.595237 19:0.608701 20:0 21:0 22:0 23:0.613865 24:0.184562 25:0.791539 26:0.863833 27:0.957024 28:0.896468 29:0.941132 30:0.946305 31:0.232558 32:0.507810 33:0.603068 34:0.616847 35:0 36:0 37:0 38:0.614004 39:0.202374 40:0.812801 41:0.868091 42:0.958879 43:0.926045 44:0.944576 45:0.963753
1 qid:1 1:0.200000 2:0.200000 3:0.277778 4:0.281830 5:0 6:0 7:0 8:0.330027 9:0.244258 10:0.303171 11:0.301165 12:0.614995 13:0.255036 14:0.188541 15:0.205136 16:0.368421 17:0.637735 18:0.795453 19:0.802720 20:0 21:0 22:0 23:0.851563 24:0.364249 25:0.897048 26:0.914052 27:0.973615 28:0.923365 29:0.954078 30:0.965119 31:0.348837 32:0.600470 33:0.711821 34:0.718938 35:0 36:0 37:0 38:0.773746 39:0.354506 40:0.871098 41:0.865835 42:0.958123 43:0.917441 44:0.918191 45:0.935256
2 qid:1 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0 6:0 7:0 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.263158 17:0.592453 18:0.328947 19:0.340126 20:0 21:0 22:0 23:0.393538 24:0.258450 25:0.651875 26:0.885764 27:0.964385 28:0.976550 29:0.970655 30:0.984686 31:0.232558 32:0.546115 33:0.295381 34:0.306186 35:0 36:0 37:0 38:0.355037 39:0.230125 40:0.613531 41:0.860290 42:0.956255 43:0.973015 44:0.939784 45:0.960934
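
LIBSVM-format datasets found on the web sometimes carry trailing comments. Below is a minimal Python sketch for stripping them, assuming comments start at the first '#' on a line; the script name and file paths are illustrative, not part of RoBiRank:

# strip_comments.py -- illustrative helper, not part of RoBiRank.
# Drops everything from the first '#' to the end of each line.
import sys

def strip_comments(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            cleaned = line.split("#", 1)[0].rstrip()
            if cleaned:  # skip lines that become empty
                fout.write(cleaned + "\n")

if __name__ == "__main__":
    strip_comments(sys.argv[1], sys.argv[2])

For example:

$ python strip_comments.py train.txt train_clean.txt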

Execution

The command line arguments for running CROBIRANK are as follows:

./ranker 
            -train <path> 
            -test <path> 
            -trunc <ndcg_trunc> 
            -lambda <reg> 
            -alpha <alpha_val> 
            -tol <gnorm_tol> 
            -loss <loss_function> 
            -solver <solver_name>

Note:

  • <gnorm_tol> is the gradient norm tolerance below which optimization stops (for example, 1e-09)
  • <alpha_val> is a weight parameter for the pairwise loss (the default value is 1.0)
  • <loss_function> is one of {ROBIRANK=0, INFINITEPUSH=1, LOGISTIC=2, IDENTITY=3, TLOGISTIC=4}. Select ROBIRANK (=0) to run CROBIRANK; the other loss functions are baselines and competitor algorithms that we compared CROBIRANK against (more details are available in the paper [1]).
  • <solver_name> is one of {LBFGS=0, BMRM=1} (the default solver is LBFGS)

For example, to run CROBIRANK on the letor-4.0/MQ2007 dataset, the command would be:

$ ./ranker -train /group/ml/data/mc/letor-4.0/MQ2007/Fold1/train2.txt -test /group/ml/data/mc/letor-4.0/MQ2007/Fold1/test2.txt -trunc 20 -lambda 0.001 -alpha 1.0 -tol 0.000001 -loss 0

Output

CROBIRANK outputs NDCG@K for each K up to the truncation level (20 in the example above), as well as the overall NDCG (computed without truncation), at the end of each iteration, as follows:

Iter: 8, f: 0.525547, gnorm: 2.42311e-05 nz: 5
num_fn_eval 9 CPU: 3.56 Wallclock: 3.56949 grad_t: 2.81 matvec_t 0.2
Tao Termination Reason: 1
Average NDCG@1: 0.660431
Average NDCG@2: 0.672977
Average NDCG@3: 0.676301
Average NDCG@4: 0.686818
Average NDCG@5: 0.697407
Average NDCG@6: 0.705531
Average NDCG@7: 0.716439
Average NDCG@8: 0.723835
Average NDCG@9: 0.727928
Average NDCG@10: 0.734309
Average NDCG@11: 0.741103
Average NDCG@12: 0.749144
Average NDCG@13: 0.755709
Average NDCG@14: 0.762096
Average NDCG@15: 0.768252
Average NDCG@16: 0.774602
Average NDCG@17: 0.780825
Average NDCG@18: 0.787085
Average NDCG@19: 0.793157
Average NDCG@20: 0.797986
Overall NDCG: 0.885386

Total CPU time: 3.64 seconds
Total wallclock time: 3.65344 seconds
Obj grad CPU time: 2.81 seconds
Obj grad wallclock time: 2.81471 seconds
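
For reference, NDCG@K for a single query divides the discounted cumulative gain of the predicted ranking by that of the ideal ranking. The sketch below uses one common gain/discount convention, (2^rel - 1) / log2(position + 1); the exact convention CROBIRANK implements may differ in minor details:

import math

def dcg_at_k(rels, k):
    # rels: relevance labels of documents, listed in predicted ranking order
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)  # DCG of the ideal ordering
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

The reported "Average NDCG@K" is then, presumably, the mean of this quantity over all test queries.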

ROBIE

ROBIE is short for RObust BInary classification Embedding.

Installation

Prerequisites

Currently, ROBIE only supports UNIX-based systems. It depends on the following software:

  • A modern C++ compiler with C++11 support (for example, gcc 4.7 and above)
  • An MPI library with multi-threading support (MPICH2 and MVAPICH2 are recommended)
  • Intel Threading Building Blocks (at least 4.2)
  • CMake (at least 2.6)
  • Boost library (at least 1.49)

Compilation

To compile ROBIE, first move to the path where the source code of ROBIE is located:

$ cd ./robie

Then, run CMake to generate Makefiles:

$ cmake .

If CMake was successful, you can run 'make' to compile:

$ make

Input

To input data to ROBIE, prepare a directory containing three data files: a metadata file, a training data file, and a test data file. Please refer to ./data/ml1m for an example. The metadata file has to be named meta, and should be formatted as:

${number of users} ${number of items}
${number of training data points} ${name of training data file}
${number of test data points} ${name of test data file}

For example, ./data/ml1m/meta reads:

6040 3952
900188 ml1m-train.txt
100021 ml1m-test.txt

Each line of the training and test data files should consist of an observed user index and item index pair, separated by a space. For example, ./data/ml1m/ml1m-train.txt begins with:

 1 1
 1 48
 1 145
 1 254

The first column is the user index, and the second column is the item index. Note that indices are 1-based: user indices range from 1 to ${number of users}, and item indices range from 1 to ${number of items}.
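
Before running ROBIE, it can be useful to verify that the counts and index ranges in a data directory agree with its meta file. The following is a minimal sketch assuming the layout described above; the script name and messages are illustrative, not part of ROBIE:

# check_data.py -- illustrative sanity check for a ROBIE data directory.
import os, sys

def check(path):
    with open(os.path.join(path, "meta")) as f:
        n_users, n_items = map(int, f.readline().split())
        n_train, train_file = f.readline().split()
        n_test, test_file = f.readline().split()
    for count, name in [(int(n_train), train_file), (int(n_test), test_file)]:
        with open(os.path.join(path, name)) as f:
            pairs = [line.split() for line in f if line.strip()]
        assert len(pairs) == count, "%s: expected %d lines, found %d" % (name, count, len(pairs))
        for u, i in pairs:
            # 1-based indices must lie in [1, n_users] and [1, n_items]
            assert 1 <= int(u) <= n_users and 1 <= int(i) <= n_items

if __name__ == "__main__":
    check(sys.argv[1])

For example:

$ python check_data.py ../data/ml1m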

Execution

To list the available options for ROBIE, run

$ ./robie --help

processor name: Hyokun-Yuns-Machine.local, number of tasks: 1, rank: 0
robie options::
  --help                                produce help message
  --path arg (=../data/ml1m)            path of data
  --nthreads arg (=4)                   number of threads for each MPI process
  --lrate arg (=0.001)                  value of learning rate parameter
  --reg arg (=0.050000000000000003)     value of regularization parameter
  --dim arg (=100)                      dimensionality of latent space
  --iter arg (=1000)                    number of iterations
  --seed arg (=12345)                   random number generator seed value
  --period arg (=100)                   period of xi variable updates
  --sam arg (=1000)                     number of samples to approximate the
                                        ranking loss
  --level arg (=1,2,3,4,5,10,100,200,500)
                                        levels at which precision values will
                                        be calculated
  --loss arg (=logistic)                name of the loss (logistic, hinge)
  --sensitivity arg (=1)                sensitivity
  --sgd_num arg (=1)                    number of stochastic gradients taken for each data point
  --dump_prefix arg                     prefix to the parameter dump file
  --dump_period arg (=100)              period to dump parameters

On a single machine, you can simply set the nthreads parameter to change the number of threads used. In a multi-machine environment, you can use MPI to distribute the computation across multiple machines, for example:

$ mpirun -np 32 ./robie --path ../data/msd --nthreads 8

This launches RoBiRank with 32 MPI processes and 8 threads per process. Be careful to ensure that each MPI process is launched on a different physical machine, however; in the default configuration, many computing clusters launch as many MPI processes per machine as the machine has cores. Also, note that in a multi-machine environment it is usually beneficial to increase the number of stochastic gradients taken per data point, so that the inter-machine communication cost is amortized. For the Million Song Dataset (MSD) experiment in the paper, we used:

$ mpirun -np 32 ./robie --path ../data/msd --nthreads 14 --sgd_num 1000 --lrate 0.0002

Output

ROBIE outputs precision and recall values at the end of each iteration, as follows:

  iteration: 30

  2014-Dec-29 22:37:55 - calculate metrics
  precision @1: 0.102284, recall @1: 0.00608878
  precision @2: 0.0932146, recall @2: 0.0110978
  precision @3: 0.0894077, recall @3: 0.0159668
  precision @4: 0.0856567, recall @4: 0.0203959
  precision @5: 0.0830366, recall @5: 0.0247151
  precision @10: 0.0747397, recall @10: 0.0444911
  precision @100: 0.039864, recall @100: 0.237303
  precision @200: 0.0307785, recall @200: 0.366437
  precision @500: 0.0202358, recall @500: 0.6023
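
These follow the standard top-K retrieval definitions: precision@K is the fraction of the K highest-ranked items that appear in the user's test set, and recall@K is the fraction of the user's test items that appear in the top K. A minimal sketch of the per-user computation (variable names are illustrative):

def precision_recall_at_k(ranked_items, test_items, k):
    # ranked_items: item indices sorted by predicted score, best first
    # test_items: item indices observed for this user in the test set
    relevant = set(test_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / float(k), hits / float(len(relevant))

The numbers ROBIE prints are, presumably, these quantities averaged over users.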

You can also choose to dump the parameters for users and items by specifying the `--dump_prefix` parameter:

$ ./robie --dump_prefix testdump

Every --dump_period iterations, user and item parameters are dumped to {$dump_prefix}_user_{$mpi_process_number}_{$iteration}.txt and {$dump_prefix}_item_{$iteration}.txt files, respectively. The files are comma-separated, as follows:

 $ head -2 testdump_user_0_0.txt
 4481,0.35763,0.400443,0.689383,0.559736,0.574451,0.207691,0.0286627,0.688924,0.469343,0.207153,0.00393225,0.0130297,0.420422,0.616191,0.894884,0.410636,0.260769,0.0267289,0.107942,0.366684,0.00388677,0.940092,0.691276,0.918666,0.419732,0.804096,0.598631,0.00344757,0.488318,0.882205,0.934262,0.985686,0.431288,0.128012,0.870147,0.771544,0.700178,0.644223,0.635745,0.997823,0.421686,0.120909,0.116037,0.424631,0.171976,0.102715,0.239245,0.975811,0.348276,0.652097,0.530479,0.360154,0.878255,0.0815044,0.495509,0.342771,0.472669,0.696106,0.0742215,0.407483,0.196762,0.662082,0.877149,0.194233,0.199886,0.49071,0.939058,0.411113,0.800496,0.171114,0.466184,0.938207,0.402415,0.984141,0.140555,0.439857,0.895541,0.676024,0.966604,0.411874,0.222692,0.727925,0.502807,0.0491344,0.368421,0.359981,0.738529,0.92369,0.00989853,0.152864,0.884072,0.126909,0.241311,0.442779,0.863666,0.840627,0.261057,0.893876,0.856575,0.0944128
 3865,0.97846,0.969047,0.0932573,0.656886,0.79098,0.00792599,0.537418,0.236619,0.605218,0.582318,0.142386,0.0447971,0.663991,0.346602,0.975548,0.639047,0.864669,0.950879,0.20371,0.989726,0.422227,0.515145,0.439087,0.572854,0.403268,0.444545,0.0621456,0.00783754,0.43264,0.0808219,0.812791,0.601973,0.353686,0.649357,0.0940712,0.533887,0.517416,0.866755,0.929796,0.871021,0.977452,0.685531,0.489648,0.646825,0.166647,0.699137,0.77512,0.370826,0.353945,0.111347,0.284425,0.675537,0.220483,0.737389,0.759212,0.550829,0.485901,0.406568,0.801666,0.114983,0.861791,0.12048,0.767589,0.441922,0.430928,0.841202,0.978869,0.788125,0.579014,0.775424,0.751146,0.0356785,0.92553,0.286392,0.580934,0.233389,0.320928,0.0166644,0.627389,0.796981,0.185464,0.458001,0.0958413,0.418092,0.784017,0.370937,0.82567,0.461983,0.747665,0.244822,0.147145,0.443867,0.830352,0.49041,0.670371,0.425406,0.952386,0.996727,0.59511,0.561171

The first column is the index of the user, and the rest of the line is the vector representation of the user. The item parameter file is formatted in the same manner.
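
To analyze the dumped parameters in Python, each line can be parsed into an index and a parameter vector. A minimal sketch, assuming the comma-separated layout shown above:

def load_dump(path):
    # Returns a dict mapping entity index -> list of parameter values.
    params = {}
    with open(path) as f:
        for line in f:
            fields = line.strip().split(",")
            params[int(fields[0])] = [float(x) for x in fields[1:]]
    return params

user_params = load_dump("testdump_user_0_0.txt")  # file name from the example above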

[1] Hyokun Yun, Parameswaran Raman, and S. V. N. Vishwanathan. Ranking via Robust Binary Classification. NIPS 2014. http://papers.nips.cc/paper/5363-ranking-via-robust-binary-classification