RoBiRank
========

RoBiRank: RObust BInary classification algorithm for RANKing

RoBiRank is a package for learning to rank and latent collaborative retrieval. Please refer to the paper [1] for a detailed description of the algorithm.

Table of Contents
=================

- License
- Introduction
- Getting Example Data
- CROBIRANK
  - Installation
    - Prerequisites
    - Compilation
  - Input
  - Execution
  - Output
- ROBIE
  - Installation
    - Prerequisites
    - Compilation
  - Input
  - Execution
  - Output

License
=======

RoBiRank is under the Apache License ver 2.0; see the LICENSE file for detailed information.

Introduction
============

The RoBiRank software package consists of two components. The first component, called `CROBIRANK`, is for conventional learning to rank modeling as described in Section 3 of the paper [1]. The second component, called `ROBIE`, is for latent collaborative retrieval as described in Section 4 of the paper [1]. The two components have different requirements; please refer to the individual sections for installation guides.

Getting Example Data
====================

We provide a Python script which downloads datasets from the web and preprocesses them for use with RoBiRank. To execute the script, first change the current directory to `${ROBIRANK_ROOT}/scripts/`:

    $ cd ${ROBIRANK_ROOT}/scripts/

and then run the script:

    $ python get_data.py

This will download and preprocess the following datasets:

- LETOR-3.0 ( http://research.microsoft.com/en-us/um/beijing/projects/letor//letor3dataset.aspx )
- MovieLens 1M ( http://grouplens.org/datasets/movielens/ )

CROBIRANK
=========

CROBIRANK is short for Conventional RObust BInary RANKing.

Installation
------------

### Prerequisites

CROBIRANK only supports UNIX-based systems.
It is dependent on the following software:

- C++ compiler with C++98 support (gcc 4.7 and above, for example)
- PETSc (Portable, Extensible Toolkit for Scientific Computation) - http://www.mcs.anl.gov/petsc/
- TAO (Toolkit for Advanced Optimization) - http://www.mcs.anl.gov/research/projects/tao/

### Compilation

To compile `CROBIRANK`, first move to the path where the source code of `CROBIRANK` is located:

    $ cd ./crobirank

Run `make` to compile:

    $ make

Input
-----

Input data for CROBIRANK has to be in LIBSVM format, i.e.,

    label qid:xyz f1:v1 f2:v2 ...

- `label` takes values in the range [1-5]
- `xyz` represents the query id
- `f1:v1` represents the <feature id: feature value> pair for a document
- Note: please also remove any comments appearing in the data (in LIBSVM format, comments appear after `#` at the end of the feature list)

For example, the first few lines of `letor-3.0/OHSUMED/train.txt` look like:

    1 qid:1 1:1.000000 2:1.000000 3:0.833333 4:0.871264 5:0 6:0 7:0 8:0.941842 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:1.000000 16:1.000000 17:1.000000 18:0.719697 19:0.729351 20:0 21:0 22:0 23:0.811565 24:1.000000 25:0.972730 26:1.000000 27:1.000000 28:0.922374 29:0.946654 30:0.938888 31:1.000000 32:1.000000 33:0.711276 34:0.722202 35:0 36:0 37:0 38:0.798002 39:1.000000 40:1.000000 41:1.000000 42:1.000000 43:0.959134 44:0.963919 45:0.971425
    3 qid:1 1:0.600000 2:0.600000 3:1.000000 4:1.000000 5:0 6:0 7:0 8:1.000000 9:0.624834 10:0.767301 11:0.816099 12:0.934805 13:0.649685 14:0.680222 15:0.686762 16:0.421053 17:0.680904 18:1.000000 19:1.000000 20:0 21:0 22:0 23:1.000000 24:0.401391 25:0.938966 26:0.949446 27:0.984769 28:0.955266 29:1.000000 30:0.997786 31:0.441860 32:0.687033 33:1.000000 34:1.000000 35:0 36:0 37:0 38:1.000000 39:0.425450 40:0.975968 41:0.928785 42:0.978524 43:0.979553 44:1.000000 45:1.000000
    1 qid:1 1:0.400000 2:0.400000 3:0.555555 4:0.563658 5:0 6:0 7:0 8:0.545844 9:0.380576 10:0.427356 11:0.468244 12:0.756579 13:0.366316 14:0.360838 15:0.373909 16:0.210526 17:0.479859 18:0.595237 19:0.608701 20:0 21:0 22:0 23:0.613865 24:0.184562 25:0.791539 26:0.863833 27:0.957024 28:0.896468 29:0.941132 30:0.946305 31:0.232558 32:0.507810 33:0.603068 34:0.616847 35:0 36:0 37:0 38:0.614004 39:0.202374 40:0.812801 41:0.868091 42:0.958879 43:0.926045 44:0.944576 45:0.963753
    1 qid:1 1:0.200000 2:0.200000 3:0.277778 4:0.281830 5:0 6:0 7:0 8:0.330027 9:0.244258 10:0.303171 11:0.301165 12:0.614995 13:0.255036 14:0.188541 15:0.205136 16:0.368421 17:0.637735 18:0.795453 19:0.802720 20:0 21:0 22:0 23:0.851563 24:0.364249 25:0.897048 26:0.914052 27:0.973615 28:0.923365 29:0.954078 30:0.965119 31:0.348837 32:0.600470 33:0.711821 34:0.718938 35:0 36:0 37:0 38:0.773746 39:0.354506 40:0.871098 41:0.865835 42:0.958123 43:0.917441 44:0.918191 45:0.935256
    2 qid:1 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0 6:0 7:0 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.263158 17:0.592453 18:0.328947 19:0.340126 20:0 21:0 22:0 23:0.393538 24:0.258450 25:0.651875 26:0.885764 27:0.964385 28:0.976550 29:0.970655 30:0.984686 31:0.232558 32:0.546115 33:0.295381 34:0.306186 35:0 36:0 37:0 38:0.355037 39:0.230125 40:0.613531 41:0.860290 42:0.956255 43:0.973015 44:0.939784 45:0.960934

Execution
---------

Command line arguments needed to run CROBIRANK are:

    ./ranker -train <path> -test <path> -trunc <ndcg_trunc> -lambda <reg> -alpha <alpha_val> -tol <gnorm_tol> -loss <loss_function> -solver <solver_name>

Note:

- `<gnorm_tol>` is the minimum desired gradient norm tolerance value for stopping (for example, 1e-09)
- `<alpha_val>` is a weight parameter for the pairwise loss (the default value is 1.0)
- `<loss_function>` is one from the list {ROBIRANK=0, INFINITEPUSH=1, LOGISTIC=2, IDENTITY=3, TLOGISTIC=4}. Select ROBIRANK (=0) to run CROBIRANK.
  The other loss functions are baselines and competitor algorithms that we compared CROBIRANK against (more details are available in the paper [1]).
- `<solver_name>` is one from the list {LBFGS=0, BMRM=1} (the default solver is LBFGS)

For example, to run CROBIRANK on the letor-4.0/MQ2007 dataset, the command would be:

    $ ./ranker -train /group/ml/data/mc/letor-4.0/MQ2007/Fold1/train2.txt -test /group/ml/data/mc/letor-4.0/MQ2007/Fold1/test2.txt -trunc 20 -lambda 0.001 -alpha 1.0 -tol 0.000001 -loss 0

Output
------

CROBIRANK outputs NDCG@K (values up to truncation level 20) as well as the overall NDCG (without any truncation) at the end of each iteration, as follows:

    Iter: 8, f: 0.525547, gnorm: 2.42311e-05 nz: 5 num_fn_eval 9 CPU: 3.56 Wallclock: 3.56949 grad_t: 2.81 matvec_t 0.2
    Tao Termination Reason: 1
    Average NDCG@1: 0.660431
    Average NDCG@2: 0.672977
    Average NDCG@3: 0.676301
    Average NDCG@4: 0.686818
    Average NDCG@5: 0.697407
    Average NDCG@6: 0.705531
    Average NDCG@7: 0.716439
    Average NDCG@8: 0.723835
    Average NDCG@9: 0.727928
    Average NDCG@10: 0.734309
    Average NDCG@11: 0.741103
    Average NDCG@12: 0.749144
    Average NDCG@13: 0.755709
    Average NDCG@14: 0.762096
    Average NDCG@15: 0.768252
    Average NDCG@16: 0.774602
    Average NDCG@17: 0.780825
    Average NDCG@18: 0.787085
    Average NDCG@19: 0.793157
    Average NDCG@20: 0.797986
    Overall NDCG: 0.885386
    Total CPU time: 3.64 seconds
    Total wallclock time: 3.65344 seconds
    Obj grad CPU time: 2.81 seconds
    Obj grad wallclock time: 2.81471 seconds

ROBIE
=====

ROBIE is short for RObust BInary classification Embedding.

Installation
------------

### Prerequisites

Currently, ROBIE only supports UNIX-based systems. It is dependent on the following software:

- Modern C++ compiler with C++11 support (gcc 4.7 and above, for example)
- MPI library with multi-threading support (MPICH2 and MVAPICH2 are recommended)
- Intel Thread Building Blocks (at least 4.2)
- CMake (at least 2.6)
- Boost library (at least 1.49)

### Compilation

To compile `ROBIE`, first move to the path where the source code of `ROBIE` is located:

    $ cd ./robie

Then, run CMake to generate Makefiles:

    $ cmake .

If CMake was successful, you can run `make` to compile:

    $ make

Input
-----

To input data to ROBIE, you should prepare a directory with three data files: a metadata file, a training data file, and a test data file. Please refer to `./data/ml1m` for an example. The metadata file has to be named `meta`, and should be formatted as:

    ${number of users} ${number of items}
    ${number of training data points} ${name of training data file}
    ${number of test data points} ${name of test data file}

For example, `./data/ml1m/meta` reads:

    6040 3952
    900188 ml1m-train.txt
    100021 ml1m-test.txt

Each line of the training and test data files should consist of an observed user index and item index pair, separated by a space. For example, `./data/ml1m/ml1m-train.txt` looks like:

    1 1
    1 48
    1 145
    1 254

The first column is the user index, and the second column is the item index. Note that we use 1-based indexing, so user indices range from `1` to `${number of users}`, and item indices range from `1` to `${number of items}`.
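A dataset directory in this layout can be read with a few lines of Python. The sketch below is our own (not part of the package); it assumes the three-line `meta` layout described above and that the data files contain one space-separated `user item` pair per line:

```python
import os

def load_robie_dataset(path):
    """Read a ROBIE-style dataset directory: a `meta` file plus the
    training/test files it names (format as described above)."""
    with open(os.path.join(path, "meta")) as f:
        n_users, n_items = map(int, f.readline().split())
        n_train, train_name = f.readline().split()
        n_test, test_name = f.readline().split()

    def read_pairs(name, expected):
        pairs = []
        with open(os.path.join(path, name)) as f:
            for line in f:
                u, i = map(int, line.split())
                # indices are 1-based, per the README
                assert 1 <= u <= n_users and 1 <= i <= n_items
                pairs.append((u, i))
        assert len(pairs) == int(expected)  # sanity-check against meta
        return pairs

    return read_pairs(train_name, n_train), read_pairs(test_name, n_test)
```

The assertions catch the most common preparation mistakes: 0-based indices and a `meta` file whose counts disagree with the data files.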
Execution
---------

To look at the options for RoBiRank, run:

    $ ./robie --help
    processor name: Hyokun-Yuns-Machine.local, number of tasks: 1, rank: 0
    robie options::
      --help                                  produce help message
      --path arg (=../data/ml1m)              path of data
      --nthreads arg (=4)                     number of threads for each MPI process
      --lrate arg (=0.001)                    value of learning rate parameter
      --reg arg (=0.050000000000000003)       value of regularization parameter
      --dim arg (=100)                        dimensionality of latent space
      --iter arg (=1000)                      number of iterations
      --seed arg (=12345)                     random number generator seed value
      --period arg (=100)                     period of xi variable updates
      --sam arg (=1000)                       number of samples to approximate the ranking loss
      --level arg (=1,2,3,4,5,10,100,200,500) levels precision values will be calculated
      --loss arg (=logistic)                  name of the loss (logistic, hinge)
      --sensitivity arg (=1)                  sensitivity
      --sgd_num arg (=1)                      number of stochastic gradients taken for each data point
      --dump_prefix arg                       prefix to the parameter dump file
      --dump_period arg (=100)                period to dump parameters

On a single machine, you can simply set the `nthreads` parameter to change the number of threads used. In a multi-machine environment, you can use MPI to distribute the computation across multiple machines, for example:

    $ mpirun -np 32 ./robie --path ../data/msd --nthreads 8

This launches RoBiRank as 32 MPI processes with 8 threads each. Be careful, however, to make sure that each MPI process is launched on a different physical machine; in the default configuration, many computing clusters launch as many MPI processes on each machine as there are cores. Also note that in a multi-machine environment it is usually beneficial to increase the number of stochastic gradients taken per data point, so that the inter-machine communication cost is amortized.
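One way to ensure one MPI process per physical machine is to pass the launcher a machine file listing each node once. The exact flag depends on your MPI implementation, so the snippet below is only a sketch for MPICH-style launchers (the `hosts` file name and `node01`..`node32` names are placeholders for your cluster):

```shell
# hosts: one entry per physical machine (placeholders shown)
$ cat hosts
node01
node02
...
node32

# MPICH/MVAPICH accept -f <machinefile>; Open MPI uses --hostfile instead
$ mpirun -np 32 -f hosts ./robie --path ../data/msd --nthreads 8
```

With one entry per node, `-np 32` places exactly one process on each of the 32 machines, and `--nthreads` then controls per-machine parallelism.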
For the Million Song Dataset (MSD) experiment in the paper, we used:

    $ mpirun -np 32 ./robie --path ../data/msd --nthreads 14 --sgd_num 1000 --lrate 0.0002

Output
------

ROBIE outputs precision and recall values at the end of each iteration, as follows:

    iteration: 30
    2014-Dec-29 22:37:55 - calculate metrics
    precision @1: 0.102284, recall @1: 0.00608878
    precision @2: 0.0932146, recall @2: 0.0110978
    precision @3: 0.0894077, recall @3: 0.0159668
    precision @4: 0.0856567, recall @4: 0.0203959
    precision @5: 0.0830366, recall @5: 0.0247151
    precision @10: 0.0747397, recall @10: 0.0444911
    precision @100: 0.039864, recall @100: 0.237303
    precision @200: 0.0307785, recall @200: 0.366437
    precision @500: 0.0202358, recall @500: 0.6023

You can also choose to dump the user and item parameters by specifying the `--dump_prefix` parameter:

    $ ./robie --dump_prefix testdump

Once every `--dump_period` iterations, user and item parameters are dumped to `{$dump_prefix}_user_{$mpi_process_number}_{$iteration}.txt` and `{$dump_prefix}_item_{$iteration}.txt` files.
The files are comma-separated, as follows:

    $ head -2 testdump_user_0_0.txt
    4481,0.35763,0.400443,0.689383,0.559736,0.574451,0.207691,0.0286627,0.688924,0.469343,0.207153,0.00393225,0.0130297,0.420422,0.616191,0.894884,0.410636,0.260769,0.0267289,0.107942,0.366684,0.00388677,0.940092,0.691276,0.918666,0.419732,0.804096,0.598631,0.00344757,0.488318,0.882205,0.934262,0.985686,0.431288,0.128012,0.870147,0.771544,0.700178,0.644223,0.635745,0.997823,0.421686,0.120909,0.116037,0.424631,0.171976,0.102715,0.239245,0.975811,0.348276,0.652097,0.530479,0.360154,0.878255,0.0815044,0.495509,0.342771,0.472669,0.696106,0.0742215,0.407483,0.196762,0.662082,0.877149,0.194233,0.199886,0.49071,0.939058,0.411113,0.800496,0.171114,0.466184,0.938207,0.402415,0.984141,0.140555,0.439857,0.895541,0.676024,0.966604,0.411874,0.222692,0.727925,0.502807,0.0491344,0.368421,0.359981,0.738529,0.92369,0.00989853,0.152864,0.884072,0.126909,0.241311,0.442779,0.863666,0.840627,0.261057,0.893876,0.856575,0.0944128
    3865,0.97846,0.969047,0.0932573,0.656886,0.79098,0.00792599,0.537418,0.236619,0.605218,0.582318,0.142386,0.0447971,0.663991,0.346602,0.975548,0.639047,0.864669,0.950879,0.20371,0.989726,0.422227,0.515145,0.439087,0.572854,0.403268,0.444545,0.0621456,0.00783754,0.43264,0.0808219,0.812791,0.601973,0.353686,0.649357,0.0940712,0.533887,0.517416,0.866755,0.929796,0.871021,0.977452,0.685531,0.489648,0.646825,0.166647,0.699137,0.77512,0.370826,0.353945,0.111347,0.284425,0.675537,0.220483,0.737389,0.759212,0.550829,0.485901,0.406568,0.801666,0.114983,0.861791,0.12048,0.767589,0.441922,0.430928,0.841202,0.978869,0.788125,0.579014,0.775424,0.751146,0.0356785,0.92553,0.286392,0.580934,0.233389,0.320928,0.0166644,0.627389,0.796981,0.185464,0.458001,0.0958413,0.418092,0.784017,0.370937,0.82567,0.461983,0.747665,0.244822,0.147145,0.443867,0.830352,0.49041,0.670371,0.425406,0.952386,0.996727,0.59511,0.561171

The first column is the index of the user, and the rest of the line is the vector representation of the
user. The item parameter file is formatted in the same manner.

[1] Hyokun Yun, Parameswaran Raman, S.V.N. Vishwanathan. Ranking via Robust Binary Classification. NIPS 2014. http://papers.nips.cc/paper/5363-ranking-via-robust-binary-classification
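The comma-separated dump files described in the Output section can be loaded back with a short script. This loader is our own sketch (not part of the RoBiRank package) and assumes only the `index,value,value,...` layout shown above:

```python
def load_dump(filename):
    """Parse a RoBiRank parameter dump file: each line is
    `index,v1,v2,...` where index is the 1-based user/item id."""
    vectors = {}
    with open(filename) as f:
        for line in f:
            fields = line.strip().split(",")
            # first field: entity index; remaining fields: latent vector
            vectors[int(fields[0])] = [float(v) for v in fields[1:]]
    return vectors
```

Each returned value has `--dim` entries; with multiple MPI processes, merge the per-process `_user_` files to recover all user vectors.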