README

Implementation of "Combining Instance and Feature Neighbors for Efficient Multi-label Classification" by Len Feremans, data scientist at the University of Antwerp (Belgium) within the ADReM research group.

Abstract of the paper, published at the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA):

"Multi-label classification problems occur naturally in different domains. For example, within text categorization the goal is to predict a set of topics for a document, and within image scene classification the goal is to assign labels to different objects in an image. In this work we propose a combination of two variations of k nearest neighborhoods (kNN) where the first neighborhood is computed instance (or row) based and the second neighborhood is feature (or column) based. Instance based kNN is inspired by user-based collaborative filtering, while feature kNN is inspired by item-based collaborative filtering. Finally we apply a linear combination of instance and feature neighbors scores and apply a single threshold to predict the set of labels. Experiments on various multi-label datasets show that our algorithm outperforms other state-of-the-art methods such as ML-kNN, IBLR and Binary Relevance with SVM, on different evaluation metrics. Finally our algorithm uses an inverted index during neighborhood search and scales to extreme datasets that have millions of instances, features and labels."

What is this repository for?

You can use our algorithm to efficiently predict or suggest labels for new data, based on training data. The algorithm can handle sparse datasets with millions of instances, features and labels (for large datasets, increase the JVM heap, e.g. set -Xmx30g).

Quick start

The implementation is written in Java and built with Maven. To compile the project:

mvn clean install -DskipTests=true

To run the algorithm on a dataset from the command line, use:

java be.uantwerpen.lcif_knn.LCIFCommandLine TRAIN_DATASET.libsvm TEST_DATASET.libsvm METHOD -k1 INSTANCE_K -k2 FEATURE_K -l LAMBDA -t THRESHOLD

A full description of all parameters can be found in the paper; in summary:

  • TRAIN_DATASET and TEST_DATASET must be in libsvm format; at this time only binary features are supported (see the format example after this list).
  • METHOD is one of instance-knn, feature-knn, second-instance-knn, lcif or lcif2. By default we recommend lcif.
  • PARAMETER k1: the number of instance-based neighbors.
  • PARAMETER k2: the number of feature-based neighbors.
  • PARAMETER lambda: controls the linear combination of prediction scores, i.e. score = lambda * instance_score + (1 - lambda) * feature_score, so 1.0 means predictions are fully instance-based and 0.0 means fully feature-based.
  • PARAMETER threshold: the cut-off value on the combined score above which a label is predicted.
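
Both dataset files use the multi-label libsvm sparse format: each line starts with a comma-separated list of label ids, followed by feature:value pairs (with value 1, since only binary features are supported). A small illustrative example:

1,7 3:1 12:1 40:1
2 5:1 12:1

A concrete invocation could then look as follows; the file names and parameter values are purely illustrative, and -Xmx30g increases the JVM heap for large datasets:

java -Xmx30g be.uantwerpen.lcif_knn.LCIFCommandLine train.libsvm test.libsvm lcif -k1 11 -k2 9 -l 0.6 -t 0.25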

For estimating parameters, we suggest using LCIFExperimenter, which performs 10-fold cross-validation on the training data to estimate optimal values for k1, k2, lambda and threshold.
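
Sample code that runs LCIFExperimenter with a 10-fold cross-validated grid search is shown in the API section below.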

More information for researchers and contributors

The current version is 0.9.0.

We depend on the following projects: com.zaxxer.SparseBitSet, mulan, weka, com.google.guava, org.apache.commons and commons-io.

To reproduce the experiments: experimental scripts are coded as unit tests in the packages be.uantwerpen.experiments.extreme and be.uantwerpen.experiments.large. Datasets are not provided, but can be downloaded from the Mulan, Meka and Extreme Classification repositories.
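
For example, a single experiment can be run with Maven Surefire; the test class name below is a placeholder for one of the classes in those packages:

mvn test -Dtest=NameOfExperimentTest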

To contribute to this project: the design of the core code (in be.uantwerpen.lcif_knn) is straightforward, and you can extend this codebase for research, for example to study the effect of alternative thresholding functions, of alternative functions that compute the prediction score from the neighbors, and so on.
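
As an illustration, a thresholding strategy could be isolated behind a small interface; the names and signatures below are hypothetical and not part of the existing codebase:

//Hypothetical extension point for alternative thresholding functions
public interface ThresholdingFunction {
    //decide which labels to predict, given per-label prediction scores
    boolean[] apply(double[] labelScores);
}

//Example: the fixed global cut-off corresponding to the -t parameter
class FixedThreshold implements ThresholdingFunction {
    private final double t;

    FixedThreshold(double t) { this.t = t; }

    @Override
    public boolean[] apply(double[] labelScores) {
        boolean[] predicted = new boolean[labelScores.length];
        for (int i = 0; i < labelScores.length; i++) {
            predicted[i] = labelScores[i] >= t;
        }
        return predicted;
    }
}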

API

There is one data class, LCIFData, that contains all training and test data as nested arrays of integers, as well as the inverted index. For each method (and sub-method) there is a separate class to search for the nearest neighbors (RowKNNSearch), a data class to save, load and represent this set (RowKNNCache), and a class to make predictions (RowKNNPredictions).
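
During neighborhood search, the inverted index maps each feature (column) to the instances (rows) that contain it, so that only rows sharing at least one feature with the query need to be considered as candidate neighbors. A minimal sketch of that idea, independent of the actual LCIFData internals:

import java.util.*;

//Sketch of an inverted index for candidate generation in instance-kNN;
//this illustrates the concept only, not the actual implementation
public class InvertedIndexSketch {
    //featureId -> row ids of the training instances containing that feature
    private final Map<Integer, List<Integer>> index = new HashMap<>();

    //rows[i] holds the (binary) feature ids of training instance i
    public InvertedIndexSketch(int[][] rows) {
        for (int row = 0; row < rows.length; row++) {
            for (int feature : rows[row]) {
                index.computeIfAbsent(feature, f -> new ArrayList<>()).add(row);
            }
        }
    }

    //count overlapping features per candidate row: rows sharing no
    //feature with the query are never touched
    public Map<Integer, Integer> candidateOverlaps(int[] queryFeatures) {
        Map<Integer, Integer> overlap = new HashMap<>();
        for (int feature : queryFeatures) {
            for (int row : index.getOrDefault(feature, Collections.emptyList())) {
                overlap.merge(row, 1, Integer::sum);
            }
        }
        return overlap;
    }
}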

Sample code for running instance-based k-nearest neighbors for multi-label classification:

//training and test are the train/test datasets in libsvm format
LCIFData data = new LCIFData(training, test);
//compute the nearest instance neighbours and save them to disk
File neighborFile = new File("./temp/neighbours.txt");
RowKNNSearch search = new RowKNNSearch();
search.computeAllSimilarRows(data, 3, neighborFile);
//load the saved neighbours into a cache
RowKNNCache cache = new RowKNNCache();
cache.load(neighborFile);
//make predictions for all test rows and save them to disk
File prediction = new File("./temp/predictions.txt");
RowKNNPredictions predictions = new RowKNNPredictions();
predictions.makePredictionAllTestRows(data, cache, 2, prediction);
//load the saved predictions and evaluate them
LCIFPredictions predictionsData = new LCIFPredictions();
predictionsData.load(prediction);
LCIFEvaluation evaluator = new LCIFEvaluation();
evaluator.printBestPredictionAssumingOracle(data, predictionsData);

Sample code for running lcif and estimating optimal parameters using a 10-fold cross-validated grid search:

//Inner loop: find optimal parameters using grid search
int[] kRowRange = MathUtils.makeRange(1, 31, 2); //candidate values for k1
int[] kFeatureRange = MathUtils.makeRange(1, 31, 2); //candidate values for k2
double[] lambdaRange = MathUtils.makeRangeIncl(0.0, 1.0, 0.01); //candidate values for lambda
int folds = 10;
File bestParametersOutput = new File("./gridsearch.txt");
LCIFExperimenter experimenter = new LCIFExperimenter();
List<LCIFResult> results = experimenter.runLCIFGridSearchFolded(trainLibsvm, folds, kRowRange, kFeatureRange, lambdaRange); 
experimenter.reportParameters(trainLibsvm, testLibsvm, results, bestParametersOutput);
//Outer loop: evaluate algorithm on test data
List<LCIFResult> resultsGS = experimenter.parseResults(bestParametersOutput);
List<LCIFResult> outerResults = experimenter.runOuterLoop(trainLibsvm, testLibsvm, resultsGS);
for(LCIFResult res: outerResults){
     System.out.format("LCIF;kRow=%d;kFeature=%d;lambda=%.3f;t=%.3f, %s=%.3f\n", 
      res.kRow, res.kFeature, res.lambda, res.optimalTreshold, res.evaluationValues.get(0).name, res.evaluationValues.get(0).value);
}

Who do I talk to?

E-mail Len Feremans (gmail) for more information.

Licence

Copyright (c) 2017 Universiteit Antwerpen - Len Feremans

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.