Dichotomy classifier

Works using this code should acknowledge the following publications:

  1. Monaco, John V., et al. "Developing a keystroke biometric system for continual authentication of computer users." Intelligence and Security Informatics Conference (EISIC), 2012 European. IEEE, 2012.
  2. Monaco, John V., et al. "Recent Advances in the Development of a Long-Text-Input Keystroke Biometric Authentication System for Arbitrary Text Input." Intelligence and Security Informatics Conference (EISIC), 2013 European. IEEE, 2013.

Dependencies

These scripts depend on the following Python libraries:

pandas
numpy
matplotlib
scikit-learn

Usage

Run the main script with:

$ python main.py

This lists the available commands. There are commands to obtain authentication accuracy using a reference set and a query set of samples, to perform leave-one-out cross-validation (LOOCV), and to perform repeated random subsampling (RRS). The simplest is the auth command, shown here:

usage: python main.py auth [-h] [--k K] [--p P] [--max_between_size MAX_BETWEEN_SIZE]
                        [--max_within_size MAX_WITHIN_SIZE] [--out_dir OUT_DIR]
                        [--comment COMMENT] [--label_col LABEL_COL]
                        [--normalize NORMALIZE] [--norm_distance NORM_DISTANCE]
                        reference query

positional arguments:
    reference             Reference set feature vectors.
    query                 Query set feature vectors.

optional arguments:
    -h, --help            show this help message and exit
    --k K                 K nearest neighbors.
    --p P                 Minkowski parameter: p=2 for Euclidean distance.
    --max_between_size MAX_BETWEEN_SIZE
                        Maximum size for between class difference space
    --max_within_size MAX_WITHIN_SIZE
                        Maximum size for within class difference space
    --out_dir OUT_DIR     Output directory
    --comment COMMENT     Description of the experiment
    --label_col LABEL_COL
                        Column containing the class labels
    --normalize NORMALIZE
                        Normalization method
    --norm_distance NORM_DISTANCE
                        Normalization distance

Feature files

Feature files should be CSV files. Each row contains the user and session, followed by the feature vector. The first two columns should be 'user' and 'session'; the column that holds the class labels can be changed with the --label_col argument. For example:

Vinnie,098aoe0-a90au8au8, 0.23, 0.144, 0.234, 0.89...

Feature vectors do not have to be normalized.
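
As a rough illustration of the layout described above, the following sketch loads a feature file with pandas. It assumes the file has a header row; the column names f0, f1, f2, the session ids, the second user, and the helper load_features are made up for this example and are not part of the scripts.

```python
import io

import pandas as pd

# Hypothetical feature file contents. Only the 'user' and 'session' column
# names come from the description above; everything else is illustrative.
CSV_TEXT = """user,session,f0,f1,f2
Vinnie,s1,0.23,0.144,0.234
Vinnie,s2,0.25,0.150,0.230
Ned,s3,0.61,0.300,0.120
"""


def load_features(source, label_col="user"):
    """Return (labels, feature matrix) from a feature CSV with a header row."""
    df = pd.read_csv(source, skipinitialspace=True)
    labels = df[label_col].to_numpy()
    # Everything after the two identifying columns is the feature vector.
    features = df.drop(columns=["user", "session"]).to_numpy(dtype=float)
    return labels, features


labels, features = load_features(io.StringIO(CSV_TEXT))
```

Passing a file path instead of the StringIO object works the same way, since pandas.read_csv accepts both.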

Dichotomy model classification

The feature vectors are loaded and transformed into a feature difference space. There are two types of feature difference vectors:

Within class: difference vectors between samples of a single user.

Between class: difference vectors between samples of different users.

Authentication is thus mapped to a two-class classification problem: an unknown sample is labeled as either within-class or between-class.
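
The two kinds of difference vectors can be sketched as follows. This is a minimal illustration, not the code in main.py: it assumes the difference is the element-wise absolute difference, and the function names and sample values are hypothetical.

```python
from itertools import combinations

import numpy as np


def within_differences(samples):
    """Absolute differences for every 2-combination of one user's samples."""
    samples = np.asarray(samples, dtype=float)
    return np.array([np.abs(a - b) for a, b in combinations(samples, 2)])


def between_differences(samples_a, samples_b):
    """Absolute differences between every sample of one user and every sample of another."""
    samples_a = np.asarray(samples_a, dtype=float)
    samples_b = np.asarray(samples_b, dtype=float)
    return np.array([np.abs(a - b) for a in samples_a for b in samples_b])


# Made-up feature vectors for two users.
user_x = [[0.23, 0.14], [0.25, 0.15], [0.22, 0.16]]
user_y = [[0.61, 0.30], [0.58, 0.33]]

within = within_differences(user_x)            # 3 samples give 3 pairs
between = between_differences(user_x, user_y)  # 3 x 2 = 6 pairs
```

A classifier trained on these vectors only ever sees two labels (within, between), regardless of how many users are enrolled.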

Process for authenticating an unknown sample claiming to be X:

Training space: First, the training feature difference space is created. The within-class difference vectors are created by taking the differences of every 2-combination of samples from X.

The between-class difference vectors are created by taking the differences between every sample and every sample of a different user. Since this is a very large space, the number of between-class vectors should be limited for large datasets (see --max_between_size).
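
One way to keep the between-class space bounded is to subsample the cross-user pairs at random, as in the sketch below. Uniform random subsampling, the absolute difference, the seed parameter, and the function name are assumptions made for this example; the source does not say how the scripts apply the limit.

```python
import numpy as np


def sample_between_differences(features, labels, max_between_size, seed=0):
    """Return at most max_between_size between-class absolute difference vectors."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # All unordered index pairs (i, j) with i < j. With an absolute
    # difference, the ordered pairs would only duplicate these vectors.
    i, j = np.triu_indices(len(labels), k=1)
    different_user = labels[i] != labels[j]
    i, j = i[different_user], j[different_user]
    # Subsample the pairs uniformly when there are more than the cap allows.
    if len(i) > max_between_size:
        keep = rng.choice(len(i), size=max_between_size, replace=False)
        i, j = i[keep], j[keep]
    return np.abs(features[i] - features[j])


# Made-up data: 5 samples from 3 users give 10 pairs, 8 of them cross-user.
demo_labels = ["a", "a", "b", "b", "c"]
demo_features = np.arange(10, dtype=float).reshape(5, 2)

capped = sample_between_differences(demo_features, demo_labels, max_between_size=5)
uncapped = sample_between_differences(demo_features, demo_labels, max_between_size=100)
```

The number of cross-user pairs grows roughly with the square of the number of samples, which is why a cap matters on large datasets.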

Testing space: Difference vectors are taken between the unknown sample and each sample of the claimed user X. These need to be classified as within-class or between-class.
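
A small sketch of the testing space, again assuming element-wise absolute differences; the function name and the numbers are hypothetical.

```python
import numpy as np


def query_differences(unknown, claimed_samples):
    """One absolute difference vector per enrolled sample of the claimed user."""
    unknown = np.asarray(unknown, dtype=float)
    claimed_samples = np.asarray(claimed_samples, dtype=float)
    # Broadcasting subtracts the unknown sample from every enrolled row.
    return np.abs(claimed_samples - unknown)


# Made-up data: the claimed user has three enrolled samples.
claimed_x = [[0.23, 0.14], [0.25, 0.15], [0.22, 0.16]]
unknown_sample = [0.24, 0.15]

query_diffs = query_differences(unknown_sample, claimed_x)
```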

If user X has M samples, then there will be M unknown test difference vectors. The distance between each test vector and every training vector is computed. This results in M lists of neighbors, which are merged.

The closest neighbors to the original sample in question are now known. Each neighbor is either a within-class or a between-class difference vector. A linear weight is assigned to each difference vector, and the weights of the within-class vectors make up the classifier output score.
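
The scoring step might look like the sketch below. The exact weighting is not spelled out above, so this example assumes rank-based linear weights (k for the closest neighbor down to 1 for the k-th) and a score normalized by the total weight; the function name and data are hypothetical.

```python
import numpy as np


def dichotomy_score(test_diffs, train_diffs, train_is_within, k=3, p=2):
    """Share of linear neighbor weight held by within-class training vectors."""
    test_diffs = np.asarray(test_diffs, dtype=float)
    train_diffs = np.asarray(train_diffs, dtype=float)
    train_is_within = np.asarray(train_is_within, dtype=bool)
    n_train = len(train_diffs)
    # Minkowski distance from each of the M test vectors to each of the
    # N training vectors, giving an (M, N) matrix.
    gaps = np.abs(test_diffs[:, None, :] - train_diffs[None, :, :])
    dists = (gaps ** p).sum(axis=2) ** (1.0 / p)
    # Merge the M neighbor lists into one and keep the k closest entries.
    nearest = np.argsort(dists.ravel(), kind="stable")[:k]
    # A flat index m * N + n refers to training vector n.
    neighbor_is_within = train_is_within[nearest % n_train]
    # Assumed weighting: k, k - 1, ..., 1 from closest to farthest.
    weights = np.arange(len(nearest), 0, -1)
    return float(weights[neighbor_is_within].sum() / weights.sum())


# Made-up training space: two within-class and two between-class vectors.
train_diffs = [[0.01, 0.02], [0.02, 0.01], [0.50, 0.60], [0.70, 0.40]]
train_is_within = [True, True, False, False]

genuine_score = dichotomy_score([[0.015, 0.015], [0.02, 0.02]], train_diffs, train_is_within)
impostor_score = dichotomy_score([[0.60, 0.50], [0.55, 0.55]], train_diffs, train_is_within)
```

Under these assumptions the score lies in [0, 1]; comparing it against a threshold yields the accept or reject decision, and sweeping the threshold traces out an ROC curve.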

Output

The output of each experiment includes a summary in metadata.csv, graphs of the error rates and the ROC curve, coordinates for the ROC curve, and the decision for each authentication performed. By default, the script tries to save the results in an experiments folder in the user's home directory.