Works using this code should acknowledge the following publications:
- Monaco, John V., et al. "Developing a keystroke biometric system for continual authentication of computer users." Intelligence and Security Informatics Conference (EISIC), 2012 European. IEEE, 2012.
- Monaco, John V., et al. "Recent Advances in the Development of a Long-Text-Input Keystroke Biometric Authentication System for Arbitrary Text Input." Intelligence and Security Informatics Conference (EISIC), 2013 European. IEEE, 2013.
These scripts depend on the following Python libraries:
pandas numpy matplotlib scikit-learn
The main script should be run by
$ python main.py
This lists the available commands. There are commands to obtain authentication accuracy using a reference set and a query set of samples, to perform leave-one-out cross-validation (LOOCV), and to perform repeated random subsampling (RRS). The simplest is the auth command, shown here:
usage: python main.py auth [-h] [--k K] [--p P]
                           [--max_between_size MAX_BETWEEN_SIZE]
                           [--max_within_size MAX_WITHIN_SIZE]
                           [--out_dir OUT_DIR] [--comment COMMENT]
                           [--label_col LABEL_COL] [--normalize NORMALIZE]
                           [--norm_distance NORM_DISTANCE]
                           reference query

positional arguments:
  reference             Reference set feature vectors.
  query                 Query set feature vectors.

optional arguments:
  -h, --help            show this help message and exit
  --k K                 K nearest neighbors.
  --p P                 Minkowski parameter: p=2 for Euclidean distance.
  --max_between_size MAX_BETWEEN_SIZE
                        Maximum size for between class difference space
  --max_within_size MAX_WITHIN_SIZE
                        Maximum size for within class difference space
  --out_dir OUT_DIR     Output directory
  --comment COMMENT     Description of the experiment
  --label_col LABEL_COL
                        Column containing the class labels
  --normalize NORMALIZE
                        Normalization method
  --norm_distance NORM_DISTANCE
                        Normalization distance
Feature files should be CSV files. Each row contains the user and session, followed by the feature vector. The first two columns should be 'user' and 'session', although the column that holds the class labels can be specified with the --label_col argument. For example:
Vinnie,098aoe0-a90au8au8, 0.23, 0.144, 0.234, 0.89...
Feature vectors do not have to be normalized.
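As an illustration of the file layout described above, here is a minimal sketch that writes a small feature file and reads it back. It assumes the file carries a header row naming the first two columns 'user' and 'session'; the feature column names (f0, f1, ...), the helper names, and the sample values are hypothetical and are not taken from the scripts.

```python
import csv
import io

# Hypothetical samples: (user, session, feature vector). Values are made up.
samples = [
    ("Vinnie", "s1", [0.23, 0.144, 0.234, 0.89]),
    ("Vinnie", "s2", [0.25, 0.150, 0.229, 0.91]),
    ("Ned", "s1", [0.41, 0.302, 0.118, 0.67]),
]


def write_features(samples, handle):
    """Write one row per sample: user, session, then the feature values."""
    n_features = len(samples[0][2])
    writer = csv.writer(handle)
    # Assumed header row; the feature names f0, f1, ... are placeholders.
    writer.writerow(["user", "session"] + [f"f{i}" for i in range(n_features)])
    for user, session, features in samples:
        writer.writerow([user, session] + list(features))


def read_features(handle, label_col="user"):
    """Return (labels, vectors), taking the class label from label_col."""
    labels, vectors = [], []
    for row in csv.DictReader(handle):
        labels.append(row[label_col])
        # Everything except the two leading columns is a feature value.
        vectors.append(
            [float(v) for k, v in row.items() if k not in ("user", "session")]
        )
    return labels, vectors


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO(newline="")
write_features(samples, buffer)
buffer.seek(0)
labels, vectors = read_features(buffer)
```

Passing label_col="session" to the hypothetical read_features would label each sample by its session instead, which mirrors the role the --label_col argument is described as playing.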
Dichotomy model classification
The feature vectors are loaded and transformed into a feature difference space. There are two types of feature difference vectors:
Within class: difference vectors between samples of a single user.
Between class: difference vectors between samples of different users.
The problem of authentication is mapped to a 2-class classification. An unknown sample may be labeled as either within or between-class.
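The two kinds of difference vectors might be built roughly as follows. This is a sketch under stated assumptions, not the code in the scripts: it assumes the difference is the element-wise absolute difference, and it caps the between-class space by simply stopping once a limit is reached, whereas the scripts may select vectors differently. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def abs_diff(a, b):
    """Element-wise absolute difference (assumed form of the difference)."""
    return [abs(x - y) for x, y in zip(a, b)]


def within_differences(vectors):
    """Differences over every 2-combination of one user's samples."""
    return [abs_diff(a, b) for a, b in combinations(vectors, 2)]


def between_differences(samples_by_user, max_size=None):
    """Differences between samples of different users, optionally capped."""
    diffs = []
    for user_a, user_b in combinations(sorted(samples_by_user), 2):
        for a in samples_by_user[user_a]:
            for b in samples_by_user[user_b]:
                # Crude cap: stop as soon as the limit is reached.
                if max_size is not None and len(diffs) >= max_size:
                    return diffs
                diffs.append(abs_diff(a, b))
    return diffs


# Made-up samples for two users.
samples_by_user = {
    "X": [[0.0, 1.0], [1.0, 1.0], [0.0, 3.0]],
    "Y": [[5.0, 5.0], [6.0, 5.0]],
}

within_x = within_differences(samples_by_user["X"])  # 3 vectors
between_all = between_differences(samples_by_user)  # 3 * 2 = 6 vectors
between_capped = between_differences(samples_by_user, max_size=4)
```

With n samples per user the within-class space grows as n(n-1)/2 per user, while the between-class space grows with the product of the sample counts of every pair of users, which is why a size limit matters mostly for the between-class side.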
Process for authenticating an unknown sample claiming to be X:
First, the training feature difference space is created. The within class difference vectors are created by taking the differences of every 2-combination of samples from X.
The between-class difference vectors are created by taking the differences from every sample to every other sample of different users. Since this is a very large space, the number of between-samples should be limited for large datasets.
Difference vectors are taken between the unknown sample and each of user X's samples. These need to be classified as within-class or between-class.
If user X has M samples, then there will be M unknown test difference vectors. The distance between each test vector and every train vector is taken. This results in M lists of neighbors, which are merged.
The closest neighbors to the original sample in question are now known. These are either within or between class differences. A linear weight is assigned to each difference vector, and the weights of the within-class vectors make up the classifier output score.
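The scoring steps above can be sketched as follows. Several details are assumptions made for illustration: the difference is taken as the element-wise absolute difference, the merged neighbor list is cut to the k closest entries, the linear weights run from k for the nearest neighbor down to 1, and the score is normalized by the total weight. The function names and the data are hypothetical, and the scripts may differ on any of these points.

```python
def abs_diff(a, b):
    """Element-wise absolute difference (assumed form of the difference)."""
    return [abs(x - y) for x, y in zip(a, b)]


def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def dichotomy_score(unknown, claimed_samples, within, between, k=3, p=2):
    """Score in [0, 1]; higher means the unknown sample looks more like X."""
    # One test difference vector per sample of the claimed user (M in total).
    test_diffs = [abs_diff(unknown, s) for s in claimed_samples]
    train = [(v, "within") for v in within] + [(v, "between") for v in between]

    # Distance from every test vector to every training vector, merged.
    neighbors = []
    for t in test_diffs:
        neighbors.extend((minkowski(t, v, p), label) for v, label in train)
    neighbors.sort(key=lambda pair: pair[0])
    nearest = neighbors[:k]

    # Assumed linear weighting: nearest gets k, the next k - 1, and so on.
    weights = range(len(nearest), 0, -1)
    within_weight = sum(
        w for w, (_, label) in zip(weights, nearest) if label == "within"
    )
    return within_weight / sum(weights)


# Made-up training difference vectors and samples for the claimed user X.
within = [[0.2, 0.1], [0.1, 0.1], [0.1, 0.2]]
between = [[2.0, 2.1], [1.9, 2.0], [1.8, 2.0]]
x_samples = [[1.0, 1.0], [1.2, 1.1], [1.1, 0.9]]

genuine_score = dichotomy_score([1.1, 1.0], x_samples, within, between)
impostor_score = dichotomy_score([3.0, 3.0], x_samples, within, between)
```

In this toy data the genuine sample's nearest neighbors are all within-class differences and the impostor's are all between-class, so the two scores sit at opposite ends of the range. Thresholding such a score is what would turn it into an accept or reject decision.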
The output of each experiment includes a summary in metadata.csv, graphs of the error rates and ROC curve, coordinates for the ROC curve, and the decisions for each authentication performed. By default, the script will try to save the results in an experiments folder in the user's home directory.