HTTPS SSH

README

Implementation of Mining Top-K Quantile-based Cohesive Sequential Patterns, by Len Feremans, data scientist at the University of Antwerp (Belgium) within the ADReM data labs research group.

Published in 2018 SIAM International Conference on Data Mining (SDM18).

"Finding patterns in long event sequences is an important data mining task. Two decades ago research focused on finding all frequent patterns, where the anti-monotonic property of support was used to design efficient algorithms. Recent research focuses on producing a smaller output containing only the most interesting patterns. To achieve this goal, we introduce a new interestingness measure by computing the proportion of the occurrences of a pattern that are cohesive. This measure is robust to outliers, and is applicable to sequential patterns. We implement an efficient algorithm based on constrained prefix-projected pattern growth and pruning based on an upper bound to uncover the set of top-$k$ quantile-based cohesive sequential patterns.We run experiments to compare our method with existing state-of-the-art methods for sequential pattern mining and show that our algorithm is efficient and produces qualitatively interesting patterns on large event sequences."

What is this repository for?

You can use our algorithm as an efficient way to find interesting sequential patterns in timestamped or continuous long event sequences of events (or items, or itemsets if more than even can occur at a single timpoint). Interestingness is defined using quantile-based cohesion.

Informally a sequential pattern is ranked higher, if the percentage of occurrences that are near each other, is high compare to all occurrences.

Quick start

The implementation is written in Java and uses Maven for compilation. To compile the project using Maven:

mvn clean install -DskipTests=True

To run this algorithm on a continuous sequences use. Note that QSCP can also be invoked command-line or from python.

import core.QCSP
//file with spaces to seperate tokens (can be strings), e.g. "this i a sequences file" 
File sequenceFile = new File("your_dataset_data_file.txt");
QCSP algo = new QCSP(sequenceFile, alpha, maxsize, topK);
algo.run(debug, debug_detailed);

To run this algorithm on a timestamped sequences use:

import core.QCSP_Sparse
//file sequenceFileSparse contains timestamp,item_index on each line,e.g.
//2018-01-17 15:30:16,0
//2018-01-17 15:31:16,1
File sequenceFileSparse = new File("your_dataset_data_file_sparse.txt");
//file labelsItems contains labels for each item, e.g.
//beer
//diapers
File labelsItems = new File("labels_for_items.txt")
QCSP_Sparse algo = new QCSP(sequenceFileSparse, labelsItems, long alphaInSeconds, maxsize, topK);
algo.run(debug, debug_detailed);

To run code from python, we have provided wrappers, e.g. use:

PYTHON_PATH = src/main/python

from wrapper.seq_pattern_mining import runQCSP

runQCSP(inputFile, alpha=alpha, maxsize=maxsize, topK=topK, pruningOf=False)

Description of all parameters is found in the paper, but in summary:

  • File is either in timestamped/sparse format, or a single sequence
  • PARAMETER alpha: controls size of window length, e.g. of alpha=2 this means that a pattern occurrence of size K is considered "interesting", if it's size < alpha * k.
  • PARAMETER alphaSeconds: same as alpha, but expressed in seconds, e.g. a pattern occurrence of size K is considered "interesting" if TimestampB - TimestampA < alpha * k
  • PARAMETER maxsize: maximimum size of pattern. We suggest to take small values, e.g. 3, 4 or 5 and then increase size if longer patterns are found, as it will increase runtime.
  • PARAMETER topK: number of sequential patterns to report. We suggest to take value of 100.

Status

Current status: TODO: for sparse version, compute support correct (removing overlapping timestamps) TODO: for sparse version, remove overlapping items for upper bound maxquan

More information for researchers and contributors

The current version is 1.0.0. Created may-oktober 2017, Last updated January 2018.

We depend on the following projects (for comparing with other methods): - SPMF/GoKrimp: http://www.philippe-fournier-viger.com/spmf/ - SKOPUS: https://github.com/fpetitjean/Skopus - FCI: https://bitbucket.org/len_feremans/sequencepatternmining_public

To reproduce the experiments SDM 2018:

PYTHON_PATH = src/main/python
cd experiments_sdm_2018
python run_compare.py
python run_prune_performance.py

To contribute to this project: The design of the core code (in core.QCSP and core.QCSP_Spare) is straightforward and you could extend this codebase for research.

Who do I talk to?

E-mail Len Feremans (gmail) for more information.

Licence

Copyright (c) [2017] [Universiteit Antwerpen - Len Feremans]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.