README

This is a simple text classification template built on scikit-learn (sklearn), intended to be usable by non-programmers. This page is still under construction (optimization, in particular, has not yet been addressed), but the templates can be downloaded and customized with little prior knowledge of machine learning or of the specific algorithms used.

The implementation contains 3 templates in the classificationmodules:

  1. runclassifier.py - an SVM classifier template for use when separate training and test files are available.

  2. runcrossvalidation.py - also an SVM classifier, but it runs 10-fold cross-validation. This module is designed for finding the best SVM parameters, including the cost and gamma parameters. It can also be used to optimize weights for weighted SVMs.

  3. runtpot.py - uses the TPOT package (http://www.randalolson.com/2015/11/15/introducing-tpot-the-data-science-assistant/) to identify the best classifier and parameters for a given data set.
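
The first two templates can be illustrated with a minimal sketch. It is not the code in runclassifier.py or runcrossvalidation.py; it only shows the general pattern with scikit-learn, under stated assumptions: the toy texts and labels, the TF-IDF features, the parameter grid, and the function names `build_pipeline`, `train_and_evaluate`, and `tune_with_cross_validation` are all hypothetical and chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = mentions an adverse effect, 0 = does not.
DRUGS = ["aspirin", "ibuprofen", "metformin", "sertraline", "codeine"]
TRAIN_TEXTS = (
    [f"{d} gave me a terrible headache" for d in DRUGS]
    + [f"feeling sick and dizzy after taking {d}" for d in DRUGS]
    + [f"picked up my {d} refill today" for d in DRUGS]
    + [f"my doctor says {d} is working well" for d in DRUGS]
)
TRAIN_LABELS = [1] * 10 + [0] * 10
TEST_TEXTS = [
    "naproxen gave me a terrible headache",
    "picked up my naproxen refill today",
]
TEST_LABELS = [1, 0]


def build_pipeline():
    """TF-IDF features followed by an RBF-kernel SVM (assumed setup)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("svm", SVC(kernel="rbf")),
    ])


def train_and_evaluate(train_texts, train_labels, test_texts, test_labels,
                       C=1.0, gamma="scale"):
    """Separate training and test sets: fit once, score on the test set."""
    model = build_pipeline().set_params(svm__C=C, svm__gamma=gamma)
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    return model, f1_score(test_labels, predictions, average="macro")


def tune_with_cross_validation(texts, labels, n_folds=10):
    """Grid search over cost, gamma, and class weighting with k-fold CV."""
    param_grid = {
        "svm__C": [0.1, 1, 10],
        "svm__gamma": [0.01, 0.1, 1],
        "svm__class_weight": [None, "balanced"],
    }
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    search = GridSearchCV(build_pipeline(), param_grid,
                          cv=folds, scoring="f1_macro")
    search.fit(texts, labels)
    return search.best_params_, search.best_score_


if __name__ == "__main__":
    _, test_f1 = train_and_evaluate(TRAIN_TEXTS, TRAIN_LABELS,
                                    TEST_TEXTS, TEST_LABELS)
    print("macro F1 on held-out test set:", test_f1)
    best_params, cv_f1 = tune_with_cross_validation(TRAIN_TEXTS, TRAIN_LABELS)
    print("best parameters:", best_params, "mean CV macro F1:", cv_f1)
```

Stratified folds are used here so that each of the 10 folds contains both classes, which matters for small or imbalanced data sets. The parameters found by the cross-validation step could then be passed to `train_and_evaluate` for a final run on the held-out test file.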

The template has been customized (ported) for several of our social media text classification studies:

Sarker A, O'Connor K, Ginn R, Scotch M, Smith K, Malone D, Gonzalez G. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Safety. 2016 Mar;39(3):231-40. doi: 10.1007/s40264-015-0379-4.

(data for this study available at: http://diego.asu.edu/downloads/)

Sarker A, Gonzalez G. DiegoLab16 at SemEval-2016 Task 4: Sentiment Analysis in Twitter using Centroids, Clusters, and Sentiment Lexicons. Submitted to SemEval 2016 on 5th March.

(the current implementation and classes reflect some of the modifications used for this task)

Sarker A, Gonzalez G. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training. Journal of Biomedical Informatics. 2015 Feb;53:196-207. doi: 10.1016/j.jbi.2014.11.002. Epub 2014 Nov 8.

(resources for feature extraction for this task can be found at: http://diego.asu.edu/Publications/ADRClassify.html)

Personal Note: I plan to add the task-independent modules. In the slightly longer term, I will try to provide an even easier interface for people who are not machine learning experts, since we want public health professionals and biomedical informaticists to be able to use these modules.