
Automated Training for a Machine Learning Algorithm: A Pipeline for ApicoAP

Detailed instructions for using the ApicoAP-CS software are available as a video and as a Supplementary User Manual, both of which can be found under the Wiki link.

ApicoAP Complete Suite (ApicoAP-CS) is an implementation of the ApicoAP Pipeline, which wraps the ApicoAP training routine in a pipeline where both training data gathering and the training procedure are automated. ApicoAP is a classification model that trains classifiers to identify apicoplast-targeted proteins in apicomplexan species. More information on ApicoAP can be found at https://bitbucket.org/wsu_bcb/apicoap. The ApicoAP Pipeline complements ApicoAP by automatically generating species-specific classifiers, which relieves the researcher of the burden of gathering training data.
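As a rough illustration of the idea described above, the sketch below couples a data-gathering step with a training step so that a classifier can be requested by species name alone. It is not the ApicoAP-CS client or its web service API: `ClassifierPipeline`, `toy_gather`, and `toy_train` are hypothetical names, the sequences are made-up placeholders rather than real proteins, and the lysine-fraction threshold is a stand-in for the actual ApicoAP training routine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types: a labeled example is (protein_sequence, is_targeted).
LabeledExample = Tuple[str, bool]
Classifier = Callable[[str], bool]


@dataclass
class ClassifierPipeline:
    """Couples automated data gathering with automated training."""

    gather: Callable[[str], List[LabeledExample]]  # species name -> training data
    train: Callable[[List[LabeledExample]], Classifier]

    def build(self, species: str) -> Classifier:
        # The caller supplies only a species name, never training data.
        examples = self.gather(species)
        if not examples:
            raise ValueError(f"no training data found for {species!r}")
        return self.train(examples)


def toy_gather(species: str) -> List[LabeledExample]:
    # Stand-in for querying public databases. The sequences are
    # placeholders invented for this sketch, not real proteins.
    placeholder_db = {
        "toy_species": [
            ("MKKKNKKL", True),
            ("MKNKKKIK", True),
            ("MALGSAVL", False),
            ("MSGLLAVA", False),
        ]
    }
    return placeholder_db.get(species, [])


def toy_train(examples: List[LabeledExample]) -> Classifier:
    # Stand-in for the real training routine: threshold on the fraction
    # of lysine ('K') residues, midway between the two class means.
    # Assumes both classes are present in the examples.
    def k_fraction(seq: str) -> float:
        return seq.count("K") / len(seq) if seq else 0.0

    pos = [k_fraction(s) for s, label in examples if label]
    neg = [k_fraction(s) for s, label in examples if not label]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda seq: k_fraction(seq) > threshold


pipeline = ClassifierPipeline(gather=toy_gather, train=toy_train)
classifier = pipeline.build("toy_species")
```

Because `gather` and `train` are passed in as plain callables, re-running `build` at a later date would pick up whatever data the gathering step returns at that time, which is the property the pipeline is designed around.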

ApicoAP-CS is implemented as a collection of web services, and the client software is provided here. Below is the abstract of the paper in which we discuss the ApicoAP Pipeline:

Abstract

Motivation: Supervised machine learning applications are used by life scientists for a variety of objectives including the detection of targeting sequences and the prediction of transmembrane domain topology. Expert-curated public gene and protein databases are major resources for gathering data to train these applications. While these data resources are continuously updated by the addition of new information, most machine learning algorithms are trained once at the time of their publication using data sets that are outdated not long after their introduction.

Methodology/Principal Findings: In this paper, we propose a new model of operation for specific supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure as well as the learning process is automated, one can have a system that functions as a classifier or predictor generator that does not require training data to be provided, but instead is capable of generating a model from the information available from public resources at a given time. Due to the divergence in data requirements and dataset curation procedures, the proposed model of operation is explained using a case study where an existing machine learning model, ApicoAP, is utilized in a pipeline. The ApicoAP Pipeline is capable of generating classifiers for different apicomplexan species, without requiring training data to be provided.

Conclusions/Significance: Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning applications into self-evolving learners that adapt to the ever-changing data available for genes and proteins and to develop new machine learning applications that are similarly capable. This generic idea is applied to the apicoplast-targeted protein prediction problem to create the ApicoAP Pipeline. An implementation of this pipeline as a collection of web services is available, and the client software can be found at https://bitbucket.org/wsu_bcb/apicoapcs/downloads.

Please report any bugs you encounter while using the software to gokcen.cilingir@gmail.com.

Reference: Cilingir, Gokcen, and Shira L. Broschat. "Automated Training for Algorithms That Learn from Genomic Data." BioMed Research International 2015 (2015).
