This repository contains code for executing the Gene Set Omic Analysis (GSOA) method. GSOA is described in our paper, "Inferring pathway dysregulation in cancers from multiple types of omic data," which is published in Genome Medicine. We have implemented two versions of GSOA: 1) an R package and 2) a Python + bash version. These versions are described below.
We have an R package that can be used to execute GSOA. Users can input data files, and R will read those files and process them (GSOA_ProcessFiles function). Alternatively, users can read data into R and process these objects (GSOA function). With either option, very little knowledge of R programming is required. Users can access the R package here. The following example shows how to install this package via the command line (after downloading it to a local directory).
R CMD INSTALL GSOA_0.99.9.tar.gz
Then within the R environment, you could read documentation on the functions for this package using the following commands:
Input Data Files
The required inputs are 1) data file(s) containing omic measurements for each sample, 2) a data file indicating the condition or phenotype status for each sample, and 3) a file that indicates which omic features map to which gene sets.
Data file #1 should use a simple matrix format in which columns represent samples and rows represent omic features (e.g., gene-expression measurements). This file should also contain a header row with an identifier for each sample. Each subsequent row should start with the name of the omic feature that it represents. Multiple rows per omic feature may be listed, for example, when an omic-profiling technology produces multiple data values per gene. Values on each row should be separated by tabs.
Sample1 Sample2 Sample3 Sample4
Gene1 0.523 0.991 0.421 0.829
Gene2 8.891 7.673 3.333 9.103
Gene3 4.444 5.551 6.102 0.013
It is possible to input multiple omic data files. These should be separated by commas and/or specified using wildcard characters. For example, you could specify multiple files like this:
When multiple omic files have been specified, samples that do not overlap across all the files will be excluded.
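The matrix format and the sample-overlap rule can be illustrated with a minimal Python sketch. It is not part of GSOA and is written for Python 3 (GSOA itself requires Python 2). The function names `read_omic_matrix` and `common_samples` and the second (methylation) matrix are hypothetical, introduced only for illustration.

```python
import io


def read_omic_matrix(handle):
    """Parse a tab-delimited omic matrix: a header row of sample IDs, then feature rows."""
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    # Drop empty cells so a header with or without a leading tab is accepted
    # (an assumption of this sketch, not a documented GSOA behavior).
    samples = [cell for cell in lines[0].split("\t") if cell]
    rows = []  # a list, not a dict, because a feature may appear on several rows
    for line in lines[1:]:
        cells = line.split("\t")
        feature, values = cells[0], [float(v) for v in cells[1:]]
        if len(values) != len(samples):
            raise ValueError("Row %r has %d values, expected %d"
                             % (feature, len(values), len(samples)))
        rows.append((feature, values))
    return samples, rows


def common_samples(sample_lists):
    """Return the sample IDs present in every list, in the order of the first list."""
    shared = set(sample_lists[0]).intersection(*sample_lists[1:])
    return [s for s in sample_lists[0] if s in shared]


expression = io.StringIO(
    "Sample1\tSample2\tSample3\tSample4\n"
    "Gene1\t0.523\t0.991\t0.421\t0.829\n"
    "Gene2\t8.891\t7.673\t3.333\t9.103\n"
)
# Hypothetical second omic file with a different set of samples.
methylation = io.StringIO(
    "Sample2\tSample3\tSample5\n"
    "Gene1\t0.10\t0.20\t0.30\n"
)

expr_samples, expr_rows = read_omic_matrix(expression)
meth_samples, meth_rows = read_omic_matrix(methylation)
shared = common_samples([expr_samples, meth_samples])
print(shared)  # ['Sample2', 'Sample3']
```

Only Sample2 and Sample3 appear in both matrices, so the other samples would be excluded under the overlap rule described above.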
Data file #2 should contain two columns. The first value in each row should be a sample identifier (which must correspond exactly with an identifier in data file #1), and the second value should indicate the class (e.g., condition or phenotype status) that the sample represents. This file should have no header row. Values on each row should be separated by tabs.
Sample1 Treated
Sample2 Treated
Sample3 Control
Sample4 Control
Alternatively, this can be a CLS file. If a CLS file is used, Data file #1 must be a GCT file.
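Because the sample identifiers in data file #2 must match those in data file #1 exactly, it can help to check the two files before running an analysis. The following is a minimal sketch for the two-column, tab-delimited format only (not CLS files). It is written for Python 3 and is not part of GSOA; `read_class_file` and `unmatched_samples` are hypothetical helper names.

```python
import io


def read_class_file(handle):
    """Parse a two-column, tab-delimited, header-less file into {sample_id: class_label}."""
    classes = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        cells = line.split("\t")
        if len(cells) != 2:
            raise ValueError("Line %d: expected 2 tab-separated values, found %d"
                             % (line_number, len(cells)))
        sample_id, label = cells
        if sample_id in classes:
            raise ValueError("Line %d: duplicate sample %r" % (line_number, sample_id))
        classes[sample_id] = label
    return classes


def unmatched_samples(matrix_samples, classes):
    """Return (matrix samples lacking a class label, labeled samples absent from the matrix)."""
    matrix_set = set(matrix_samples)
    missing_label = sorted(matrix_set - set(classes))
    missing_data = sorted(set(classes) - matrix_set)
    return missing_label, missing_data


class_file = io.StringIO(
    "Sample1\tTreated\nSample2\tTreated\nSample3\tControl\nSample4\tControl\n"
)
classes = read_class_file(class_file)
print(unmatched_samples(["Sample1", "Sample2", "Sample3", "Sample4"], classes))
```

Two empty lists mean that every sample in the matrix has a class label and vice versa.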
Data file #3 should be in Gene Matrix Transposed (GMT) format as used by the Molecular Signatures Database. The feature names (e.g., gene symbols or IDs) should be identical to those used in data file #1. For this format, the first value in each row is the gene-set name, the second value is a descriptor, and the remaining values are the genes associated with that gene set. This file should have no header row. Values on each row should be separated by tabs.
GeneSet1 Description... Gene1 Gene2
GeneSet2 Description... Gene2 Gene3 Gene4
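The GMT layout described above can be parsed in a few lines. The sketch below (Python 3, not part of GSOA) reads a GMT file and reports gene names that do not occur among the row identifiers of data file #1, which is one way to spot naming mismatches. The helper names `read_gmt` and `unmatched_genes` are hypothetical, and how GSOA itself treats unmatched genes is not assumed here.

```python
import io


def read_gmt(handle):
    """Parse GMT rows (set name, descriptor, genes...) into {set_name: [genes]}."""
    gene_sets = {}
    for line in handle:
        cells = line.rstrip("\n").split("\t")
        if len(cells) < 3:
            continue  # skip blank rows or rows without any genes
        name = cells[0]  # cells[1] is the free-text descriptor and is ignored here
        gene_sets[name] = [gene for gene in cells[2:] if gene]
    return gene_sets


def unmatched_genes(gene_sets, measured_features):
    """Return {set_name: [genes absent from the omic data]} for sets with any such genes."""
    measured = set(measured_features)
    report = {}
    for name, genes in gene_sets.items():
        missing = [gene for gene in genes if gene not in measured]
        if missing:
            report[name] = missing
    return report


gmt_file = io.StringIO(
    "GeneSet1\tDescription...\tGene1\tGene2\n"
    "GeneSet2\tDescription...\tGene2\tGene3\tGene4\n"
)
gene_sets = read_gmt(gmt_file)
print(gene_sets)
print(unmatched_genes(gene_sets, ["Gene1", "Gene2", "Gene3"]))  # {'GeneSet2': ['Gene4']}
```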
When executing this tool, you must specify the four required parameters to scripts/run: the paths to data files #1, #2, and #3, and the path where the output file will be stored. Optionally, you may specify the additional parameters described below.
- The number of processing cores that should be used when executing the analysis. Default: the code will automatically determine how many cores are on the computer and will use approximately 3/4 of those cores.
- For each gene set, the classification algorithm calculates a probability that each sample belongs to a given class/condition/phenotype. If a file path is specified for this parameter, that file will contain those probabilities. Default: no output file.
- GSOA performs a p-value calculation procedure using randomly selected gene sets. Use this parameter to specify how many random iterations should be used. Default: 100.
- This parameter enables the user to exclude genes from the analysis without having to remove them from the input data files. Specify a comma-separated list of gene names that coincide with the row identifiers in data file #1. Default: none.
- The user can specify the number of cross-validation folds. Default: 5. The value "n" can be specified to perform leave-one-out cross validation. In addition, if the number of samples for any class is fewer than the number of folds, leave-one-out cross validation will be used.
- By default, the Support Vector Machines (SVM) algorithm with an RBF kernel is used for classification. With this parameter, the user can specify an alternative classification algorithm. The following alternatives are currently available: svmlinear, svmpoly, svmsigmoid, naivebayes, knn, decisiontree, randomforest. Default: svmrbf. By default, the SVM algorithms use a value of 1.0 for the C parameter and 0.0 for gamma. Alternate values can be specified by suffixing the algorithm name with these parameter values. For example, to use the RBF kernel with a value of 10.0 for C and 1.0 for gamma, specify the algorithm as "svmrbf_10.0_1.0". Alternatively (and perhaps better, though more computationally intensive), specify "auto" (for example, svmrbf_auto), which uses a grid search to tune the parameters automatically.
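Three of the rules in the list above (the default core count, the fallback to leave-one-out cross validation, and the algorithm suffix syntax) can be sketched in Python 3 as follows. This is an illustration of the rules as stated, not GSOA's actual code: the function names are hypothetical, the rounding used for "approximately 3/4" is an assumption, and representing a tuning request as the string "auto" is a choice made for this sketch.

```python
import multiprocessing


def default_core_count(total_cores=None):
    """Roughly three quarters of the available cores, but never fewer than one."""
    if total_cores is None:
        total_cores = multiprocessing.cpu_count()
    # Rounding down is an assumption; the text only says "approximately 3/4".
    return max(1, int(total_cores * 0.75))


def use_leave_one_out(class_labels, folds):
    """True when folds is "n" or any class has fewer samples than the number of folds."""
    if folds == "n":
        return True
    counts = {}
    for label in class_labels:
        counts[label] = counts.get(label, 0) + 1
    return min(counts.values()) < int(folds)


def parse_algorithm(spec="svmrbf"):
    """Split e.g. "svmrbf_10.0_1.0" into (name, C, gamma); "auto" requests tuning."""
    parts = spec.split("_")
    name = parts[0]
    if not name.startswith("svm"):
        return name, None, None  # C and gamma apply only to the SVM options
    if len(parts) == 1:
        return name, 1.0, 0.0  # default C and gamma stated in the list above
    if parts[1:] == ["auto"]:
        return name, "auto", "auto"
    if len(parts) != 3:
        raise ValueError("Expected NAME_C_GAMMA or NAME_auto, got %r" % spec)
    return name, float(parts[1]), float(parts[2])


print(default_core_count(8))                      # 6
labels = ["Treated", "Treated", "Control", "Control"]
print(use_leave_one_out(labels, 5))               # True: each class has only 2 samples
print(parse_algorithm("svmrbf_10.0_1.0"))         # ('svmrbf', 10.0, 1.0)
print(parse_algorithm("svmrbf_auto"))             # ('svmrbf', 'auto', 'auto')
```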
In the Python version, the parameters must be specified in order. To use the default value, specify an empty string (""). Below are some examples.
scripts/run ExpressionData.txt ClassValues.txt c2.cp.v4.0.symbols.gmt Results.txt 8 Probabilities.txt 1000 "KRAS,HRAS,NRAS" 10 svmrbf_auto
scripts/run ExpressionData.txt ClassValues.txt c2.cp.v4.0.symbols.gmt Results.txt "" "" "" "" "" randomforest
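When launching scripts/run from another program, positional arguments with empty-string placeholders are easy to get wrong. The Python 3 sketch below builds the argument list with every optional slot filled in. It is not part of GSOA: the function name and keyword names are hypothetical, and the order of the optional arguments is inferred from the first example above (cores, probabilities file, iterations, excluded genes, folds, algorithm).

```python
import shlex


def build_run_command(data, classes, gene_sets, output, cores="", probabilities="",
                      iterations="", exclude_genes="", folds="", algorithm=""):
    """Return the argument list for scripts/run, using "" wherever a default is wanted."""
    args = ["scripts/run", data, classes, gene_sets, output,
            cores, probabilities, iterations, exclude_genes, folds, algorithm]
    return [str(arg) for arg in args]


command = build_run_command(
    "ExpressionData.txt", "ClassValues.txt", "c2.cp.v4.0.symbols.gmt", "Results.txt",
    algorithm="randomforest",
)
# Print a shell-quoted form for inspection; empty strings appear as ''.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to `subprocess.run` with the working directory set to the gsoa directory; passing a list rather than a single string avoids shell quoting problems with values such as the comma-separated gene list.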
Python / bash Version
GSOA has also been implemented as a series of Python and bash scripts that can be executed via a simple command-line interface. It can be executed on UNIX-based systems (e.g., Linux or Mac OS X). If you want to execute it on Windows (or do not want to install the prerequisite software directly on your system), you can run it within a virtual machine environment (see instructions below). Please contact us if you run into any problems.
Configuration steps for running this version as a standalone application:
Install Python 2.6+ (http://www.python.org) [Python 3 is not supported]
Install NumPy (>= 1.6.1) (http://numpy.scipy.org)
Install SciPy (>= 0.9) (http://www.scipy.org)
Install scikit-learn 0.14.1 (http://scikit-learn.org/0.14/install.html)
Install GNU Parallel (http://www.gnu.org/software/parallel/)
At a command line, enter:
git clone https://firstname.lastname@example.org/srp33/gsoa.git
Executing the analysis
From within the gsoa directory, execute "scripts/run" followed by a value for each of the parameters described below.
- Path to data file #1
- Path to data file #2
- Path to data file #3
- Path where the output file will be stored after execution
Below is an example command:
scripts/run ExpressionData.txt ClassValues.txt c2.cp.v4.0.symbols.gmt Results.txt