iso-kTSP - A modified kTSP algorithm for alternative splicing isoform analysis

version 1.0

This program runs the kTSP classification algorithm giving both intermediate 
results and a final prediction model. There are options to perform the analysis on 
alternative splicing isoforms and to test an already defined kTSP model (of genes or isoforms)
against a dataset. There are options for permutation analysis and randomization.

The command format is: 

java -jar iso-kTSP.jar <input_file> [options]

Options may be included in any order. If an option requires parameters, they must 
follow the option and be separated by spaces. Each option must be written separately; 
grouping several options behind a single dash, Unix style, is not supported.

- Mandatory argument
    input dataset filepath: the name or the path of the file with the input dataset.

- Options
   -h : prints this help. (should be the only argument)
   -i : (no parameter) the program will run in the mode for alternative splicing isoforms.
        When not included, the standard algorithm for genes is run.
        Note that this option is not required when running the program with a defined model
        with the -p option, even if the model uses isoforms and not genes. For all the other
        modes this option should be included if the program runs over isoform data.
   -o : followed by a filename, defines the output file name. When not included, 
      	the output name will be the same as the input adding "_output" at the end or 
        "_output_mod", "_output_rand" and "_output_randlabels" when run with -p, -r and -l
        options, respectively.
   -n : followed by an integer, defines the number of iterations for the cross-validation 
        step of the algorithm (and indirectly determines the size of the test portion 
	for the cross-validation). Default is 10. The integer specified by this option 
        should be positive and not greater than the size of the sample set for the
        least represented class.
   -k : followed by an integer, defines the maximum value of the variable k, k_max, for the 
      	kTSP algorithm. Default is 10. Note that the algorithm uses only odd values for 
        accuracy testing but this option accepts an even number, since it denotes a 
	maximum (e.g. defining the maximum k as 9 or 10 has exactly the same effect).
   -s : followed by an integer, defines the number of the best (gene or isoform) pairs 
      	displayed at the final step of the algorithm with their single-pair performances.	
	This number does not affect the number of pairs k_opt proposed from the cross-validation
	and can be greater than the defined maximum k_max. Default is 10.
   -c : followed by two strings separated by space, which define the suffixes in the sample 
      	names used to separate between the two classes used for classification. Default is N and T.
   -p : followed by the name or path of the file defining a (gene or isoform) kTSP model, 
      	which is tested in a prediction-only mode against the provided dataset. 
	See below for details on the format of the model file. 
   -d : (no parameter) this option can be used with -p or -r options to report for each tested sample the number
      	of correct and incorrect votes.
   -l : followed by an integer, defines the number of iterations for the permutation analysis on the labels. 
      	The program will perform the final selection step of the algorithm over this number of iterations,
        each time with a random permutation of the sample labels. For each permutation the best (gene or isoform) 
	pair and corresponding single-pair performance is reported.
   -r : followed by an integer, defines the number of random (gene or isoform) pairs to be tested in prediction-only mode
      	against the provided dataset. The integer specified should be odd and positive.
   --seed : followed by an integer, defines the seed that is used for every random step in the algorithm.
        If not present, the seed is chosen by the Java class java.util.Random, as it does when no seed is specified.

Ignored options:
      Some options specify different modes in which the program runs, and some options have no effect on specific modes.
      If the user specifies an option that is not needed for the selected mode, this option will be ignored and the
      program will continue running normally after printing a warning. Because it is impossible to run the program in
      different modes at the same time, some modes will take priority over others, and the corresponding options will be
      ignored. Here is a list of the options that are ignored for each mode.
      -p mode ignores options -n, -k, -s, -i, -l and -r.
      -r mode ignores options -n, -k, -s and -l.
      -l mode ignores options -n, -k, -s and -d.
      Normal mode (when none of the previous modes is specified) ignores option -d.
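
      The list above can be written as a small lookup table. The Python sketch below is
      only an illustration and is not the program's own code. The priority order
      (-p over -r over -l) is an assumption inferred from which mode ignores which
      option, and the names IGNORED, select_mode and ignored_options are hypothetical.

```python
# Options ignored in each mode, transcribed from the list in this section.
IGNORED = {
    "-p": {"-n", "-k", "-s", "-i", "-l", "-r"},
    "-r": {"-n", "-k", "-s", "-l"},
    "-l": {"-n", "-k", "-s", "-d"},
    "normal": {"-d"},
}


def select_mode(options):
    """Pick the mode, assuming -p takes priority over -r, and -r over -l."""
    for mode in ("-p", "-r", "-l"):
        if mode in options:
            return mode
    return "normal"


def ignored_options(options):
    """Return the given options that have no effect in the selected mode."""
    mode = select_mode(options)
    return sorted(opt for opt in options if opt in IGNORED[mode])


# With both -p and -r present, -p wins, so -r and -k are reported as ignored.
print(select_mode(["-p", "-r", "-k", "-d"]))
print(ignored_options(["-p", "-r", "-k", "-d"]))
```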

Examples of calls:
	 java -jar iso-kTSP.jar gene_seq.txt
	 java -jar iso-kTSP.jar <input_file> -o out_iso_analysis.txt -i -k 12
	 java -jar iso-kTSP.jar -o out_iso_analysis.txt /home/user/ -c tumor normal -i -n 15 -s 40 -k 4

Input format: 
      The expected format for the input dataset is a tab-separated plain text file (with any extension), 
      where the first row contains the sample labels with suffixes to differentiate between samples 
      belonging to different classes, not necessarily paired. Subsequent lines contain the "gene_id", 
      or "gene_id,isoform_id" for isoforms, in the first column followed by the sample data values 
      (in any numerical format that java can parse), in the same order as in the first row.
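
      To make the dataset layout concrete, here is a minimal Python sketch that reads such
      a file and assigns each sample to a class by its suffix. It is an illustration, not
      the program's parser. It assumes that the first row holds only the sample labels
      (no leading cell for the id column), which the description above does not state
      explicitly, and it uses Python's float() as a stand-in for Java's number parsing.
      The sample names, gene ids and values are made up.

```python
import io


def read_dataset(handle, suffixes=("N", "T")):
    """Parse a tab-separated dataset into (labels, classes, rows)."""
    labels = handle.readline().rstrip("\r\n").split("\t")
    classes = []
    for label in labels:
        # A sample belongs to the class whose suffix ends its label.
        matches = [s for s in suffixes if label.endswith(s)]
        if len(matches) != 1:
            raise ValueError(f"cannot assign a class to sample {label!r}")
        classes.append(matches[0])
    rows = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        values = [float(v) for v in fields[1:]]
        if len(values) != len(labels):
            raise ValueError(f"row {fields[0]!r} does not match the header length")
        # For isoform data the id "gene_id,isoform_id" is kept as one string.
        rows[fields[0]] = values
    return labels, classes, rows


EXAMPLE = (
    "s1_N\ts2_N\ts1_T\ts2_T\n"
    "geneA\t1.5\t2.0\t8.0\t7.5\n"
    "geneB\t6.0\t5.5\t1.0\t0.5\n"
)

labels, classes, rows = read_dataset(io.StringIO(EXAMPLE))
print(classes)
print(rows["geneB"])
```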

      The expected format for the model input file (when using the option -p) is a plain text file 
      (with any extension) that contains in each line a pair of "gene_id", or of "gene_id,isoform_id" 
      for isoforms, separated by a single whitespace character. The number of pairs in the file must be odd.
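
      A model file can be checked before use with a few lines of code. The Python sketch
      below is an illustration under stated assumptions: it skips blank lines and accepts
      any run of whitespace between the two ids, which is more permissive than the single
      separator described above. The function name read_model and the ids in the example
      are hypothetical.

```python
import io


def read_model(handle):
    """Read a kTSP model file and return its pairs as a list of (first, second)."""
    pairs = []
    for number, line in enumerate(handle, start=1):
        if not line.strip():
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected exactly two ids")
        pairs.append((parts[0], parts[1]))
    if len(pairs) % 2 == 0:
        raise ValueError("the number of pairs in a model must be odd")
    return pairs


# An isoform model with three pairs; each id has the form gene_id,isoform_id.
MODEL = "g1,i1 g2,i3\ng4,i1 g5,i2\ng6,i2 g7,i1\n"
print(read_model(io.StringIO(MODEL)))
```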

Output format:
       The output has multiple lines with different formats. Each line first gives the iteration 
       within the cross-validation (or "final" if it relates to the last steps of the algorithm, after the cross-validation), 
       followed by the type of result and the result itself:

       iteration=i kmax_pair         : at each iteration of the cross-validation, these lines give the k_max
                                       best-scoring pairs selected in the learning part of that iteration,
                                       with their scores, in ranking order.
       iteration=i k_performance     : at each iteration of the cross-validation, these lines give the results
                                       of the prediction using the top k pairs listed before, where k is odd
                                       and not greater than k_max. The performance is given as the number of
                                       true and false predictions: "Tclass1" (true class1), where class1 is
                                       the class 1 label, means that a sample was predicted to be class 1 and
                                       the prediction was right, whereas "Fclass1" means that a sample was
                                       predicted to be class 1 but the prediction was wrong; and similarly
                                       for class 2.
       final k_average_performance   : after the cross-validation, these lines give the average performance of
                                       each tested k over all iterations. k_opt, the k used for the final
                                       model, is the smallest odd k not greater than k_max that has the best
                                       average performance. The performance is calculated as the overall
                                       success rate, i.e. the proportion of true predictions
                                       (Tclass1 + Tclass2) over all the predictions made.
       final single_pair_performance : after selecting the best k from the cross-validation, the pairs (as many
                                       as set with -s) are re-scored using all the input data and the best
                                       k_opt pairs are selected for the final model. The performance of each
                                       single pair is given together with the Information Gain and the scores
                                       used for selection.
       final model_pair              : the pairs that the algorithm chooses for the final model, i.e. the top
                                       k_opt pairs from the previous list (with k_opt selected from the
                                       cross-validation).
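
       The selection of k_opt from the k_performance counts can be sketched as follows.
       This Python example is an illustration, not the program's implementation. It
       assumes that the average performance is the mean of the per-iteration success
       rates (rather than a rate computed on pooled counts), and the counts in the
       example are made up. The function names are hypothetical.

```python
def success_rate(t1, f1, t2, f2):
    """Proportion of true predictions (Tclass1 + Tclass2) over all predictions."""
    return (t1 + t2) / (t1 + f1 + t2 + f2)


def select_k_opt(counts_by_k):
    """Return (k_opt, averages) from {k: [(Tclass1, Fclass1, Tclass2, Fclass2), ...]}.

    Each k maps to one tuple of counts per cross-validation iteration.
    """
    averages = {
        k: sum(success_rate(*counts) for counts in runs) / len(runs)
        for k, runs in counts_by_k.items()
    }
    best = max(averages.values())
    # Among the k values that reach the best average, keep the smallest one.
    k_opt = min(k for k, average in averages.items() if average == best)
    return k_opt, averages


# Two iterations, three tested values of k (illustrative counts only).
counts = {
    1: [(4, 1, 3, 2), (5, 0, 4, 1)],
    3: [(5, 0, 4, 1), (5, 0, 4, 1)],
    5: [(4, 1, 5, 0), (5, 0, 4, 1)],
}
k_opt, averages = select_k_opt(counts)
print(k_opt)
```

       Here k = 3 and k = 5 reach the same average, so the smaller value is kept.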

       For the prediction-only mode (options -p and -r), the output is the single-pair performance of each of the pairs in 
       the model and the overall performance of the complete prediction model. 

       When -d option is present, the output will also contain the specific details of the prediction of each sample, 
       giving the number of correct and incorrect votes, that is, the number of pairs in the model that contribute to predict 
       correctly or incorrectly each sample.
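
       The vote counts reported with -d can be illustrated with a short Python sketch.
       It assumes, as in the standard kTSP classifier, that each pair in the model casts
       one vote and that the sample is assigned to the class with the most votes; an odd
       number of pairs rules out ties between two classes. The function names and the
       labels "N" and "T" in the example are only illustrative.

```python
from collections import Counter


def vote_details(pair_predictions, true_label):
    """Return (correct, incorrect) vote counts for one sample."""
    correct = sum(1 for label in pair_predictions if label == true_label)
    return correct, len(pair_predictions) - correct


def majority_label(pair_predictions):
    """Return the label predicted by most pairs (assumes an odd number of votes)."""
    return Counter(pair_predictions).most_common(1)[0][0]


# A three-pair model applied to one sample whose true class is "T".
votes = ["N", "T", "T"]
print(vote_details(votes, "T"))
print(majority_label(votes))
```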

       For the permutation mode, at each iteration only the best scoring pair in that iteration is reported with its single_pair_performance.

       The semantics of a pair-rule is that if the first element is lower than the second in the ranking of expression, the prediction is class1, 
       and in any other case the prediction is class2.
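
       A single pair-rule can be sketched in Python as follows. This is an illustration
       under two assumptions that the text above does not settle: that a lower position
       in the ranking corresponds to a lower expression value within the sample, so that
       comparing the values is equivalent to comparing the ranks, and that class1 is the
       first suffix given to -c ("N" by default). The function name and the ids and
       values in the example are made up.

```python
def pair_rule(sample, first, second, class1="N", class2="T"):
    """Predict the class of one sample from a single (first, second) pair.

    sample maps each gene or isoform id to its expression value in that sample.
    """
    if sample[first] < sample[second]:
        return class1
    # Any other case, including equal values, predicts class2.
    return class2


sample = {"geneA": 1.5, "geneB": 6.0, "geneC": 6.0}
print(pair_rule(sample, "geneA", "geneB"))
print(pair_rule(sample, "geneB", "geneA"))
print(pair_rule(sample, "geneB", "geneC"))
```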