iso-kTSP - A modified kTSP algorithm for alternative splicing isoform analysis
version 1.0

This program runs the kTSP classification algorithm, giving both intermediate
results and a final prediction model. There are options to perform the
analysis on alternative splicing isoforms and to test an already defined kTSP
model (of genes or isoforms) against a dataset. There are also options for
permutation analysis and randomization.

The command format is:

    java -jar iso-kTSP.jar <input_file> [options]

Options may be given in any order, but if an option requires parameters, they
must follow the option and be separated by spaces. Options must be written
separately: grouping several options behind a single dash, as in Linux/Unix
style (for example, writing -id instead of -i -d), is not supported.

- Mandatory argument

  input dataset filepath: the name or the path of the file with the input
  dataset.

- Options

  -h : prints this help. (It should be the only argument.)

  -i : (no parameter) runs the program in the mode for alternative splicing
       isoforms. When not included, the standard algorithm for genes is run.
       This option is not required when testing a defined model with the -p
       option, even if the model uses isoforms and not genes. In all other
       modes it should be included if the program runs over isoform data.

  -o : followed by a filename, defines the output file name. When not
       included, the output name is the input name with "_output" appended,
       or "_output_mod", "_output_rand" and "_output_randlabels" when run
       with the -p, -r and -l options, respectively.

  -n : followed by an integer, defines the number of iterations for the
       cross-validation step of the algorithm (and indirectly determines the
       size of the test portion for the cross-validation). Default is 10.
       The integer must be positive and not greater than the number of
       samples in the least represented class.

  -k : followed by an integer, defines the maximum value of the variable k,
       k_max, for the kTSP algorithm.
       Default is 10. Note that the algorithm uses only odd values of k for
       accuracy testing, but this option accepts an even number, since it
       denotes a maximum (e.g. defining the maximum k as 9 or 10 has exactly
       the same effect).

  -s : followed by an integer, defines the number of the best (gene or
       isoform) pairs displayed at the final step of the algorithm with
       their single-pair performances. This number does not affect the
       number of pairs k_opt proposed from the cross-validation and can be
       greater than the defined maximum k_max. Default is 10.

  -c : followed by two strings separated by a space, which define the
       suffixes in the sample names used to distinguish the two classes
       used for classification. Default is N and T.

  -p : followed by the name or path of the file defining a (gene or isoform)
       kTSP model, which is tested in prediction-only mode against the
       provided dataset. See below for details on the format of the model
       file.

  -d : (no parameter) can be used with the -p or -r options to report, for
       each tested sample, the number of correct and incorrect votes.

  -l : followed by an integer, defines the number of iterations for the
       permutation analysis on the labels. The program performs the final
       selection step of the algorithm over this number of iterations, each
       time with a random permutation of the sample labels. For each
       permutation, the best (gene or isoform) pair and the corresponding
       single-pair performance are reported.

  -r : followed by an integer, defines the number of random (gene or
       isoform) pairs to be tested in prediction-only mode against the
       provided dataset. The integer must be odd and positive.

  --seed : followed by an integer, defines the seed used for every random
       step in the algorithm. If not present, the seed is chosen as the
       Java class java.util.Random does when it is constructed without a
       seed.

Ignored options:

Some options specify different modes in which the program runs, and some
options have no effect on specific modes.
If the user specifies an option that is not needed for the selected mode, the
option is ignored and the program continues running normally after printing a
warning. Because the program cannot run in several modes at the same time,
some modes take priority over others, and the options of the lower-priority
modes are ignored. The options ignored in each mode are:

  -p mode ignores options -n, -k, -s, -i, -l and -r.
  -r mode ignores options -n, -k, -s and -l.
  -l mode ignores options -n, -k, -s and -d.
  Normal mode (when none of the previous modes is specified) ignores
  option -d.

Examples of calls:

    java -jar iso-kTSP.jar gene_seq.txt
    java -jar iso-kTSP.jar -o out_iso_analysis.txt -i -k 12 iso_data.tab
    java -jar iso-kTSP.jar -o out_iso_analysis.txt /home/user/iso_data.tab -c tumor normal -i -n 15 -s 40 -k 4

Input format:

The expected format for the input dataset is a tab-separated plain text file
(with any extension). The first row contains the sample labels, with suffixes
that distinguish the samples belonging to each class; the samples are not
necessarily paired. Subsequent lines contain the "gene_id", or
"gene_id,isoform_id" for isoforms, in the first column, followed by the sample
data values (in any numerical format that Java can parse) in the same order
as in the first row.

The expected format for the model input file (when using the -p option) is a
plain text file (with any extension) that contains in each line a pair of
"gene_id", or of "gene_id,isoform_id" for isoforms, separated by a single
whitespace. The number of pairs in the file must be odd.

Output format:

The output has multiple lines with different formats.
Each line first gives the iteration within the cross-validation (or "final"
for the last steps of the algorithm, after the cross-validation), then the
type of result, and then the result itself:

  iteration=i kmax_pair : at each iteration of the cross-validation, these
       lines give the k_max best-scoring pairs selected in the learning part
       of that iteration, with their scores, in ranking order.

  iteration=i k_performance : at each iteration of the cross-validation,
       these lines give the results of the prediction using the top k pairs
       listed before, where k is odd and not greater than k_max. The
       performance is given as the number of true and false predictions:
       "Tclass1" (true class1), where class1 is the class 1 label, means
       that a sample was predicted to be class 1 and the prediction was
       right, whereas "Fclass1" means that a sample was predicted to be
       class 1 but the prediction was wrong; and similarly for class 2.

  final k_average_performance : after the cross-validation, these lines
       give the average performance for each tested k over all iterations.
       k_opt, the value selected for the final model, is the smallest odd k
       not greater than k_max that has the best average performance. The
       performance is calculated as the overall success rate, i.e. the
       proportion of true predictions (Tclass1 + Tclass2) over all the
       predictions made.

  final single_pair_performance : after selecting the best k from the
       cross-validation, the best pairs (as many as defined with -s) are
       re-scored using all the input data, and the best k_opt pairs are
       selected for the final model. The performance of each single pair is
       given together with the Information Gain and the scores used for
       selection.

  final model_pair : the pairs that the algorithm chooses for the final
       model, i.e. the top k_opt pairs from the previous list (with k_opt
       selected from the cross-validation).
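The selection of k_opt described under "final k_average_performance" can be
sketched as follows. This is a minimal illustration in Python, not the
program's own code: the function names, the input layout (a dict mapping each
tested k to its per-iteration success rates) and the numbers are made up for
this example, and it assumes that the average is taken over the per-iteration
success rates.

```python
from statistics import mean


def success_rate(t_class1, t_class2, f_class1, f_class2):
    """Overall success rate: true predictions over all predictions made."""
    total = t_class1 + t_class2 + f_class1 + f_class2
    return (t_class1 + t_class2) / total


def choose_k_opt(rates_by_k):
    """Return the smallest odd k with the highest average success rate.

    rates_by_k maps each tested k to a list with one success rate per
    cross-validation iteration (hypothetical layout for this sketch).
    """
    averages = {k: mean(r) for k, r in rates_by_k.items() if k % 2 == 1}
    best = max(averages.values())
    # Ties are resolved in favour of the smallest k.
    return min(k for k, avg in averages.items() if avg == best)


# Made-up success rates for two cross-validation iterations.
rates_by_k = {
    1: [0.50, 0.75],
    3: [0.75, 1.00],
    5: [1.00, 0.75],
}
k_opt = choose_k_opt(rates_by_k)
print(k_opt)
```

In this example k = 3 and k = 5 reach the same average, so the smaller value,
3, is selected.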
For the prediction-only mode (options -p and -r), the output is the
single-pair performance of each of the pairs in the model and the overall
performance of the complete prediction model. When the -d option is present,
the output also contains the details of the prediction for each sample,
giving the number of correct and incorrect votes, that is, the number of
pairs in the model that contribute to predicting the sample correctly or
incorrectly.

For the permutation mode, at each iteration only the best-scoring pair in
that iteration is reported, with its single_pair_performance.

The semantics of a pair rule is that if the first element is lower than the
second in the ranking of expression, the prediction is class1, and in any
other case the prediction is class2.
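The pair rule and the per-sample votes reported with -d can be illustrated
with a short Python sketch. It is not the program's code and rests on stated
assumptions: the final call is taken as the unweighted majority vote of the
standard kTSP algorithm, a lower rank is taken to mean a lower expression
value (so values are compared directly), class1 is taken to be the default
suffix N and class2 the default suffix T, and the function names, feature ids
and values are hypothetical.

```python
def pair_vote(values, first, second, class1="N", class2="T"):
    """Apply one pair rule to one sample.

    values maps feature ids to the expression values of a single sample.
    The vote is class1 if the first element is lower than the second,
    and class2 in any other case (including ties).
    """
    return class1 if values[first] < values[second] else class2


def predict(values, model, class1="N", class2="T"):
    """Return (predicted label, votes for class1, votes for class2)."""
    votes = [pair_vote(values, a, b, class1, class2) for a, b in model]
    n1 = votes.count(class1)
    n2 = len(votes) - n1
    # The model holds an odd number of pairs, so the vote cannot tie.
    label = class1 if n1 > n2 else class2
    return label, n1, n2


# Hypothetical sample and three-pair model.
sample = {"geneA": 2.0, "geneB": 5.0, "geneC": 7.5,
          "geneD": 1.0, "geneE": 3.0, "geneF": 3.0}
model = [("geneA", "geneB"), ("geneC", "geneD"), ("geneE", "geneF")]

label, votes_class1, votes_class2 = predict(sample, model)
print(label, votes_class1, votes_class2)
```

Here only the first pair votes for N, so the sample is predicted as T with two
votes against one. If the sample really belongs to class T, this corresponds
to two correct votes and one incorrect vote.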