ELSA Manual

Input Format (check_data)

Please run check_data first to check whether your data file is compatible with ELSA. Transferring text files among Mac, Linux and Windows can easily corrupt line endings (here is the reason: https://en.wikipedia.org/wiki/Newline), so always do this before running lsa_compute.
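If check_data complains about a file that looks fine in your editor, mixed line endings are the usual culprit. A minimal sketch for normalizing them before running check_data (the filenames are placeholders, not part of ELSA):

```python
# Normalize Windows (\r\n) and classic-Mac (\r) line endings to Unix (\n).
# The paths are placeholder names; substitute your own files.
def normalize_newlines(path, out_path):
    with open(path, "rb") as f:
        raw = f.read()
    clean = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(out_path, "wb") as f:
        f.write(clean)
```

The same effect can be had with tools like dos2unix if they are available on your system.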

usage: check_data [-h] dataFile repNum spotNum

Auxillary tool to new LSA package for checking data format

positional arguments:
  dataFile    the data file
  repNum      replicates number
  spotNum     timepoints number

optional arguments:
  -h, --help  show this help message and exit

The input has to be a tab delimited matrix file, for example the following one:

#F3T4R2  t1r1  t1r2  t2r1  t2r2  t3r1  t3r2  t4r1  t4r2
f1       na    2     3     0     na    1     3     5
f2       10    na    na    3     na    9     3     3
f3       -2    -4    na    1     na    0     1     1

For this example file, spotNum=4 and repNum=2.

Each column is one replicate from one time point: t1r1 is replicate one from time point one. Each row is a factor: f1 is factor one.

The top-left cell can contain anything, but it must start with '#'. 'na' is reserved for missing values. You may want to note the number of factors, time spots and replicates, some of which are needed for executing the program.

If you are using Excel to prepare the input file, remember to remove any leading and trailing empty rows, columns or cells. Make the table a true rectangle, not just a visually rectangular one! That is all for the input.
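The checks above can be sketched in a few lines. This is a minimal illustration of what a valid input looks like, not the actual check_data code: it verifies the '#' header, a rectangular shape of 1 + repNum * spotNum columns, and that every entry is either 'na' or a number.

```python
# A minimal sketch of input-format checks (not the actual check_data code).
def check_matrix(path, rep_num, spot_num):
    with open(path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    header, data = rows[0], rows[1:]
    assert header[0].startswith("#"), "top-left cell must start with '#'"
    expected = 1 + rep_num * spot_num   # label column + r*s value columns
    for row in [header] + data:
        assert len(row) == expected, "table is not a true rectangle"
    for row in data:
        for cell in row[1:]:
            if cell != "na":
                float(cell)             # raises ValueError if not numeric
    return len(data)                    # number of factors
```

Running this on the example table above with rep_num=2 and spot_num=4 passes and reports 3 factors.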

Computation (lsa_compute)

lsa_compute (rev: v1.0.2@GIT: fd167ef) - copyright Li Charlie Xia, lixia@stanford.edu
usage: lsa_compute [-h] [-e EXTRAFILE] [-d DELAYLIMIT] [-m MINOCCUR]
                   [-p {perm,theo,mix}] [-x PRECISION]
                   [-b {0,100,200,500,1000,2000}] [-r REPNUM] [-s SPOTNUM]
                   [-t {simple,SD,Med,MAD}]
                   [-f {none,zero,linear,quadratic,cubic,slinear,nearest}]
                   [-n {percentile,percentileZ,pnz,robustZ,rnz,none}]
                   [-q {scipy}] [-T TRENDTHRESH] [-a APPROXVAR]
                   [-v PROGRESSIVE]
                   dataFile resultFile

positional arguments:
  dataFile              the input data file, m by (r * s)tab delimited text;
                        top left cell start with '#' to mark this is the
                        header line; m is number of variables, r is number of
                        replicates, s it number of time spots; first row:
                        #header s1r1 s1r2 s2r1 s2r2; second row: x ?.?? ?.??
                        ?.?? ?.??; for a 1 by (2*2) data
  resultFile            the output result file

optional arguments:
  -h, --help            show this help message and exit
  -e EXTRAFILE, --extraFile EXTRAFILE
                        specify an extra datafile, otherwise the first
                        datafile will be used and only lower triangle entries
                        of pairwise matrix will be computed
  -d DELAYLIMIT, --delayLimit DELAYLIMIT
                        specify the maximum delay possible, default: 0, must
                        be an integer >=0 and <spotNum
  -m MINOCCUR, --minOccur MINOCCUR
                        specify the minimum occurence percentile of all times,
                        default: 50,
  -p {perm,theo,mix}, --pvalueMethod {perm,theo,mix}
                        specify the method for p-value estimation, default:
                        pvalueMethod=perm, i.e. use permutation theo:
                        theoretical approximaton; if used also set -a value.
                        mix: use theoretical approximation for pre-screening
                        if promising (<0.05) then use permutation.
  -x PRECISION, --precision PRECISION
                        permutation/precision, specify the permutation number
                        or precision=1/permutation for p-value estimation.
                        default is 1000, must be an integer >0
  -b {0,100,200,500,1000,2000}, --bootNum {0,100,200,500,1000,2000}
                        specify the number of bootstraps for 95% confidence
                        interval estimation, default: 100, choices: 0, 100,
                        200, 500, 1000, 2000. Setting bootNum=0 avoids
                        bootstrap. Bootstrap is not suitable for non-
                        replicated data.
  -r REPNUM, --repNum REPNUM
                        specify the number of replicates each time spot,
                        default: 1, must be provided and valid.
  -s SPOTNUM, --spotNum SPOTNUM
                        specify the number of time spots, default: 4, must be
                        provided and valid.
  -t {simple,SD,Med,MAD}, --transFunc {simple,SD,Med,MAD}
                        specify the method to summarize replicates data,
                        default: simple, choices: simple, SD, Med, MAD NOTE:
                        simple: simple averaging SD: standard deviation
                        weighted averaging Med: simple Median MAD: median
                        absolute deviation weighted median;
  -f {none,zero,linear,quadratic,cubic,slinear,nearest}, --fillMethod {none,zero,linear,quadratic,cubic,slinear,nearest}
                        specify the method to fill missing, default: none,
                        choices: none, zero, linear, quadratic, cubic,
                        slinear, nearest operation AFTER normalization: none:
                        fill up with zeros ; operation BEFORE normalization:
                        zero: fill up with zero order splines; linear: fill up
                        with linear splines; slinear: fill up with slinear;
                        quadratic: fill up with quadratic spline; cubic: fill
                        up with cubic spline; nearest: fill up with nearest
                        neighbor
  -n {percentile,percentileZ,pnz,robustZ,rnz,none}, --normMethod {percentile,percentileZ,pnz,robustZ,rnz,none}
                        must specify the method to normalize data, default:
                        robustZ, choices: percentile, none, pnz, percentileZ,
                        robustZ or a float NOTE: percentile: percentile
                        normalization, including zeros (only with perm) pnz:
                        percentile normalization, excluding zeros (only with
                        perm) percentileZ: percentile normalization +
                        Z-normalization rnz: percentileZ normalization +
                        excluding zeros + robust estimates (theo, mix, perm
                        OK) robustZ: percentileZ normalization + robust
                        estimates (with perm, mix and theo, and must use this
                        for theo and mix, default)
  -q {scipy}, --qvalueMethod {scipy}
                        specify the qvalue calculation method, scipy: use
                        scipy and storeyQvalue function, default
  -T TRENDTHRESH, --trendThresh TRENDTHRESH
                        if trend series based analysis is desired, use this
                        option NOTE: when this is used, must also supply
                        reasonble values for -p, -a, -n options
  -a APPROXVAR, --approxVar APPROXVAR
                        if use -p theo and -T, must set this value
                        appropriately, precalculated -a {1.25, 0.93, 0.56,0.13
                        } for i.i.d. standard normal null and -T {0, 0.5, 1,
                        2} respectively. For other distribution and -T values,
                        see FAQ and Xia et al. 2013 in reference
  -v PROGRESSIVE, --progressive PROGRESSIVE
                        specify the number of progressive output to save
                        memory, default: 0, 2G memory is required for 1M
                        pairwise comparison.

So we can analyze the above example file by:

  lsa_compute ../test/testna.txt ../test/testna.lsa -r 2 -s 4 -d 1

eLSA will take ../test/testna.txt as input, knowing it has 4 time spots with 2 replicates each, and analyze it with a maximum delay of 1 time unit. The output file is explained below.

Output

X	Y	LS	lowCI	upCI	Xs	Ys	Len	Delay	P	PCC	Ppcc	SPCC	Pspcc	Dspcc	SCC	Pscc	SSCC	Psscc	Dsscc	Q	Qpcc	Qspcc	Qscc	Qsscc	Xi	Yi
f1	f2	-0.349677	-0.349677	-0.349677	3	3	2	0	0.316000	-0.512027	0.487973	0.520401	0.651565	1	-0.210819	0.789181	0.500000	0.666667	1	0.451429	0.731960	0.782394	1.000000	1.000000	1	2
f1	f3	0.349677	0.349677	0.349677	3	3	2	0	0.675000	0.217606	0.782394	0.217606	0.782394	0	0.210819	0.789181	-0.500000	0.666667	1	0.642857	0.782394	0.782394	1.000000	1.000000	1	3
f2	f3	-1.125007	-1.125007	-1.125007	1	1	4	0	0.215000	-0.827992	0.172008	0.991241	0.084323	-1	-1.000000	nan	-1.000000	nan	0	0.451429	0.516024	0.252970	nan	nan	2	3
  • X: factor name X
  • Y: factor name Y
  • LS: Local Similarity Score
  • lowCI/upCI: lower/upper bound of the 95% confidence interval for LS
  • Xs: alignment start position in X
  • Ys: alignment start position in Y
  • Len: alignment length
  • Delay: calculated delay of the alignment, Xs-Ys
  • P,Q: p/q-value for LS
  • PCC,Ppcc,Qpcc: Pearson's Correlation Coefficient, p/q-value for PCC
  • SCC,Pscc,Qscc: Spearman's Correlation Coefficient, p/q-value for SCC
  • SPCC,Pspcc,Qspcc,Dspcc: delay-Shifted Pearson's Correlation Coefficient, p/q-value, delay size for SPCC
  • SSCC,Psscc,Qsscc,Dsscc: delay-Shifted Spearman's Correlation Coefficient, p/q-value, delay size for SSCC
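Since the result file is plain tab-delimited text with the column names listed above, it is easy to post-process with the standard library. A sketch that filters pairs by their LS q-value (the 0.05 cutoff is an illustrative choice, not an ELSA default):

```python
# Sketch: filter significant pairs from an eLSA result file by Q-value.
# The result file is tab-delimited with a header row (X, Y, LS, Q, ...).
import csv

def significant_pairs(result_file, q_cutoff=0.05):
    with open(result_file) as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(r["X"], r["Y"], float(r["LS"]), float(r["Q"]))
                for r in reader if float(r["Q"]) <= q_cutoff]
```

Rows with 'nan' in columns you filter on would need an extra guard; the example output above has 'nan' only in the SCC/SSCC q-value columns.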

Speed Up (par_ana)

You can use par_ana.py and ssa.py to speed up your analysis through parallelism on high-performance computing clusters.

Running "par_ana -h" tells you how to use the script for computing. For the singleCmd argument, take your normal single-line lsa_compute command and replace its input and output files with %s placeholders. The actual input and output are instead supplied through the multiInput and multiOutput arguments. In the examples below, the input is ARISA20.txt and the output is ARISA20.lsa.
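Conceptually, par_ana splits the input, fills the two %s placeholders in singleCmd with each chunk's input and output file, and submits one job per chunk. A rough sketch of the substitution step (the chunk file names here are hypothetical, not par_ana's actual naming scheme):

```python
# Conceptual sketch (not par_ana's actual code) of how the two %s
# placeholders in singleCmd are filled per split chunk.
single_cmd = "lsa_compute %s %s -e ARISA20.txt -s 127 -r 1 -p theo"
chunks = [("ARISA20.txt.0", "ARISA20.lsa.0"),   # hypothetical split names
          ("ARISA20.txt.1", "ARISA20.lsa.1")]
jobs = [single_cmd % (inp, out) for inp, out in chunks]
```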

Example: par_ana ARISA20.txt ARISA20.lsa 'lsa_compute %s %s -e ARISA20.txt -s 127 -r 1 -p theo' $PWD
Example: par_ana ARISA20.txt ARISA20.la 'la_compute %s ARISA20.laq %s -s 127 -r 1 -p 1000' $PWD
vmem= 2000mb
usage: par_ana [-h] [-d DRYRUN] multiInput multiOutput singleCmd workDir

Multiline Input Split and Combine Tool for LSA and LA

positional arguments:
  multiInput            the multiline input file
  multiOutput           the multiline output file
  singleCmd             single line command line in quotes
  workDir               set current working directory

optional arguments:
  -h, --help            show this help message and exit
  -d DRYRUN, --dryRun DRYRUN
                        generate pbs only

par_ana will use ssa.py to submit the pbs jobs to batch system.

usage: ssa.py [-h] pbsFile

MCB Queue Checking and Submission Tool

positional arguments:
  pbsFile     single pbs file to be submitted

optional arguments:
  -h, --help  show this help message and exit

Put ssa.py (shipped as elsa_pkg/lsa/ssa.py) into your path and set the queue parameters inside the script correctly.

Example: you have 63 cores with 300 GB of memory in the queue "main" and your username is "user".

core_max=63
mem_max=300
uname="user"
qname="main"

FAQ

Wondering: 1. whether to use permutation or theoretical p-values? 2. which normalization to choose?

For these or any other doubts, first refer to the FAQ.

Have fun!
