Wiki

Clone wiki

SVEngine / Manual

Manual

0. $SVENGINEPKG environmental variable

$SVENGINEPKG should be set to the path where you unzip or cloned the svengine package. Below is where you can locate the original script or example file within $SVENGINEPKG.

1. Short manuals for Python and R scripts are also available with '-h' option

$SVENGINEPKG/svengine/mf/mutforge.py.

The command implements a parallelized algorithm that divides template genome into trunks of contigs, spikes structural variants into contigs, samples short reads from the altered contigs, and finally merges back the short-read sets into one and performs the alignment.

usage: mutforge.py [-h] [-m METAFILE] [-v VARFILE] [-b BURNIN] [-e TRUNKSIZE]
                   [-o OPREFIX] [-y NPLOIDY] [-f PLANSIZE] [-g NLIGATION]
                   [-n NPROCS] [-r MERGEMAX] [-d TMPDIR] [-s SKIPFILE]
                   [-t TARGETFILE] [-x {fasta,fastq,bam}] [-c CHOOSESEED]
                   [--layout] [--debug]
                   hapfiles parfile reffile

spike-in mutations to fasta/bam files, for bam files first assemble haplo-
types for fasta files we segment fasta files to confirm if the configuration
is viable

positional arguments:
  hapfiles              inputfile of hap file(s), required
  parfile               inputfile of library-wise parameters, required
  reffile               inputfile of ref file(s), required

optional arguments:
  -h, --help            show this help message and exit
  -m METAFILE, --metafile METAFILE
                        inputfile of meta specification
  -v VARFILE, --varfile VARFILE
                        inputfile of variant specification
  -b BURNIN, --burnin BURNIN
                        burnin is the chromosome tip skip size for genome
                        planning to avoid tip padding Ns, default: 1100000 bp
  -e TRUNKSIZE, --trunksize TRUNKSIZE
                        trunksize is the parallelized compartment size of non-
                        var region in simulation, default: 10000000 bp
  -o OPREFIX, --oprefix OPREFIX
                        prefix for output files, default: $varfile or
                        $metafile
  -y NPLOIDY, --nploidy NPLOIDY
                        genome ploidity default= 2 ploid
  -f PLANSIZE, --plansize PLANSIZE
                        plansize + ligation*2 is the grain size for variant
                        planning, recommend plansize <
                        (genomesize/numer_of_variant/100), default: 100000 bp)
  -g NLIGATION, --nligation NLIGATION
                        genome trunk ligation, recommended
                        lib_is_max+2*lib_is_sd_max (default: 500 bp)
  -n NPROCS, --nprocs NPROCS
                        split into multiple processes (default: 1 )
  -r MERGEMAX, --mergemax MERGEMAX
                        max number of bamfiles in one iteration of samtool
                        merging (default: 4000 )
  -d TMPDIR, --tmpdir TMPDIR
                        root dir for keeping temporary files, default (last 4
                        digits random): /Users/charlie/svetmp_gpWT
  -s SKIPFILE, --skipfile SKIPFILE
                        inputfile of regions to skip, e.g. gap regions in bed
                        format
  -t TARGETFILE, --targetfile TARGETFILE
                        inputfile of regions to target, for example Exome
                        region in bed format
  -x {fasta,fastq,bam}, --chooseOutput {fasta,fastq,bam}
                        choose the type of output of mutforge, (default:
                        fastq)
  -c CHOOSESEED, --chooseSeed CHOOSESEED
                        choose the seed for python random function, (default:
                        None bp)
  --layout              dry run to layout
  --debug               debug run

$SVENGINEPKG/svengine/mf/tree2var.py.

The tree2var command implements a procedure that determines variant fractions from an input phylogeny tree based on Equations (1) and (2) and a depth first search graph algorithm, and then substitutes these allele fractions in an input VAR file. (see SVEngine publication).

usage: tree2var.py [-h] [-o OPREFIX] [-p] varfile nwkfile

assign tree structures to VAR files

positional arguments:
  varfile               inputfile of SVEngine's VAR format, required
  nwkfile               inputfile of Newick format where the internal nodes
                        are named and matches to MID used in varfile, required

optional arguments:
  -h, --help            show this help message and exit
  -o OPREFIX, --oprefix OPREFIX
                        prefix for output files, default is the same as
                        varfile's prefix yet change suffix to .vtr
  -p, --plottree        generate a text plot of a evolution phylogenetic tree

2. Short manuals for text based input are also found within the commented lines of the file

$SVENGINEPKG/svengine/test/example.var.

The VAR format is intended for specifying exact information for individual variants, which includes variant id, parent id (if it is part of a complex event, e.g., the deletion part within a translocation), fraction, ploidy, chromosome, starting position, and the sequence length to be deleted and/or the sequence content to be inserted.

#This example VAR file specified some prototype structural variants shown in the example feature track plots
#Acronyms: Deletion(DEL), Insertion(INS), Inversion(INV), Tandem Duplication(DUP), Translocation(TRA)
#Acronyms: Domestic INS(DINS), Foreign INS(FINS), Inverted TRA(ITRA), Inverted DUP(IDUP),
#VID: Variant ID for haploid-wise element variants, one unique id for each row
#MID: Mega ID for mega variants, e.g. a translocation with have one MID and two VIDs for del and ins parts.
#VARFREQ: haploid-wise variant allele frequency
#VARHAP: variant affect which haploid, 0-based, 0=first haploid
#VARCHR: variant affect which chromosome
#VARDEL: if the variant carrying a deletion from its starting position, True/False only
#VARDELSIZE: the size of the carried deletion. if no deletion, always 0
#VARINS: if the variant carrying a insertion from its starting position, True/False only
#VARINSSEQ: the content of the carried insertion. if no insertion, always None.
#           (HAP/SEQFILE,CHR:START-END,COPY,REVCOMP) - format to specify insertion content, e.g.
#           example.fas,fins1:1-1200,1,f: says the sequence is from '1-1200bp' of 'fins1' contig in 'example.fas'.
#                                         it is duplicated 1 times while it is not reverse complemented.
#VID    MID     VARFREQ VARHAP  VARCHR  VARPOS  VARDEL  VARDELSIZE      VARINS  VARINSSEQ(HAP/SEQFILE,CHR:START-END,COPY,REVCOMP)
VAR_1   FINS_1  0.5     0       1       1038000 False   0       True    example.fas,fins1:1-1200,1,f
VAR_2   FINS_1  0.5     1       1       1038000 False   0       True    example.fas,fins1:1-1200,1,f
VAR_3   FINS_2  0.5     0       1       1142000 False   0       True    example.fas,fins2:1-1200,1,f
VAR_4   FINS_2  0.5     1       1       1142000 False   0       True    example.fas,fins2:1-1200,1,f
VAR_5   ITRA_1  0.5     0       2       102000  True    1000    False   None
VAR_6   ITRA_1  0.5     1       2       102000  True    1000    False   None
VAR_7   ITRA_1  0.5     0       1       1246000 False   0       True    example.fna,2:102001-103000,1,r
VAR_8   ITRA_1  0.5     1       1       1246000 False   0       True    example.fna,2:102001-103000,1,r
VAR_9   ITRA_2  0.5     0       1       1350000 True    1000    False   None
VAR_10  ITRA_2  0.5     1       1       1350000 True    1000    False   None
VAR_11  ITRA_2  0.5     0       2       206000  False   0       True    example.fna,1:1350001-1351000,1,r
VAR_12  ITRA_2  0.5     1       2       206000  False   0       True    example.fna,1:1350001-1351000,1,r
VAR_13  DEL_1   1.0     0       1       1454000 True    1000    False   None
VAR_14  DEL_1   1.0     1       1       1454000 True    1000    False   None
VAR_15  DEL_2   1.0     0       2       310000  True    2000    False   None
VAR_16  DEL_2   1.0     1       2       310000  True    2000    False   None
VAR_17  DEL_3   1.0     0       2       414000  True    2000    False   None
VAR_18  DEL_3   1.0     1       2       414000  True    2000    False   None
VAR_19  DEL_4   1.0     0       2       518000  True    500     False   None
VAR_20  DEL_4   1.0     1       2       518000  True    500     False   None
VAR_21  TRA_1   1.0     0       2       622000  True    1000    False   None
VAR_22  TRA_1   1.0     1       2       622000  True    1000    False   None
VAR_23  TRA_1   0.5     0       2       726000  False   0       True    example.fna,2:622001-623000,1,f
VAR_24  TRA_1   0.5     1       2       726000  False   0       True    example.fna,2:622001-623000,1,f
VAR_25  TRA_2   1.0     0       2       830000  True    1000    False   None
VAR_26  TRA_2   1.0     1       2       830000  True    1000    False   None
VAR_27  TRA_2   0.5     0       2       934000  False   0       True    example.fna,2:830001-831000,1,f
VAR_28  TRA_2   0.5     1       2       934000  False   0       True    example.fna,2:830001-831000,1,f
VAR_29  INV_1   0.5     0       2       1038000 True    1000    True    example.fna,2:1038001-1039000,1,r
VAR_30  INV_1   0.5     1       2       1038000 True    1000    True    example.fna,2:1038001-1039000,1,r
VAR_31  INV_2   0.5     0       2       1142000 True    1000    True    example.fna,2:1142001-1143000,1,r
VAR_32  INV_2   0.5     1       2       1142000 True    1000    True    example.fna,2:1142001-1143000,1,r
VAR_33  IDUP_1  0.5     0       2       1246000 True    1000    True    example.fna,2:1246001-1247000,2,r
VAR_34  IDUP_1  0.5     1       2       1246000 True    1000    True    example.fna,2:1246001-1247000,2,r
VAR_35  IDUP_2  0.5     0       2       1350000 True    1000    True    example.fna,2:1350001-1351000,2,r
VAR_36  IDUP_2  0.5     1       2       1350000 True    1000    True    example.fna,2:1350001-1351000,2,r
VAR_37  DINS_1  0.5     0       2       1558000 False   0       True    example.fna,2:1454001-1455000,1,f
VAR_38  DINS_1  0.5     1       2       1558000 False   0       True    example.fna,2:1454001-1455000,1,f
VAR_39  DINS_2  0.5     0       2       1766000 False   0       True    example.fna,2:1662001-1663000,1,f
VAR_40  DINS_2  0.5     1       2       1766000 False   0       True    example.fna,2:1662001-1663000,1,f
VAR_41  DUP_1   0.5     0       2       1870000 True    1000    True    example.fna,2:1870001-1871000,2,f
VAR_42  DUP_1   0.5     1       2       1870000 True    1000    True    example.fna,2:1870001-1871000,2,f
VAR_43  DUP_2   0.5     0       2       1974000 True    1000    True    example.fna,2:1974001-1975000,2,f
VAR_44  DUP_2   0.5     1       2       1974000 True    1000    True    example.fna,2:1974001-1975000,2,f

$SVENGINEPKG/svengine/test/example.meta.

The META format is intended for an easy higher-level control, allowing one to specify a desired meta distribution of variants, including variant type, and total number of events, size, allele fraction, and ploidy distributions per type. It is also possible to specify where and how to sample insertion sequence in the case of foreign insertion. For example, a user can easily request 100 deletions of size ranging from 100bp to 10kbp, of uniform allelic fraction distribution and fair coin Bernoulli distribution of homo- and heterozygosity by one line of test in the META file. These parameters can be stipulated in a single line.

#METATYPE:  acronym used to specify a structural variant class:
#           Acronyms: Deletion(DEL), Insertion(INS), Inversion(INV), Tandem Duplication(DUP), Translocation(TRA)
#           Acronyms: Domestic INS(DINS), Foreign INS(FINS), Inverted TRA(ITRA), Inverted DUP(IDUP)
#           the same variant class name can appear in multiple rows with different configuration
#COUNT:     total number of the variant to be spiked-in of the varaint class defined in this row
#SIZEDIST/SEQ: length of the sequence to be deleted/inserted if domestic insertion.
#              Or a filename containing contigs to be randomly sampled for foreign insertion.
#              The length could be equal prob multinormial multi(size1, size2, ...) : fix_size1_size2_...
#              one can specify any size multiple times to increase its proportion.
#              Or other supported distribution with parameters: dist_par1_par2_...
#              e.g. nori_mean_sd: normal distribution rounded to integer
#                   norf_mean_sd: normal distribution
#                   unii_low_up: uniform distribution of integer from low to up
#                   unif_low_up: uniform distribution
#                   expi_rate: exponential distribution rounded to integer
#                   expf_rate: exponential distribution
#COPYDIST:  numpy of times that the insertion sequence to be duplicated
#           supports applicable distribution as string encoding as SIZEDIST field
#DONORDIST: a donor is the segment of haploid content marked and removed to be changed by the variant
#           e.g. a deletion donates a segment without a reception
#           e.g. a inversion donates a segment and receives a reverse complemented segment
#           e.g. a tandem duplication donates a segment and receives a duplicted segment
#           e.g. a insertion only receives either a domestic or a foreign segment without donation
#           potential values: 0, 1, 2, 3, ..., 2^nploid-1;
#           index to specify the group of haploids impacted by donation action; 0 is none; 2^nploid-1 is all
#           if a sequqnce DONOR is multiple haploid then a random one among them is chosen
#           supports applicable distribution as string encoding as SIZEDIST field
#RECPTDIST: a recepient is the segment of haploid receiving pontentially altered donor content
#           potential values: 0, 1, 2, 3, ..., 2^nploid-1;
#           index to specify the group of haploids impacted by reception action; 0 is none; 2^nploid-1 is all
#           if a sequence RECPT is multiple then every in the multiple is receiving the chosen donor seq
#           e.g. FOR HOMOZYGOUS EVENTS at nploid=2:
#                DONOR must be fix_3, RECPT must be fix_3
#                FOR HEMIZYGOUS EVENTS at nploid=2:
#                DONOR can be either fix_1 or fix_2, RECPT can be either fix_1 or fix_2
#           supports applicable distribution as string encoding as SIZEDIST field
#DONORFREQ: allelic frequence for donor actions (the content removing part of a structural variation)
#RECPTREQ:  allelic frequence for recepient actions (the content insertion part of a structural variation)
#NOTE:      any human readable notes go here, which will be ignored by machine
#METATYPE   COUNT   SIZEDIST/SEQ    COPYDIST        DONORDIST       RECPTDIST       DONORFREQ       RECPTFREQ       NOTE
FINS        2       example.fas     fix_1   fix_3   fix_3   fix_0.5 fix_0.5 #FINS=COPY*INS, foreign insertion
ITRA        2       fix_1000        fix_1   fix_3   fix_3   fix_0.5 fix_0.5 #ITRA=1*DEL + COPY * INS, inverted transposition
DEL 4       fix_500_500_1000_2000_2000      fix_1   fix_3   fix_3   fix_1.0 fix_0.5 #DEL=1*DEL, deletion
TRA 2       fix_1000        fix_1   fix_3   fix_3   fix_1.0 fix_0.5 #TRA=1*DEL + COPY*INS, transposition
INV 2       fix_1000        fix_1   fix_3   fix_3   fix_0.5 fix_0.5 #INV=1*DEL + COPY*INS, inversion
IDUP        2       fix_1000        fix_2   fix_3   fix_3   fix_0.5 fix_0.5 #IDUP=1*DEL + COPY*INS, inverted duplication
DINS        2       fix_1000        fix_1   fix_3   fix_3   fix_0.5 fix_0.5 #DINS=COPY*INS, domestic insertion
DUP 2       fix_1000        fix_2   fix_3   fix_3   fix_0.5 fix_0.5 #DUP=COPY*INS, duplication

$SVENGINEPKG/svengine/test/example.par.

#example PAR file in python config format
#currently xwgsim is the only configurable section
#see https://wiki.python.org/moin/ConfigParserExamples for python configure format examples
[xwgsim]
#number of sequencing libraries
nlibrary=2
#library-wise names, json format
libnames=["lib1","lib2"]
#library-wise coverage, json format
coverage: {"lib1":40,"lib2":40}
#library-wise insert-size (isize), allowing mixtures
#  e.g. Here we have two libraries:
#  lib1 has two components: [ [mode11_weight=70%, mode11_mean=250, mode11_sd=20],
#                             [mode12_weight=30%, mode12_mean=400, mode12_sd=40] ]
#  lib2 has one component:  [ [mode21_weight=100%, mode21_mean=300, mode1_sd=30] ]
isize: {"lib1":[[0.7,250,20],[0.3,400,40]],"lib2":[[1,300,30]]}
#libarary-wise read1 length
read1: {"lib1":100,"lib2":100}
#libarary-wise read2 length
read2: {"lib1":100,"lib2":100}
#additional parameter to xwgsim, will be joined into a string spaced by blank
#  use xwgsim -h to view available options
par: {"-e":0.01,"-r":0,"-R":0,"-X":0}

$SVENGINEPKG/svengine/test/egTree.newick.

The tree input to SVEngine is in standard NEWICK format – a widely accepted format using parenthesis to encode nested tree structures. Each internal node is labelled by the population splitting variant and weighted by the conditional splitting fraction. Each leaf node is labelled by associated terminal genotype and weighted by the subpopulation fraction (optional). See http://evolution.genetics.washington.edu/phylip/newicktree.html. For instance, the NEWICK string for the example binary tree in SVEngine's publication is:

((C1,C2) V5:0.8, ((C3,C4)V4:0.8, (C5,C6)V3:0.9) V2:0.6) V1:0.5;

$SVENGINEPKG/svengine/test/egTree.var.

#The basic VAR file format is explained in example.var
#The VARFREQ field in this file is dummy and to be replaced after tree fraction computation
#It is important to note that the MID of this file must corresponds to the internal node label in the egTree.newick file
#The MID column is used as an identifier by tree2var for correctly substituting dummy fractions
#VID        MID     VARFREQ VARHAP  VARCHR  VARPOS  VARDEL  VARDELSIZE      VARINS  VARINSSEQ(HAP/SEQFILE,CHR:START-END,COPY,REVCOMP)
VAR_1       V1      1.0     0       2       1200001 True    50000   False   None
VAR_2       V2      1.0     0       2       1300001 True    50000   False   None
VAR_3       V3      1.0     0       2       1400001 True    50000   False   None
VAR_4       V4      1.0     0       2       1500001 True    50000   False   None
VAR_5       V5      1.0     0       2       1600001 True    50000   False   None

Updated