PanCake / Documentation
#Initialization of a PanCake Data Object
Call `pancake create` and provide at least one DNA sequence. Sequences can be specified as

- one or more .fasta files via parameter `-s`.
- (a list of) gi ids via parameter `-i`. The corresponding sequence data will be downloaded from the NCBI data resource. If you use this option, make sure to provide your email address via parameter `--email`. In case of misuse, NCBI will contact the user at that e-mail address before blocking access.

Sequence names are parsed automatically from the input files, but can be changed afterwards by calling `pancake specify`.
Use parameter `-p` to specify the text file your PanCake Data Object will be written to. If no output file is specified, it is written to `./pan_files/pancake.pan`.
It is highly recommended to include information from pairwise alignments immediately in a newly created PanCake Data Object, as this noticeably decreases the output file's size.
#Including pairwise Alignment Information
Information from pairwise alignments can be included during execution of

- `pancake create` (PanCake Data Object initialization)
- `pancake addChromosomes` (add chromosomes to an existing PanCake Object)
- `pancake addAli` (include alignment information in an existing PanCake Object)

via parameter `-a <ALIGNMENT_FILE>`.
Currently, alignments can be provided in two input file types, namely

- BLAST's default output format (defined as 'pairwise', output format 0)
- .delta files, the output file format of nucmer

The type of each provided alignment file is detected automatically.
Alignments can be filtered by the following additional parameters:

- `-l <MIN_LEN>`: each of the two sequences in a pairwise alignment has to be of length equal to or greater than MIN_LEN for the alignment to be included (DEFAULT: MIN_LEN=25)
- `-nsa`: if set, pairwise alignments between regions on identical chromosomes are excluded from the input alignments (DEFAULT=False)
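
The two filters can be read as a simple predicate over alignments. The following Python sketch is not PanCake's implementation: the `Alignment` record, its field names and the function `keep_alignment` are hypothetical and only illustrate the rules described above, assuming 1-based inclusive coordinates.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record of one pairwise alignment (not a PanCake type)."""
    chrom_a: str
    start_a: int  # 1-based, inclusive
    stop_a: int
    chrom_b: str
    start_b: int
    stop_b: int


def keep_alignment(ali, min_len=25, no_self_alignments=False):
    """Return True if the alignment passes filters in the spirit of -l and -nsa."""
    # abs() tolerates reverse-strand hits where start > stop.
    len_a = abs(ali.stop_a - ali.start_a) + 1
    len_b = abs(ali.stop_b - ali.start_b) + 1
    if len_a < min_len or len_b < min_len:
        return False
    if no_self_alignments and ali.chrom_a == ali.chrom_b:
        return False
    return True


alignments = [
    Alignment("chr1", 1, 100, "chr2", 51, 150),   # long enough, two chromosomes
    Alignment("chr1", 1, 20, "chr2", 1, 20),      # shorter than min_len
    Alignment("chr1", 1, 100, "chr1", 501, 600),  # both regions on chr1
]
kept = [a for a in alignments
        if keep_alignment(a, min_len=25, no_self_alignments=True)]
print(len(kept), "of", len(alignments), "alignments kept")
```

With the defaults (`min_len=25`, `no_self_alignments=False`), the third alignment would be kept as well, because self-alignments are only dropped when the flag is set.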
You can write the resulting PanCake Data Object to a new .pan file specified by parameter `-o <NEW_PAN_FILE>`.
This will leave the original PanCake Object unchanged.
#PanCake Object Overview
An overview of a PanCake Data Object is retrieved by calling `pancake status <PAN_FILE>`.
Example output:

    #!text
    PanGenome Object consists of 117 un-aligned FIs & 71236 aligned FIs (organized in 3105 Shared Features)
    # 9 chromosomes representing 3 genomes, namely:
    # Genome A. baumannii 1656-2
    >CP001923, gi|322509998|gb|CP001923.1|(8041bp)
    --> 4 un-aligned Feature Instances (mean length 787.5)
    --> 48 aligned Feature Instances (mean length 101.89583333333333) in 24 Shared Features
    >CP001921, gi|322506180|gb|CP001921.1|(3940614bp)
    --> 36 un-aligned Feature Instances (mean length 1173.9166666666667)
    --> 21953 aligned Feature Instances (mean length 177.5772331799754) in 2562 Shared Features
    >CP001922, gi|322509896|gb|CP001922.1|(74451bp)
    --> 9 un-aligned Feature Instances (mean length 765.8888888888889)
    --> 154 aligned Feature Instances (mean length 438.68831168831167) in 61 Shared Features
    # Genome A. baumannii AYE
    >gi|169147133|emb|CU459141.1|(3936291bp)
    --> 29 un-aligned Feature Instances (mean length 3283.7586206896553)
    --> 24662 aligned Feature Instances (mean length 155.74819560457385) in 2399 Shared Features
    >gi|169147044|emb|CU459139.1|(2726bp)
    --> 1 un-aligned Feature Instances (mean length 2726.0)
    --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
    >gi|169147050|emb|CU459140.1|(94413bp)
    --> 17 un-aligned Feature Instances (mean length 4591.117647058823)
    --> 1128 aligned Feature Instances (mean length 14.50709219858156) in 166 Shared Features
    >gi|169147024|emb|CU459137.1|(5644bp)
    --> 2 un-aligned Feature Instances (mean length 2385.0)
    --> 30 aligned Feature Instances (mean length 29.133333333333333) in 21 Shared Features
    >gi|169147032|emb|CU459138.1|(9661bp)
    --> 1 un-aligned Feature Instances (mean length 1592.0)
    --> 42 aligned Feature Instances (mean length 192.11904761904762) in 33 Shared Features
    # Genome A. baumannii AB307-0294
    >gi|213985689|gb|CP001172.1|(3760981bp)
    --> 18 un-aligned Feature Instances (mean length 2637.0555555555557)
    --> 23219 aligned Feature Instances (mean length 159.93427796201388) in 2252 Shared Features

#Rename Chromosomes and specify Genomes
By default, chromosome names are parsed automatically from the input sequence files, and each chromosome belongs to its own genome, whose name is identical to the chromosome's name.
Once a PanCake Object is initialized, chromosome names and the assignment of chromosomes to genomes can be changed by calling `pancake specify`.
The most convenient way to specify genomes and change chromosome names is to provide a tab-separated file via parameter `-f <file>`, for example

    #!text
    Genome1	Chromosome1.1	new name of Chromosome1.1
    Genome1	Chromosome1.2	new name of Chromosome1.2
    Genome2	Chromosome2.1
    Genome3	Chromosome3.1
    Genome3	Chromosome3.2	new name of Chromosome3.2

Pass this file to `pancake specify`:

    #!text
    pancake specify -p path/to/your/panfile -f <file>

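
For larger collections it can help to generate the tab-separated file with a script. The following is a minimal Python sketch under a few assumptions: the helper `write_specify_file`, the output file name `genomes.tsv` and the use of `None` for "no new name" are illustrative choices, and the column layout (genome, chromosome, optional new chromosome name) is taken from the example table above.

```python
import csv
import os
import tempfile

# Rows mirror the example table: (genome, chromosome, optional new name).
rows = [
    ("Genome1", "Chromosome1.1", "new name of Chromosome1.1"),
    ("Genome1", "Chromosome1.2", "new name of Chromosome1.2"),
    ("Genome2", "Chromosome2.1", None),  # keep the current chromosome name
    ("Genome3", "Chromosome3.1", None),
    ("Genome3", "Chromosome3.2", "new name of Chromosome3.2"),
]


def write_specify_file(path, rows):
    """Write one tab-separated line per chromosome; omit the third column if unset."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for genome, chromosome, new_name in rows:
            fields = [genome, chromosome]
            if new_name:
                fields.append(new_name)
            writer.writerow(fields)


# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "genomes.tsv")
write_specify_file(path, rows)
print("wrote", path)
```

The resulting file can then be handed to `pancake specify` via `-f`, as in the command above.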
NOTE Chromosomes may have several names, but no two chromosomes may share the same name. Each newly specified chromosome name is added as an additional name by which the corresponding chromosome can be addressed.
You can delete a chromosome's name via `pancake specify -p path/to/your/panfile -d <name_to_delete>`.
If `<name_to_delete>` is the only name of the corresponding chromosome, PanCake will warn you and interrupt the operation.
Chromosome names and genomes can be specified separately as well. Add an additional chromosome name via

    #!text
    pancake specify -p path/to/your/panfile -c <chromosome> -n <new_name>

Group chromosomes into genomes via

    #!text
    pancake specify -p path/to/your/panfile -c <chromosome1.1> <chromosome1.2> ... -g <genome_name>

NOTE In contrast to chromosomes, genomes are only allowed to have a single name. If you specify chrom1 belonging to genome G, and subsequently specify chrom2 belonging to G, chrom1 and chrom2 are part of the same genome.
#Identification of Singletons
To get the set of singleton regions of a genome, type:

    #!text
    pancake singletons -p path/to/your/panfile -rg <genome>

or, for a single chromosome:

    #!text
    pancake singletons -p path/to/your/panfile -rc <chromosome>

Identified singleton regions are filtered by their length. You can specify a minimum length of valid regions via `-l <min_length>`.
By default, PanCake computes singleton regions with respect to ALL genomes included in the PanCake Object given by the specified panfile. To restrict the comparison to a subset of genomes (i.e. chromosomes), there are four options:

- specify the genome set explicitly via `-nrg <genome1> <genome2> <genome3>`
- specify the chromosomes to compare with explicitly via `-nrc <chrom1> <chrom2> <chrom3>`
- exclude genomes from the set of genomes to compare with via `-eg <genome1> <genome2> <genome3>`
- exclude chromosomes from the set of chromosomes to compare with via `-ec <chrom1> <chrom2> <chrom3>`

You may combine these parameters in any way; PanCake will give you an overview of the final set of all genomes and chromosomes considered in the comparison.
Computation of singletons always produces a BED file containing the identified regions. By default, this is `<genome>.bed` or `<chromosome>.bed`, respectively.
You can specify an alternative file name via parameter `-b <filename>`.
By default, PanCake also produces a folder `singletons_<genome>/` or `singletons_<chromosome>/`, respectively, containing one FASTA file per identified region.
FASTA output can be suppressed by setting flag `-no`. An alternative output directory can be set via `-o <folder_name>`.
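
The BED output can be post-processed with a few lines of Python. The sketch below assumes that the file follows the common BED convention (tab-separated, first three columns chromosome, 0-based start and exclusive end); whether PanCake writes further columns is not stated here, so only the first three are read. The function `read_bed` and the example content are made up for illustration.

```python
import io


def read_bed(handle):
    """Yield (chrom, start, end) from the first three columns of a BED file."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        yield chrom, int(start), int(end)


# Made-up content; a real file would be opened with open("<genome>.bed").
example = io.StringIO("chrA\t0\t120\nchrA\t500\t530\nchrB\t10\t1010\n")
regions = list(read_bed(example))

# With 0-based start and exclusive end, the region length is end - start.
lengths = [end - start for _, start, end in regions]
total = sum(lengths)
print(len(regions), "regions covering", total, "bp")
```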
#Identification of Core Genes
To get the core regions of a genome, type:

    #!text
    pancake core -p path/to/your/panfile -rg <genome>

or, for a single chromosome:

    #!text
    pancake core -p path/to/your/panfile -rc <chromosome>

In general, computation of core regions takes the same parameters as the identification of singletons (i.e. specify a minimum length, specify the set of genomes or chromosomes to compare with via `-nrg`, `-nrc`, `-eg` or `-ec`, specify output files via `-b` and `-o`, or suppress FASTA output by setting flag `-no`).
In addition, core regions are determined by two further parameters, namely

- the maximum space (i.e. number of base pairs) allowed between two consecutive core regions for them to be merged into one valid core region. This maximum space can be specified via parameter `-s <max_space>` (default=25).
- the maximum fraction of an identified core region that is allowed to be not aligned to ALL chromosomes under consideration. This fraction can be specified via parameter `-f <max_frac>` (default=0.05). See the example below.

EXAMPLE Consider a chromosome with subsequences between positions 1 and 30, as well as between positions 50 and 75, identified as part of the core genome. The region between 31 and 49 is known to be not aligned to all other genomes under consideration and is hence not a valid part of a core region.
The gap of 19 base pairs is within the default maximum space of 25, so the two regions are candidates for merging. However, merging the region between 1 and 75 into one big core region would yield a fraction of non-valid positions of (49-31+1)/75=0.253 within the resulting 'core region'. As 0.253 is greater than the maximum fraction 0.05, both core regions will appear separately in the output.
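
The arithmetic of this example can be reproduced with a short Python sketch. It is only an illustration of the two rules as described above, not PanCake's code: the function `try_merge` is hypothetical, coordinates are taken as 1-based and inclusive, and the whole gap between the two regions is assumed to be non-aligned.

```python
def try_merge(left, right, max_space=25, max_frac=0.05):
    """Merge two core regions (start, stop) if both limits are respected.

    Returns the merged (start, stop) tuple, or None if the regions stay separate.
    """
    gap = right[0] - left[1] - 1          # positions strictly between the regions
    if gap > max_space:
        return None                       # regions are too far apart
    merged_length = right[1] - left[0] + 1
    if gap / merged_length > max_frac:
        return None                       # too many non-core positions
    return (left[0], right[1])


# Example from the text: gap = 19, merged length = 75, fraction = 19/75.
print(round(19 / 75, 3))                  # 0.253
print(try_merge((1, 30), (50, 75)))       # None, fraction exceeds 0.05

# A small gap inside long regions is merged: 10 / 2000 = 0.005.
print(try_merge((1, 1000), (1011, 2000)))
```

Under these assumptions, raising `max_frac` above 0.253 would allow the regions of the example to be reported as one core region from 1 to 75.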
#Retrieval of sequence data
At any time, a PanCake Object provides error-free retrieval of the sequences of all included chromosomes.
Calling

    #!text
    pancake sequence -p <panfile> -start <start_pos> -stop <stop_pos> <chromosome>

prints the sequence of `<chromosome>` between `<start_pos>` and `<stop_pos>` (inclusive) to standard output.
If an output file is provided via `-f <file_name>`, the sequence is written to `<file_name>` (including a FASTA header). Line breaks are inserted after every l-th position; l can be specified via parameter `-l <l>`. By default, each line of the resulting FASTA file has length l=100.
If no `<start_pos>` is defined, the sequence starts at base position 1 of `<chromosome>`. Accordingly, if no stop position is specified, the sequence stops at the last base of `<chromosome>`.
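
The coordinate and line-wrapping rules can be illustrated with a small Python sketch. It does not call PanCake: the function `extract_fasta` and the header layout `>name:start-stop` are hypothetical, and only the 1-based inclusive positions, the defaults for missing start and stop, and the line length of 100 are taken from the description above.

```python
def extract_fasta(name, sequence, start=None, stop=None, line_length=100):
    """Return a FASTA record for sequence[start..stop], 1-based and inclusive."""
    start = 1 if start is None else start
    stop = len(sequence) if stop is None else stop
    # 1-based inclusive positions map to the Python slice [start - 1:stop].
    sub = sequence[start - 1:stop]
    lines = [sub[i:i + line_length] for i in range(0, len(sub), line_length)]
    header = f">{name}:{start}-{stop}"  # hypothetical header layout
    return "\n".join([header] + lines) + "\n"


chromosome = "ACGT" * 60  # toy chromosome of 240 bases
record = extract_fasta("chr1", chromosome, start=11, stop=230)
print(record, end="")
```

Here 220 bases are extracted, so the record consists of the header followed by lines of 100, 100 and 20 bases.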