Wiki
Clone wikiPanCake / Example Workflow
This is an example of a PanCake workflow analyzing 3 strains of Shigella flexneri, namely
-
S. flexneri 2a 2457T: 1 chromosome of length 4,599,354 bp (accession number NC_004741, gi 30061571)
-
S. flexneri 2a 301: 1 chromosome of length 4,607,202 bp (accession number NC_004337, gi 344915202) & 1 plasmid of length 221,618 bp (accession number NC_004851, gi 31983523)
-
S. flexneri 5 8401:1 chromosome of length 4,574,284 bp (accession number NC_008258, gi 110804074)
Initialization of the PanCake Data Structure
For initialization, complete DNA sequences are required. These can be retrieved directly from NCBI by indicating appropriate sequence identifiers.
A PanCake pangenome object including S. flexneri 2a 2457T and S. flexneri 2a 301 is initialized via
#!text pancake create -i NC_004741 NC_004337 31983523 110804074 -e your.e-mail@address -p Shigella_flexneri.pan
which produces the following output
#!text # ...pangenome initialized ...PanCake Object written to Shigella_flexneri.pan (14002741 bytes)
NOTE It is highly recommended to provide your e-mail address via parameter --email
or -e
as in case of misapplication NCBI will contact you at the e-mail address prior to blocking access. If no email address is specified, there will be a warning.
NOTE By specifying sequence ids via -i
you can swap arbitrarily between accession numbers and gi numbers.
NOTE By default, the PanCake data structure is initialized at ./pan_files/pancake.pan
. An alternative output file is specified via parameter -p
.
Assuming that sequence data of S. flexneri 2a 301 is already present in a (multiple) .fasta file on your system this can be specified via -s
:
#!text pancake create -i NC_004741 110804074 -s path/to/Shigella_flexneri_2a_301.fas -e your.e-mail@address -p Shigella_flexneri.pan
Get information about your PanCake Data Structure
Once the PanCake data structure is initialized, some basic information can be retrieved via
#!text pancake status Shigella_flexneri.pan
#!text PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features) # 4 chromosomes representing 4 genomes, namely: # Genome gi|30061571|ref|NC_004741.1| >gi|30061571|ref|NC_004741.1| (4599354bp) --> 1 un-aligned Feature Instances (mean length 4599354.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome gi|31983523|ref|NC_004851.1| >gi|31983523|ref|NC_004851.1| (221618bp) --> 1 un-aligned Feature Instances (mean length 221618.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome gi|344915202|ref|NC_004337.2| >gi|344915202|ref|NC_004337.2| (4607202bp) --> 1 un-aligned Feature Instances (mean length 4607202.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome gi|110804074|ref|NC_008258.1| >gi|110804074|ref|NC_008258.1| (4574284bp) --> 1 un-aligned Feature Instances (mean length 4574284.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
By default, PanCake assumes that each chromosome constitutes its own genome. Furthermore, chromosome names are parsed automatically from sequence data.
Specify genomes and change chromosome names
You can determine chromosomes gi|344915202|ref|NC_004337.2| and gi|31983523|ref|NC_004851.1| as belonging to the same genome Shigella flexneri 2a 301 via
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|344915202|ref|NC_004337.2|' 'gi|31983523|ref|NC_004851.1|' -g 'Shigella flexneri 2a 301'
Then, output of pancake status Shigella_flexneri.pan
turns into
#!text PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features) # 4 chromosomes representing 3 genomes, namely: # Genome gi|30061571|ref|NC_004741.1| >gi|30061571|ref|NC_004741.1| (4599354bp) --> 1 un-aligned Feature Instances (mean length 4599354.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome Shigella flexneri 2a 301 >gi|31983523|ref|NC_004851.1| (221618bp) --> 1 un-aligned Feature Instances (mean length 221618.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features >gi|344915202|ref|NC_004337.2| (4607202bp) --> 1 un-aligned Feature Instances (mean length 4607202.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome gi|110804074|ref|NC_008258.1| >gi|110804074|ref|NC_008258.1| (4574284bp) --> 1 un-aligned Feature Instances (mean length 4574284.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
Correspondingly, define genome S. flexneri 2a 2457T via
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|30061571|ref|NC_004741.1|' -g 'Shigella flexneri 2a 2457T'
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|110804074|ref|NC_008258.1|' -g 'Shigella flexneri 5 8401'
As the current (default) chromosome names are unhandy, let's replace them by pure accession numbers:
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|30061571|ref|NC_004741.1|' -n NC_004741
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|344915202|ref|NC_004337.2|' -n NC_004337
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|31983523|ref|NC_004851.1|' -n NC_004851
#!text pancake specify -p Shigella_flexneri.pan -c 'gi|110804074|ref|NC_008258.1|' -n NC_008258
Subsequent output of pancake status Shigella_flexneri.pan
:
#!text PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features) # 4 chromosomes representing 3 genomes, namely: # Genome Shigella flexneri 5 8401 >NC_008258, gi|110804074|ref|NC_008258.1| (4574284bp) --> 1 un-aligned Feature Instances (mean length 4574284.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome Shigella flexneri 2a 301 >NC_004851, gi|31983523|ref|NC_004851.1| (221618bp) --> 1 un-aligned Feature Instances (mean length 221618.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features >NC_004337, gi|344915202|ref|NC_004337.2| (4607202bp) --> 1 un-aligned Feature Instances (mean length 4607202.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features # Genome Shigella flexneri 2a 2457T >NC_004741, gi|30061571|ref|NC_004741.1| (4599354bp) --> 1 un-aligned Feature Instances (mean length 4599354.0) --> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
From now on, chromosomes can be addressed by both of their names.
Genome specification and chromosome renaming can be performed more conviniently by providing a special tab-separated specification file. The previously mentioned changes could have been performed as well by calling
#!text pancake specify -p Shigella_flexneri.pan -f specification.txt
specification.txt
containing
#!text Shigella flexneri 2a 301 gi|31983523|ref|NC_004851.1| NC_004851 Shigella flexneri 2a 301 gi|344915202|ref|NC_004337.2| NC_004337 Shigella flexneri 2a 2457T gi|30061571|ref|NC_004741.1| NC_004741 Shigella flexneri 5 8401 gi|110804074|ref|NC_008258.1| NC_008258
##Sequence Retrieval
For each chromosome (or genome) contained in a PanCake data structure a corresponding (multiple) .fasta file can be retrieved. For example, retrieval of genomic sequence of S. flexneri 2a 301 is done via
#!text pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 2a 301' -o S_flex_2a_301.fas
#!text pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 2a 2457T' -o S_flex_2a_2457T.fas
and sequence data of S. flexneri 5 8401 via
#!text pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 5 8401' -o S_flex_5_8401.fas
NOTE Line width of the resulting .fasta file can be determined by setting -l <line_width>
. By default, line width is 100.
NOTE If no output file is specified via parameter -o <FASTA_FILE>
, sequence is printed to STDOUT.
NOTE Sequence of a single chromosome can be retrieved by setting parameter -c <CHROM_NAME>
instead of parameter -g <GENOME_NAME>
.
Include alignment information
For retrieval of pairwise alignment information we highly recommend to use nucmer with parameter --maxmatch
set, e.g. by calling
#!text nucmer --maxmatch -p out1 S_flex_2a_301.fas S_flex_5_8401.fas
#!text nucmer --maxmatch -p out2 S_flex_2a_301.fas S_flex_2a_2457T.fas
which produces .delta files out1.delta and out2.delta.
Then, alignment information can be integrated straightforwardly via
#!text pancake addAli -p Shigella_flexneri.pan out1.delta out2.delta
producing output
#!text ...pangenome loaded from Shigella_flexneri.pan # ...parsing .delta file out1.delta ...adding alignment 1000 (153 SFs created) ...adding alignment 2000 (273 SFs created) ...adding alignment 3000 (430 SFs created) ...adding alignment 4000 (525 SFs created) ...adding alignment 5000 (581 SFs created) ...adding alignment 6000 (632 SFs created) ...adding alignment 7000 (684 SFs created) ...adding alignment 8000 (725 SFs created) ...adding alignment 9000 (770 SFs created) ...adding alignment 10000 (828 SFs created) ...adding alignment 11000 (879 SFs created) ...adding alignment 12000 (910 SFs created) ...adding alignment 13000 (947 SFs created) ...adding alignment 14000 (973 SFs created) ...adding alignment 15000 (1006 SFs created) ...adding alignment 16000 (1052 SFs created) ...adding alignment 17000 (1076 SFs created) ...adding alignment 18000 (1119 SFs created) ...adding alignment 19000 (1168 SFs created) ...found 19271 valid pairwise alignments # ...check for non-similarity SFs ...downgrade non-aligned Feature Instances ...concatenate Shared Features # # ...parsing .delta file out2.delta ...adding alignment 1000 (1256 SFs created) ...adding alignment 2000 (1277 SFs created) ...adding alignment 3000 (1279 SFs created) ...adding alignment 4000 (1280 SFs created) ...adding alignment 5000 (1298 SFs created) ...adding alignment 6000 (1307 SFs created) ...adding alignment 7000 (1305 SFs created) ...adding alignment 8000 (1297 SFs created) ...adding alignment 9000 (1301 SFs created) ...adding alignment 10000 (1301 SFs created) ...adding alignment 11000 (1315 SFs created) ...adding alignment 12000 (1333 SFs created) ...adding alignment 13000 (1333 SFs created) ...adding alignment 14000 (1333 SFs created) ...adding alignment 15000 (1339 SFs created) ...adding alignment 16000 (1341 SFs created) ...adding alignment 17000 (1342 SFs created) ...adding alignment 18000 (1353 SFs created) ...adding alignment 19000 (1353 SFs created) ...found 19356 valid pairwise alignments # ...check for non-similarity SFs ...downgrade non-aligned Feature Instances ...concatenate Shared Features # ...PanCake Object written to Shigella_flexneri.pan (6025488 bytes)
and turns status information about your PanCake Data Structure into
#!text PanGenome Object consists of 216 un-aligned FIs & 96149 aligned FIs (organized in 1415 Shared Features) # 4 chromosomes representing 3 genomes, namely: # Genome Shigella flexneri 5 8401 >NC_008258, gi|110804074|ref|NC_008258.1| (4574284bp) --> 100 un-aligned Feature Instances (mean length 1354.27) --> 30292 aligned Feature Instances (mean length 146.5356199656675) in 1310 Shared Features # Genome Shigella flexneri 2a 301 >NC_004337, gi|344915202|ref|NC_004337.2| (4607202bp) --> 24 un-aligned Feature Instances (mean length 808.2916666666666) --> 31310 aligned Feature Instances (mean length 146.528361545832) in 1406 Shared Features >NC_004851, gi|31983523|ref|NC_004851.1| (221618bp) --> 55 un-aligned Feature Instances (mean length 2849.0363636363636) --> 3793 aligned Feature Instances (mean length 17.116003163722645) in 331 Shared Features # Genome Shigella flexneri 2a 2457T >NC_004741, gi|30061571|ref|NC_004741.1| (4599354bp) --> 37 un-aligned Feature Instances (mean length 980.5135135135135) --> 30754 aligned Feature Instances (mean length 148.3733823242505) in 1391 Shared Features
NOTE You can specify an alternative output file via parameter -o
. In this case, your original .pan file stays unchanged.
NOTE PanCake identifies automatically if your input alignment file is a .delta file produced by nucmer or BLAST output.
NOTE You can filter input alignments by minimum length by specifying -l <MIN_LEN>
(default MIN_LEN =25) or skip pairwise alignments between regions on the same chromosome by setting flag -nsa
.
Core Identification
All subsequences with a minmum length of 100bp originating from Shigella flexneri 2a 301 and being part of the core region corresponding to all 3 strains under consideration can be determined by calling
#!text pancake core -p Shigella_flexneri.pan -rg 'Shigella flexneri 2a 301' -l 100
which produces the following output
#!text # ...parsing NC_004337 against GENOMES Shigella flexneri 5 8401 Shigella flexneri 2a 2457T # Found 31175 core FIs! # ...1000 Feature Instances parsed (current chromosome position = 273345) ...2000 Feature Instances parsed (current chromosome position = 325635) ...3000 Feature Instances parsed (current chromosome position = 478802) ...4000 Feature Instances parsed (current chromosome position = 699355) ...5000 Feature Instances parsed (current chromosome position = 723646) ...6000 Feature Instances parsed (current chromosome position = 903717) ...7000 Feature Instances parsed (current chromosome position = 917852) ...8000 Feature Instances parsed (current chromosome position = 1070392) ...9000 Feature Instances parsed (current chromosome position = 1209241) ...10000 Feature Instances parsed (current chromosome position = 1391363) ...11000 Feature Instances parsed (current chromosome position = 1414815) ...12000 Feature Instances parsed (current chromosome position = 1467496) ...13000 Feature Instances parsed (current chromosome position = 1574999) ...14000 Feature Instances parsed (current chromosome position = 1632951) ...15000 Feature Instances parsed (current chromosome position = 1828509) ...16000 Feature Instances parsed (current chromosome position = 1850037) ...17000 Feature Instances parsed (current chromosome position = 1939729) ...18000 Feature Instances parsed (current chromosome position = 2051074) ...19000 Feature Instances parsed (current chromosome position = 2083925) ...20000 Feature Instances parsed (current chromosome position = 2204771) ...21000 Feature Instances parsed (current chromosome position = 2554227) ...22000 Feature Instances parsed (current chromosome position = 2756763) ...23000 Feature Instances parsed (current chromosome position = 2964538) ...24000 Feature Instances parsed (current chromosome position = 3237950) ...25000 Feature Instances parsed (current chromosome position = 3403734) ...26000 Feature Instances parsed (current chromosome position = 3677137) ...27000 Feature Instances parsed (current chromosome position = 3834731) ...28000 Feature Instances parsed (current chromosome position = 3915840) ...29000 Feature Instances parsed (current chromosome position = 4177332) ...30000 Feature Instances parsed (current chromosome position = 4462470) ...31000 Feature Instances parsed (current chromosome position = 4540504) Found 1965 core regions. Minimum Length = 100 Maximum Length = 80052 Mean = 2939.206106870229 # ...parsing NC_004851 against GENOMES Shigella flexneri 5 8401 Shigella flexneri 2a 2457T # Found 3770 core FIs! # ...1000 Feature Instances parsed (current chromosome position = 58493) ...2000 Feature Instances parsed (current chromosome position = 136600) ...3000 Feature Instances parsed (current chromosome position = 197572) Found 209 core regions. Minimum Length = 101 Maximum Length = 1692 Mean = 369.444976076555
A summary of all identified regions is written automatically to .bed file core_<REF_GENOME>.bed
(i.e. core_Shigella flexneri 2a 301.bed in this case).
Furthermore, in automatically created folder core_<REF_GENOME>
(i.e. core_Shigella flexneri 2a 301) each identified core region including all corresponding subsequences is given in a separate multiple .fas format. An alternative name of that folder can be determined by parameter -o <DIRECTORY>
. Multiple .fasta output can be supressed completely by setting flag -no
.
All subregions on chromosome NC_004851 similar to subsequences in Shigella flexneri 5 8401 can be identified via
#!text pancake core -rc NC_004851 -nrg 'Shigella flexneri 5 8401' -p Shigella_flexneri.pan -no
which would be equivalent to
#!text pancake core -rc NC_004851 -eg 'Shigella flexneri 2a 2457T' -p Shigella_flexneri.pan -no
NOTE Both calls produce .bed file core_NC_004851.bed. Output of identified core regions in multiple FASTA format is supressed by flag -no
.
Updated