Wiki

Clone wiki

PanCake / Example Workflow

This is an example of a PanCake workflow analyzing 3 strains of Shigella flexneri, namely

  • S. flexneri 2a 2457T: 1 chromosome of length 4,599,354 bp (accession number NC_004741, gi 30061571)

  • S. flexneri 2a 301: 1 chromosome of length 4,607,202 bp (accession number NC_004337, gi 344915202) & 1 plasmid of length 221,618 bp (accession number NC_004851, gi 31983523)

  • S. flexneri 5 8401:1 chromosome of length 4,574,284 bp (accession number NC_008258, gi 110804074)

Initialization of the PanCake Data Structure

For initialization, complete DNA sequences are required. These can be retrieved directly from NCBI by indicating appropriate sequence identifiers.

A PanCake pangenome object including S. flexneri 2a 2457T and S. flexneri 2a 301 is initialized via

#!text
pancake create -i NC_004741 NC_004337 31983523 110804074 -e your.e-mail@address -p Shigella_flexneri.pan

which produces the following output

#!text
#
...pangenome initialized
...PanCake Object written to Shigella_flexneri.pan (14002741 bytes)

NOTE It is highly recommended to provide your e-mail address via parameter --email or -e as in case of misapplication NCBI will contact you at the e-mail address prior to blocking access. If no email address is specified, there will be a warning.

NOTE By specifying sequence ids via -iyou can swap arbitrarily between accession numbers and gi numbers.

NOTE By default, the PanCake data structure is initialized at ./pan_files/pancake.pan. An alternative output file is specified via parameter -p.

Assuming that sequence data of S. flexneri 2a 301 is already present in a (multiple) .fasta file on your system this can be specified via -s:

#!text
pancake create -i NC_004741 110804074 -s path/to/Shigella_flexneri_2a_301.fas -e your.e-mail@address -p Shigella_flexneri.pan

Get information about your PanCake Data Structure

Once the PanCake data structure is initialized, some basic information can be retrieved via

#!text
pancake status Shigella_flexneri.pan
which produces the following output
#!text
PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features)
#
4 chromosomes representing 4 genomes, namely:
#
Genome gi|30061571|ref|NC_004741.1|
>gi|30061571|ref|NC_004741.1| (4599354bp)
--> 1 un-aligned Feature Instances (mean length 4599354.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome gi|31983523|ref|NC_004851.1|
>gi|31983523|ref|NC_004851.1| (221618bp)
--> 1 un-aligned Feature Instances (mean length 221618.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome gi|344915202|ref|NC_004337.2|
>gi|344915202|ref|NC_004337.2| (4607202bp)
--> 1 un-aligned Feature Instances (mean length 4607202.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome gi|110804074|ref|NC_008258.1|
>gi|110804074|ref|NC_008258.1| (4574284bp)
--> 1 un-aligned Feature Instances (mean length 4574284.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features

By default, PanCake assumes that each chromosome constitutes its own genome. Furthermore, chromosome names are parsed automatically from sequence data.

Specify genomes and change chromosome names

You can determine chromosomes gi|344915202|ref|NC_004337.2| and gi|31983523|ref|NC_004851.1| as belonging to the same genome Shigella flexneri 2a 301 via

#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|344915202|ref|NC_004337.2|' 'gi|31983523|ref|NC_004851.1|' -g 'Shigella flexneri 2a 301'

Then, output of pancake status Shigella_flexneri.pan turns into

#!text
PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features)
#
4 chromosomes representing 3 genomes, namely:
#
Genome gi|30061571|ref|NC_004741.1|
>gi|30061571|ref|NC_004741.1| (4599354bp)
--> 1 un-aligned Feature Instances (mean length 4599354.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome Shigella flexneri 2a 301
>gi|31983523|ref|NC_004851.1| (221618bp)
--> 1 un-aligned Feature Instances (mean length 221618.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
>gi|344915202|ref|NC_004337.2| (4607202bp)
--> 1 un-aligned Feature Instances (mean length 4607202.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome gi|110804074|ref|NC_008258.1|
>gi|110804074|ref|NC_008258.1| (4574284bp)
--> 1 un-aligned Feature Instances (mean length 4574284.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features

Correspondingly, define genome S. flexneri 2a 2457T via

#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|30061571|ref|NC_004741.1|' -g 'Shigella flexneri 2a 2457T'
and genome S. flexneri 5 8401 via
#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|110804074|ref|NC_008258.1|' -g 'Shigella flexneri 5 8401'

As the current (default) chromosome names are unhandy, let's replace them by pure accession numbers:

#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|30061571|ref|NC_004741.1|' -n NC_004741

#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|344915202|ref|NC_004337.2|' -n NC_004337
#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|31983523|ref|NC_004851.1|' -n NC_004851
#!text
pancake specify -p Shigella_flexneri.pan -c 'gi|110804074|ref|NC_008258.1|' -n NC_008258

Subsequent output of pancake status Shigella_flexneri.pan:

#!text
PanGenome Object consists of 4 un-aligned FIs & 0 aligned FIs (organized in 0 Shared Features)
#
4 chromosomes representing 3 genomes, namely:
#
Genome Shigella flexneri 5 8401
>NC_008258, gi|110804074|ref|NC_008258.1| (4574284bp)
--> 1 un-aligned Feature Instances (mean length 4574284.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome Shigella flexneri 2a 301
>NC_004851, gi|31983523|ref|NC_004851.1| (221618bp)
--> 1 un-aligned Feature Instances (mean length 221618.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
>NC_004337, gi|344915202|ref|NC_004337.2| (4607202bp)
--> 1 un-aligned Feature Instances (mean length 4607202.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
#
Genome Shigella flexneri 2a 2457T
>NC_004741, gi|30061571|ref|NC_004741.1| (4599354bp)
--> 1 un-aligned Feature Instances (mean length 4599354.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features

From now on, chromosomes can be addressed by both of their names.

Genome specification and chromosome renaming can be performed more conviniently by providing a special tab-separated specification file. The previously mentioned changes could have been performed as well by calling

#!text
pancake specify -p Shigella_flexneri.pan -f specification.txt
with specification.txt containing
#!text
Shigella flexneri 2a 301  gi|31983523|ref|NC_004851.1|   NC_004851
Shigella flexneri 2a 301  gi|344915202|ref|NC_004337.2|  NC_004337
Shigella flexneri 2a 2457T   gi|30061571|ref|NC_004741.1|   NC_004741
Shigella flexneri 5 8401  gi|110804074|ref|NC_008258.1|  NC_008258

##Sequence Retrieval

For each chromosome (or genome) contained in a PanCake data structure a corresponding (multiple) .fasta file can be retrieved. For example, retrieval of genomic sequence of S. flexneri 2a 301 is done via

#!text
pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 2a 301' -o S_flex_2a_301.fas 
which produces multiple .fasta file S_flex_2a_301.fas including the sequence of all chromosomes orginating from S. flexneri 2a 301. Correspondingly create sequence data of S. flexneri 2a 2457T via

#!text
pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 2a 2457T' -o S_flex_2a_2457T.fas 

and sequence data of S. flexneri 5 8401 via

#!text
pancake sequence -p Shigella_flexneri.pan -g 'Shigella flexneri 5 8401' -o S_flex_5_8401.fas 

NOTE Line width of the resulting .fasta file can be determined by setting -l <line_width>. By default, line width is 100.

NOTE If no output file is specified via parameter -o <FASTA_FILE>, sequence is printed to STDOUT.

NOTE Sequence of a single chromosome can be retrieved by setting parameter -c <CHROM_NAME> instead of parameter -g <GENOME_NAME>.

Include alignment information

For retrieval of pairwise alignment information we highly recommend to use nucmer with parameter --maxmatch set, e.g. by calling

#!text
nucmer --maxmatch -p out1 S_flex_2a_301.fas S_flex_5_8401.fas 
and

#!text
nucmer --maxmatch -p out2 S_flex_2a_301.fas S_flex_2a_2457T.fas 

which produces .delta files out1.delta and out2.delta.

Then, alignment information can be integrated straightforwardly via

#!text
pancake addAli -p Shigella_flexneri.pan out1.delta out2.delta

producing output

#!text
...pangenome loaded from Shigella_flexneri.pan
#
...parsing .delta file out1.delta
...adding alignment 1000 (153 SFs created)
...adding alignment 2000 (273 SFs created)
...adding alignment 3000 (430 SFs created)
...adding alignment 4000 (525 SFs created)
...adding alignment 5000 (581 SFs created)
...adding alignment 6000 (632 SFs created)
...adding alignment 7000 (684 SFs created)
...adding alignment 8000 (725 SFs created)
...adding alignment 9000 (770 SFs created)
...adding alignment 10000 (828 SFs created)
...adding alignment 11000 (879 SFs created)
...adding alignment 12000 (910 SFs created)
...adding alignment 13000 (947 SFs created)
...adding alignment 14000 (973 SFs created)
...adding alignment 15000 (1006 SFs created)
...adding alignment 16000 (1052 SFs created)
...adding alignment 17000 (1076 SFs created)
...adding alignment 18000 (1119 SFs created)
...adding alignment 19000 (1168 SFs created)
...found 19271 valid pairwise alignments
#
...check for non-similarity SFs
...downgrade non-aligned Feature Instances
...concatenate Shared Features
#
#
...parsing .delta file out2.delta
...adding alignment 1000 (1256 SFs created)
...adding alignment 2000 (1277 SFs created)
...adding alignment 3000 (1279 SFs created)
...adding alignment 4000 (1280 SFs created)
...adding alignment 5000 (1298 SFs created)
...adding alignment 6000 (1307 SFs created)
...adding alignment 7000 (1305 SFs created)
...adding alignment 8000 (1297 SFs created)
...adding alignment 9000 (1301 SFs created)
...adding alignment 10000 (1301 SFs created)
...adding alignment 11000 (1315 SFs created)
...adding alignment 12000 (1333 SFs created)
...adding alignment 13000 (1333 SFs created)
...adding alignment 14000 (1333 SFs created)
...adding alignment 15000 (1339 SFs created)
...adding alignment 16000 (1341 SFs created)
...adding alignment 17000 (1342 SFs created)
...adding alignment 18000 (1353 SFs created)
...adding alignment 19000 (1353 SFs created)
...found 19356 valid pairwise alignments
#
...check for non-similarity SFs
...downgrade non-aligned Feature Instances
...concatenate Shared Features
#
...PanCake Object written to Shigella_flexneri.pan (6025488 bytes)

and turns status information about your PanCake Data Structure into

#!text
PanGenome Object consists of 216 un-aligned FIs & 96149 aligned FIs (organized in 1415 Shared Features)
#
4 chromosomes representing 3 genomes, namely:
#
Genome Shigella flexneri 5 8401
>NC_008258, gi|110804074|ref|NC_008258.1| (4574284bp)
--> 100 un-aligned Feature Instances (mean length 1354.27)
--> 30292 aligned Feature Instances (mean length 146.5356199656675) in 1310 Shared Features
#
Genome Shigella flexneri 2a 301
>NC_004337, gi|344915202|ref|NC_004337.2| (4607202bp)
--> 24 un-aligned Feature Instances (mean length 808.2916666666666)
--> 31310 aligned Feature Instances (mean length 146.528361545832) in 1406 Shared Features
>NC_004851, gi|31983523|ref|NC_004851.1| (221618bp)
--> 55 un-aligned Feature Instances (mean length 2849.0363636363636)
--> 3793 aligned Feature Instances (mean length 17.116003163722645) in 331 Shared Features
#
Genome Shigella flexneri 2a 2457T
>NC_004741, gi|30061571|ref|NC_004741.1| (4599354bp)
--> 37 un-aligned Feature Instances (mean length 980.5135135135135)
--> 30754 aligned Feature Instances (mean length 148.3733823242505) in 1391 Shared Features

NOTE You can specify an alternative output file via parameter -o. In this case, your original .pan file stays unchanged. NOTE PanCake identifies automatically if your input alignment file is a .delta file produced by nucmer or BLAST output. NOTE You can filter input alignments by minimum length by specifying -l <MIN_LEN> (default MIN_LEN =25) or skip pairwise alignments between regions on the same chromosome by setting flag -nsa.

Core Identification

All subsequences with a minmum length of 100bp originating from Shigella flexneri 2a 301 and being part of the core region corresponding to all 3 strains under consideration can be determined by calling

#!text
pancake core -p Shigella_flexneri.pan -rg 'Shigella flexneri 2a 301' -l 100

which produces the following output

#!text
#
...parsing NC_004337
against
GENOMES
 Shigella flexneri 5 8401
 Shigella flexneri 2a 2457T
#
Found 31175 core FIs!
#
...1000 Feature Instances parsed (current chromosome position = 273345)
...2000 Feature Instances parsed (current chromosome position = 325635)
...3000 Feature Instances parsed (current chromosome position = 478802)
...4000 Feature Instances parsed (current chromosome position = 699355)
...5000 Feature Instances parsed (current chromosome position = 723646)
...6000 Feature Instances parsed (current chromosome position = 903717)
...7000 Feature Instances parsed (current chromosome position = 917852)
...8000 Feature Instances parsed (current chromosome position = 1070392)
...9000 Feature Instances parsed (current chromosome position = 1209241)
...10000 Feature Instances parsed (current chromosome position = 1391363)
...11000 Feature Instances parsed (current chromosome position = 1414815)
...12000 Feature Instances parsed (current chromosome position = 1467496)
...13000 Feature Instances parsed (current chromosome position = 1574999)
...14000 Feature Instances parsed (current chromosome position = 1632951)
...15000 Feature Instances parsed (current chromosome position = 1828509)
...16000 Feature Instances parsed (current chromosome position = 1850037)
...17000 Feature Instances parsed (current chromosome position = 1939729)
...18000 Feature Instances parsed (current chromosome position = 2051074)
...19000 Feature Instances parsed (current chromosome position = 2083925)
...20000 Feature Instances parsed (current chromosome position = 2204771)
...21000 Feature Instances parsed (current chromosome position = 2554227)
...22000 Feature Instances parsed (current chromosome position = 2756763)
...23000 Feature Instances parsed (current chromosome position = 2964538)
...24000 Feature Instances parsed (current chromosome position = 3237950)
...25000 Feature Instances parsed (current chromosome position = 3403734)
...26000 Feature Instances parsed (current chromosome position = 3677137)
...27000 Feature Instances parsed (current chromosome position = 3834731)
...28000 Feature Instances parsed (current chromosome position = 3915840)
...29000 Feature Instances parsed (current chromosome position = 4177332)
...30000 Feature Instances parsed (current chromosome position = 4462470)
...31000 Feature Instances parsed (current chromosome position = 4540504)
Found 1965 core regions.
Minimum Length = 100
Maximum Length = 80052
Mean = 2939.206106870229

#
...parsing NC_004851
against
GENOMES
 Shigella flexneri 5 8401
 Shigella flexneri 2a 2457T
#
Found 3770 core FIs!
#
...1000 Feature Instances parsed (current chromosome position = 58493)
...2000 Feature Instances parsed (current chromosome position = 136600)
...3000 Feature Instances parsed (current chromosome position = 197572)
Found 209 core regions.
Minimum Length = 101
Maximum Length = 1692
Mean = 369.444976076555

A summary of all identified regions is written automatically to .bed file core_<REF_GENOME>.bed (i.e. core_Shigella flexneri 2a 301.bed in this case). Furthermore, in automatically created folder core_<REF_GENOME> (i.e. core_Shigella flexneri 2a 301) each identified core region including all corresponding subsequences is given in a separate multiple .fas format. An alternative name of that folder can be determined by parameter -o <DIRECTORY>. Multiple .fasta output can be supressed completely by setting flag -no.

All subregions on chromosome NC_004851 similar to subsequences in Shigella flexneri 5 8401 can be identified via

#!text
pancake core -rc NC_004851 -nrg 'Shigella flexneri 5 8401' -p Shigella_flexneri.pan -no

which would be equivalent to

#!text
pancake core -rc NC_004851 -eg 'Shigella flexneri 2a 2457T' -p Shigella_flexneri.pan -no

NOTE Both calls produce .bed file core_NC_004851.bed. Output of identified core regions in multiple FASTA format is supressed by flag -no.

Updated