PanCake / Documentation

#Initialization of a PanCake Data Object

Call

pancake create

by providing at least one DNA sequence. This can be done by specification of

  • one or more .fasta files via parameter -s.

  • one or more GI ids via parameter -i. The corresponding sequence data will be downloaded from NCBI. If you use this option, make sure to provide your e-mail address via parameter --email. In case of misuse, NCBI will contact the user at that e-mail address before blocking access.

Sequence names will be parsed automatically from input files, but can be changed subsequently by calling pancake specify.

Use parameter -p to specify the text file to which your PanCake Data Object will be written. If no output file is specified, it is written to ./pan_files/pancake.pan.

It is highly recommended to include information from pairwise alignments immediately when creating a new PanCake Data Object, as this noticeably decreases the size of the output file.
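The call described above can also be assembled from a script. The following is a minimal sketch that only builds the argument list and does not run PanCake. It uses the parameters -s, -p and -a from this documentation; the helper name `build_create_command`, the file names, and the assumption that several .fasta files follow -s separated by spaces are illustrative and not confirmed by PanCake itself.

```python
import shlex


def build_create_command(fasta_files, pan_file=None, alignment_file=None):
    """Assemble an argument list for `pancake create` (hypothetical helper)."""
    if not fasta_files:
        raise ValueError("pancake create needs at least one DNA sequence")
    # Assumption: multiple .fasta files are passed after -s, separated by spaces.
    argv = ["pancake", "create", "-s", *fasta_files]
    if pan_file is not None:
        argv += ["-p", pan_file]        # output .pan file
    if alignment_file is not None:
        argv += ["-a", alignment_file]  # include pairwise alignments right away
    return argv


# Hypothetical file names, for illustration only.
argv = build_create_command(
    ["genomeA.fasta", "genomeB.fasta"],
    pan_file="my.pan",
    alignment_file="A_vs_B.delta",
)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` once PanCake is installed and the input files exist.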

#Including pairwise Alignment Information

Information from pairwise alignments can be included during execution of

  • pancake create (PanCake Data Object Initialization)

  • pancake addChromosomes (add Chromosomes to an existing PanCake Object)

  • pancake addAli (include alignment information into an existing PanCake Object)

by parameter -a <ALIGNMENT_FILE>.

Currently, alignments can be provided by two input file types, namely

  • BLAST's default output format type (defined as 'pairwise', output format 0)

  • .delta files, the output file format of nucmer

Types of provided alignment files are detected automatically.

Alignments can be filtered by the following additional parameters:

  • -l <MIN_LEN>: both sequences in a pairwise alignment must have a length greater than or equal to MIN_LEN for the alignment to be included (DEFAULT: MIN_LEN=25)

  • -nsa: if set, pairwise alignments between regions on the same chromosome are excluded from the input alignments (DEFAULT=False)
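
The effect of the two filters can be sketched in a few lines of Python. This is an illustration of the rules as described above, not PanCake's implementation: the `Alignment` record, its field names, and the use of inclusive coordinates are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record of one pairwise alignment (inclusive coordinates).
    chrom_a: str
    start_a: int
    stop_a: int
    chrom_b: str
    start_b: int
    stop_b: int


def aligned_length(start, stop):
    # abs() also covers reverse-strand hits reported with start > stop.
    return abs(stop - start) + 1


def keep(ali, min_len=25, no_self_alignments=False):
    """Return True if the alignment passes the -l and -nsa style filters."""
    if no_self_alignments and ali.chrom_a == ali.chrom_b:
        return False
    return (aligned_length(ali.start_a, ali.stop_a) >= min_len
            and aligned_length(ali.start_b, ali.stop_b) >= min_len)


alignments = [
    Alignment("chr1", 1, 30, "chr2", 100, 129),  # both sides 30 bp
    Alignment("chr1", 1, 10, "chr2", 5, 40),     # first side only 10 bp
    Alignment("chr1", 1, 50, "chr1", 200, 249),  # same chromosome
]
default_result = [keep(a) for a in alignments]
nsa_result = [keep(a, no_self_alignments=True) for a in alignments]
```

With the defaults, only the second alignment is dropped because one of its sequences is shorter than 25 bp. With the self-alignment filter switched on, the third one is dropped as well.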

You can write the resulting PanCake Data Object to a new .pan file specified by parameter -o <NEW_PAN_FILE>. This leaves the original PanCake Object unchanged.

#PanCake Object Overview

An overview of a PanCake Data Object is printed by calling pancake status <PAN_FILE>

Example output:

#!text
PanGenome Object consists of 117 un-aligned FIs & 71236 aligned FIs (organized in 3105 Shared Features)
#
9 chromosomes representing 3 genomes, namely:
#
Genome A. baumannii 1656-2
>CP001923, gi|322509998|gb|CP001923.1|(8041bp)
--> 4 un-aligned Feature Instances (mean length 787.5)
--> 48 aligned Feature Instances (mean length 101.89583333333333) in 24 Shared Features
>CP001921, gi|322506180|gb|CP001921.1|(3940614bp)
--> 36 un-aligned Feature Instances (mean length 1173.9166666666667)
--> 21953 aligned Feature Instances (mean length 177.5772331799754) in 2562 Shared Features
>CP001922, gi|322509896|gb|CP001922.1|(74451bp)
--> 9 un-aligned Feature Instances (mean length 765.8888888888889)
--> 154 aligned Feature Instances (mean length 438.68831168831167) in 61 Shared Features
#
Genome A. baumannii AYE
>gi|169147133|emb|CU459141.1|(3936291bp)
--> 29 un-aligned Feature Instances (mean length 3283.7586206896553)
--> 24662 aligned Feature Instances (mean length 155.74819560457385) in 2399 Shared Features
>gi|169147044|emb|CU459139.1|(2726bp)
--> 1 un-aligned Feature Instances (mean length 2726.0)
--> 0 aligned Feature Instances (mean length 0) in 0 Shared Features
>gi|169147050|emb|CU459140.1|(94413bp)
--> 17 un-aligned Feature Instances (mean length 4591.117647058823)
--> 1128 aligned Feature Instances (mean length 14.50709219858156) in 166 Shared Features
>gi|169147024|emb|CU459137.1|(5644bp)
--> 2 un-aligned Feature Instances (mean length 2385.0)
--> 30 aligned Feature Instances (mean length 29.133333333333333) in 21 Shared Features
>gi|169147032|emb|CU459138.1|(9661bp)
--> 1 un-aligned Feature Instances (mean length 1592.0)
--> 42 aligned Feature Instances (mean length 192.11904761904762) in 33 Shared Features
#
Genome A. baumannii AB307-0294
>gi|213985689|gb|CP001172.1|(3760981bp)
--> 18 un-aligned Feature Instances (mean length 2637.0555555555557)
--> 23219 aligned Feature Instances (mean length 159.93427796201388) in 2252 Shared Features
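
If the status report needs to be processed further, the per-chromosome lines can be read with a small script. The sketch below assumes the line layout shown in the example output above; the function name and the regular expression are illustrative, and the layout may differ between PanCake versions.

```python
import re

# Matches lines such as "--> 48 aligned Feature Instances (mean length ...)".
FI_LINE = re.compile(r"^--> (\d+) (un-aligned|aligned) Feature Instances")


def count_feature_instances(status_text):
    """Sum un-aligned and aligned Feature Instances over all chromosomes."""
    totals = {"un-aligned": 0, "aligned": 0}
    for line in status_text.splitlines():
        match = FI_LINE.match(line.strip())
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# Excerpt of the example output above (first genome only).
sample = """\
Genome A. baumannii 1656-2
>CP001923, gi|322509998|gb|CP001923.1|(8041bp)
--> 4 un-aligned Feature Instances (mean length 787.5)
--> 48 aligned Feature Instances (mean length 101.89583333333333) in 24 Shared Features
>CP001921, gi|322506180|gb|CP001921.1|(3940614bp)
--> 36 un-aligned Feature Instances (mean length 1173.9166666666667)
--> 21953 aligned Feature Instances (mean length 177.5772331799754) in 2562 Shared Features
>CP001922, gi|322509896|gb|CP001922.1|(74451bp)
--> 9 un-aligned Feature Instances (mean length 765.8888888888889)
--> 154 aligned Feature Instances (mean length 438.68831168831167) in 61 Shared Features
"""
totals = count_feature_instances(sample)
```

Applied to the complete example output, the per-chromosome counts add up to the 117 un-aligned and 71236 aligned Feature Instances reported in its first line.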

#Rename Chromosomes and specify Genomes

By default, chromosome names are parsed automatically from the input sequence files, and each chromosome belongs to its own genome, whose name is identical to the chromosome's name. Once a PanCake Object is initialized, chromosome names and the assignment of chromosomes to genomes can be changed by calling pancake specify.

The most convenient way to specify genomes and change chromosome names is to provide a tab-separated file via parameter -f <file>, like

#!text
Genome1    Chromosome1.1    new name of Chromosome1.1
Genome1    Chromosome1.2    new name of Chromosome1.2
Genome2    Chromosome2.1
Genome3    Chromosome3.1 
Genome3    Chromosome3.2    new name of Chromosome3.2

This example file would result in a pangenome including Genome1 (consisting of Chromosome1.1 and Chromosome1.2), Genome2 (consisting of Chromosome2.1) and Genome3 (consisting of Chromosome3.1 and Chromosome3.2). The third column, which provides new names for the chromosomes in the second column, is optional. To apply these specifications to an existing PanCake Object, type

#!text
pancake specify -p path/to/your/panfile -f <file>

NOTE A chromosome may have several names, but no two chromosomes may share the same name. Whenever a chromosome name is specified, it is added as an additional name by which the corresponding chromosome can be addressed.

You can delete a chromosome's name via pancake specify -p path/to/your/panfile -d <name_to_delete>. If <name_to_delete> is the only name of the corresponding chromosome, PanCake will warn you and abort.

Chromosome names and genomes can be specified separately as well. Add an additional chromosome name via

#!text
pancake specify -p path/to/your/panfile -c <chromosome> -n <new_name>

Group chromosomes into genomes via

#!text
pancake specify -p path/to/your/panfile -c <chromosome1.1> <chromosome1.2> ... -g <genome_name>

NOTE In contrast to chromosomes, genomes are only allowed to have a single name. If you specify chrom1 belonging to genome G, and subsequently specify chrom2 belonging to G, chrom1 and chrom2 are part of the same genome.
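
For larger projects, the tab-separated file accepted by pancake specify -f can be generated rather than typed by hand. The sketch below formats rows of genome, chromosome and optional new name in the layout described in this section. The function name `format_spec` and the output file name mentioned afterwards are hypothetical.

```python
def format_spec(rows):
    """Format (genome, chromosome, new_name) rows as tab-separated text.

    new_name may be None or empty, in which case the optional third
    column is omitted for that row.
    """
    lines = []
    for genome, chromosome, new_name in rows:
        fields = [genome, chromosome]
        if new_name:
            fields.append(new_name)
        for field in fields:
            # A tab or line break inside a field would corrupt the columns.
            if "\t" in field or "\n" in field:
                raise ValueError(f"invalid character in field: {field!r}")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


rows = [
    ("Genome1", "Chromosome1.1", "new name of Chromosome1.1"),
    ("Genome1", "Chromosome1.2", "new name of Chromosome1.2"),
    ("Genome2", "Chromosome2.1", None),
]
spec_text = format_spec(rows)
print(spec_text, end="")
```

Writing `spec_text` to a file, for example genomes.tsv, gives a file that can be passed to pancake specify via -f.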

#Identification of Singletons

To get a genome's set of singleton regions type:

#!text
pancake singletons -p path/to/your/panfile -rg <genome>

or, in order to identify the set of singletons on a single chromosome:

#!text
pancake singletons -p path/to/your/panfile -rc <chromosome>

Identified singleton regions are filtered by their length. You can specify a minimum length of valid regions via -l <min_length>.

By default, PanCake computes singleton regions with respect to ALL genomes included in the PanCake Object given by the specified panfile. To restrict this to a subset of genomes (or chromosomes), there are four possibilities, namely

-- specify the genome set explicitly via -nrg <genome1> <genome2> <genome3>

-- specify the chromosomes to which to compare with explicitly via -nrc <chrom1> <chrom2> <chrom3>

-- exclude genomes from the set of genomes to compare with via -eg <genome1> <genome2> <genome3>

-- exclude chromosomes from the set of chromosomes to compare with via -ec <chrom1> <chrom2> <chrom3>

You may combine these parameters in any way. PanCake will give you an overview of the final set of genomes and chromosomes considered in the comparison.

Computation of singletons always produces a BED file containing the identified regions. By default, this is <genome>.bed or <chromosome>.bed, respectively. You can specify an alternative file name via parameter -b <filename>.

By default, PanCake also produces a folder singletons_<genome>/ or singletons_<chromosome>/, respectively, containing one FASTA file per identified region. FASTA output can be suppressed by setting flag -no. An alternative output directory can be specified via -o <folder_name>.

#Identification of Core Genes

To get a genome's core regions, type:

#!text
pancake core -p path/to/your/panfile -rg <genome>

or on a single chromosome:

#!text
pancake core -p path/to/your/panfile -rc <chromosome>

In general, computation of core regions takes the same parameters as the identification of singletons (i.e. specify a minimum length via -l, specify the set of genomes or chromosomes to compare with via -nrg, -nrc, -eg or -ec, specify output files via -b and -o, or suppress FASTA output by setting flag -no).

However, core regions additionally depend on two parameters, namely

-- the maximum space (i.e. number of base pairs) allowed between two consecutive core regions for them to be merged into one valid core region. This maximum space can be specified via parameter -s <max_space> (default=25).

-- the maximum fraction of an identified core region allowed to be not aligned to ALL chromosomes under consideration. This fraction can be specified via parameter -f <max_frac> (default=0.05). See an example below.

EXAMPLE Consider a chromosome with subsequences between positions 1 and 30, as well as between positions 50 and 75, identified as part of the core genome. The region between 31 and 49 is known not to be aligned to all other genomes under consideration, and is hence not a valid part of a core region.
The gap of 19 bp does not exceed the default maximum space of 25. However, merging the region between 1 and 75 into one big core region would yield a fraction of non-valid positions of (49-31+1)/75=0.253 within the resulting 'core region'. As 0.253 is greater than the maximum fraction of 0.05, both core regions will appear separately in the output.
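
The arithmetic of this example can be reproduced with a short sketch. It is an illustration of the two criteria, not PanCake's implementation: the function name is hypothetical, the whole gap is assumed to be unaligned (as in the example), and both thresholds are assumed to be inclusive.

```python
def can_merge(first, second, max_space=25, max_frac=0.05):
    """Decide whether two consecutive core regions may be merged.

    first, second: (start, stop) tuples with 1-based inclusive positions,
    where `first` lies before `second` on the same chromosome.
    Returns (decision, gap, fraction of non-valid positions).
    """
    gap = second[0] - first[1] - 1          # bases strictly between the regions
    merged_length = second[1] - first[0] + 1
    frac = gap / merged_length              # assumes the entire gap is non-valid
    return (gap <= max_space and frac <= max_frac), gap, frac


# The example above: core regions 1..30 and 50..75.
merged, gap, frac = can_merge((1, 30), (50, 75))
print(merged, gap, round(frac, 3))
```

Here the gap is within the maximum space, so it is the fraction criterion alone that keeps the two regions separate.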

#Retrieval of sequence data

At any time, a PanCake Object allows error-free retrieval of the sequences of all included chromosomes.

Calling

#!text
pancake sequence -p <panfile> -start <start_pos> -stop <stop_pos> <chromosome>

will print the sequence of <chromosome> between <start_pos> and <stop_pos> (both inclusive) to standard output.

If an output file is provided via -f <file_name>, the sequence is written to <file_name> (including a FASTA header). Line breaks are inserted after every l-th position, where l can be specified via parameter -l <l>. By default, lines of the resulting FASTA file have length l=100.

If no <start_pos> is given, the sequence starts at base position 1 of <chromosome>. Accordingly, if no <stop_pos> is given, the sequence ends at the last base of <chromosome>.
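
The coordinate convention and the line wrapping described above can be illustrated as follows. This is a sketch of 1-based inclusive positions and fixed-width FASTA lines, not PanCake's own code. The function names are hypothetical, and the exact FASTA header that PanCake writes is not specified here, so the header is passed in by the caller.

```python
def subsequence(seq, start=None, stop=None):
    """Return seq[start..stop] using 1-based, inclusive positions."""
    start = 1 if start is None else start       # default: first base
    stop = len(seq) if stop is None else stop   # default: last base
    if not 1 <= start <= stop <= len(seq):
        raise ValueError("positions out of range")
    return seq[start - 1:stop]


def to_fasta(header, seq, width=100):
    """Format a sequence as FASTA text with a line break every `width` bases."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


chromosome = "ACGTACGT"
part = subsequence(chromosome, 2, 4)    # bases 2, 3 and 4
tail = subsequence(chromosome, start=6) # from base 6 to the last base
record = to_fasta("example", "A" * 250)
```

With the default width of 100, a 250 bp sequence is written as two lines of 100 bases followed by one line of 50 bases.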
