Clone wiki

PanPhlAn / Tutorial

For a quick introduction, read our PanPhlAn wiki


PanPhlAn Tutorial

Screening for Eubacterium rectale strains in gut metagenomes

From the Human Microbiome Project (Huttenhower, 2012), we select 4 samples. Two of each are from the same subject and hence are expected to contain the same strain.

a) Download and install PanPhlAn

hg clone https://bitbucket.org/CibioCM/panphlan

see also: → Download and Installation.

b) Download metagenomic samples

The 4 samples from the Human Microbiome Project are of subjects ID_763496533 and ID_763577454 at 2 different times.

sampleID_subjectID

SRS013951_763496533
SRS019161_763496533

SRS014459_763577454
SRS015065_763577454

We download and save the metagenomes into the folder 'samples'

wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AACyBh4CUJFJ1P6a-ypzHvlua/SRS013951.tar.bz2
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AAA4CVLuQVzg9-9ql4pAQxL0a/SRS014459.tar.bz2
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AABslonWx4_V7L9ugEbSN3XFa/SRS015065.tar.bz2
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AACanDqRyhaX0G6Dtf4NPTVwa/SRS019161.tar.bz2

Now folder samples/ contains 4 files

ls samples/
SRS013951.tar.bz2  SRS014459.tar.bz2  SRS015065.tar.bz2  SRS019161.tar.bz2

c) Download the species pangenome database

To screen our samples for Eubacterium rectale strains, we need to download the related pangenome database:
erectale15 Eubacterium rectale (version 2015).

Download and decompress database files:

wget http://www.matthias-scholz.de/panphlan_erectale15.zip
unzip panphlan_erectale15.zip

Our species database includes 8 files:

ls 
 panphlan_erectale15.1.bt2
 panphlan_erectale15.2.bt2
 panphlan_erectale15.3.bt2
 panphlan_erectale15.4.bt2
 panphlan_erectale15_centroids.ffn
 panphlan_erectale15_pangenome.csv
 panphlan_erectale15.rev.1.bt2
 panphlan_erectale15.rev.2.bt2

If you have a $BOWTIE2_INDEXES directory, you can move the database files there. Otherwise keep the files in your working directory.

mv panphlan_erectale15* $BOWTIE2_INDEXES

Note, you can create your own database using the script panphlan_pangenome_generation.py or simply download our pre-processed pangenome databases provided for more than 400 species: → download page

d) Map samples against Eubacterium rectale pangenome

We run panphlan_map on each sample using the E. rectale pangenome database

./panphlan/panphlan_map.py -c erectale15 -i samples/SRS013951.tar.bz2 -o map_results/SRS013951_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS014459.tar.bz2 -o map_results/SRS014459_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS015065.tar.bz2 -o map_results/SRS015065_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS019161.tar.bz2 -o map_results/SRS019161_erectale15.csv --verbose

Now we have the mapping results of all samples. These files are used in the next step to extract strain-specific gene-family profiles.

ls map_results/
SRS013951_erectale15.csv.bz2 SRS014459_erectale15.csv.bz2 SRS015065_erectale15.csv.bz2  SRS019161_erectale15.csv.bz2

read more about → panphlan_map.py

e) Merge mapping results into final gene-family profiles

We use panphlan_profile.py to get the final gene-family presence/absence table. By using option --add_strains, we also add reference genome profiles to the profile table which later can be used in a combined analysis of sample-strains and reference genomes.

./panphlan/panphlan_profile.py -c erectale15 -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains --verbose

Gene-family profiles
File result_gene_presence_absence.csv contains presence/absence (1 or 0) gene-family profiles for strains detected in our 4 samples and also for the reference genomes. Listed are all pangenome gene-families present at least in one sample or reference genome. Now, we can use these profiles for creating heatmap and PCoA plots.

head result_gene_presence_absence.csv 

    SRS013951 SRS014459 SRS015065 SRS019161 CP001107 FP929042 FP929043
g000001   1        1        1        1        1        1        1
g000002   1        1        1        1        1        0        0
g000003   1        1        0        1        1        0        0
g000004   0        0        0        0        1        0        0
g000005   0        0        0        0        1        0        0
g000006   0        0        0        0        1        0        0
g000007   0        0        0        0        1        0        0
g000008   0        0        0        0        1        0        0
g000009   0        0        0        0        1        0        0
°°°

read more about → panphlan_profile.py

f) Hierarchical clustering

The presence/absence gene-family profiles (option: --o_dna) can be visualized by a hierarchical clustering. The result of our example confirms that:
Strains detected in samples of the same subject cluster together due to almost identical profiles.

E. rectale heatmap

Clustering is based on Jaccard distance between presence/absence profiles and can be created by any hierachical clustering software, including hclust, Matlab, or R.

Example of using hclust2 for creating a hierarchical clustering heatmap

./panphlan/tools/hclust2/hclust2.py -i result_gene_presence_absence.csv -o heatmap.png --legend_file legend.png --image_size 30 --cell_aspect_ratio 0.01 --dpi 300 --slabel_size 24 --f_dist_f jaccard --s_dist_f jaccard

read more about hclust2 at → bitbucket/hclust2 and → MetaPhlAn2.


see also:
→ How to generate a user-specific pangenome database?
→ How to process RNA-seq data?

Updated