PanPhlAn

Pangenome databases are available for more than 400 species → download page

How to generate a user-specific pangenome database?

The example below generates a PanPhlAn pangenome database for Eubacterium rectale from five reference genomes available at NCBI.

1) Download the five genome (.fna) files and the corresponding gene annotation (.gff) files from NCBI

wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.fna.gz

wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.gff.gz

The genomes are now located in the folder fna/ and the gene annotation files in gff/.
Rename the files so that each has a short filename, which is used as the genome ID. After renaming, the two folders look like this:

ls fna
GCF_000020605.fna.gz  GCF_001404855.fna.gz  GCF_001405295.fna.gz  GCF_001406375.fna.gz  GCF_001406835.fna.gz

ls gff
GCF_000020605.gff.gz  GCF_001404855.gff.gz  GCF_001405295.gff.gz  GCF_001406375.gff.gz  GCF_001406835.gff.gz
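
The renaming commands themselves are not listed here. A minimal sketch of one way to do it: keep only the text before the first dot of each downloaded filename (for example GCF_000020605) as the short name. The helper name shorten_names is hypothetical and not part of PanPhlAn, and the sketch assumes the files still carry the `_genomic.fna.gz` and `_genomic.gff.gz` names produced by the wget commands.

```shell
# Rename NCBI downloads to "<accession>.<ext>.gz", for example
# GCF_000020605.1_ASM2060v1_genomic.fna.gz -> GCF_000020605.fna.gz
# (shorten_names is a hypothetical helper, not a PanPhlAn command)
shorten_names() {
    dir=$1   # folder holding the downloads, e.g. fna
    ext=$2   # file type: fna or gff
    for f in "$dir"/*_genomic."$ext".gz; do
        [ -e "$f" ] || continue      # no matching files: nothing to do
        base=$(basename "$f")
        id=${base%%.*}               # keep the text before the first dot
        mv -- "$f" "$dir/$id.$ext.gz"
    done
}

shorten_names fna fna
shorten_names gff gff
```

Because the same rule is applied to both folders, the .fna and .gff files of a genome end up with the same ID, as in the listings above.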

2) Download the PanPhlAn software

a) Download by using wget

wget https://bitbucket.org/cibiocm/panphlan/get/default.zip
unzip default.zip
mkdir panphlan
mv CibioCM-panphlan-*/panphlan_* panphlan/

b) Or clone the repository by using the hg (Mercurial) command

hg clone https://bitbucket.org/CibioCM/panphlan

3) Run PanPhlAn to generate the pangenome database

./panphlan/panphlan_pangenome_generation.py -c erectale17 --i_fna fna/ --i_gff gff/ -o database/ --verbose

Option -c specifies the species database name to use in PanPhlAn: erectale17 (Eubacterium rectale, version 2017).

The eight generated database files are located in the -o output folder database/ and can be moved to the BOWTIE2_INDEXES directory, if it exists.

ls database/
 panphlan_erectale17.1.bt2
 panphlan_erectale17.2.bt2
 panphlan_erectale17.3.bt2
 panphlan_erectale17.4.bt2
 panphlan_erectale17_centroids.ffn
 panphlan_erectale17_pangenome.csv
 panphlan_erectale17.rev.1.bt2
 panphlan_erectale17.rev.2.bt2
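
Before moving the files, it can be worth confirming that all eight are present and non-empty. The following is a minimal sketch based only on the file names in the listing above; check_db is a hypothetical helper, not a PanPhlAn command.

```shell
# Report any of the eight expected database files that is missing or empty.
# (check_db is a hypothetical helper; the file names follow the listing above)
check_db() {
    dir=$1     # output folder, e.g. database
    clade=$2   # database name given with -c, e.g. erectale17
    missing=0
    for suffix in .1.bt2 .2.bt2 .3.bt2 .4.bt2 .rev.1.bt2 .rev.2.bt2 \
                  _centroids.ffn _pangenome.csv; do
        if [ ! -s "$dir/panphlan_$clade$suffix" ]; then
            echo "missing or empty: $dir/panphlan_$clade$suffix" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]   # succeed only if nothing is missing
}

check_db database erectale17 || echo "database/ is incomplete" >&2
```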

mv database/panphlan_erectale17* $BOWTIE2_INDEXES

Options

  • -c to specify the clade or species database name
  • --i_fna input folder for genome sequences
  • --i_gff input folder for gene annotation files (gene location)
  • --tmp folder for saving temporary result files
  • -o output folder for the pangenome database
  • --uc to keep the additional output files of the usearch7 clustering
  • --verbose to display progress information

4) Check profiles of reference genomes

cd database/
../panphlan/panphlan_profile.py -c erectale17 --add_strains --o_dna genefamily_presence_absence.tsv

genefamily_presence_absence.tsv contains the gene-family profiles of the reference genomes. It can help to detect outlier reference genomes that are not related to the species.
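
One simple way to look for outliers is to count the gene families present in each genome and compare the totals. The sketch below assumes the file is a tab-separated 0/1 matrix with a header row of genome names, gene families in rows, and the gene-family ID in the first column; check the file before relying on it. The helper name count_families is hypothetical.

```shell
# Print "<genome><TAB><number of gene families present>" for every column.
# Assumed layout: header row of genome names, first column = gene-family ID,
# remaining cells 1 (present) or 0 (absent).
count_families() {
    awk -F'\t' '
        NR == 1 { n = NF; for (i = 2; i <= n; i++) name[i] = $i; next }
        { for (i = 2; i <= n; i++) if ($i == 1) count[i]++ }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}

if [ -f genefamily_presence_absence.tsv ]; then
    count_families genefamily_presence_absence.tsv
fi
```

A genome whose total is far from the others is a candidate for closer inspection; the count alone does not prove that it belongs to a different species.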

Help -h

./panphlan/panphlan_pangenome_generation.py -h
  --i_ffn INPUT_FFN_FOLDER
                        Folder containing the .ffn gene sequence files
  --i_fna INPUT_FNA_FOLDER
                        Folder containing the .fna genome sequence files
  --i_gff INPUT_GFF_FOLDER
                        Folder containing the .gff gene annotation files
  -c CLADE_NAME, --clade CLADE_NAME
                        Name of the species pangenome database, for example:
                        -c ecoli17
  -o OUTPUT_FOLDER, --output OUTPUT_FOLDER
                        Result folder for all database files
  --th IDENTITY_PERCENATGE
                        Threshold of gene sequence similarity (in percentage),
                        default: 95.0 %.
  --tmp TEMP_FOLDER     Folder for temporary files, default: TMP_panphlan_db
  --uc                  Keep all usearch7 output files
  --verbose             Show progress information
  -v, --version         Prints the current PanPhlAn version and exits

panphlan_pangenome_generation.py requires Usearch 7
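
A quick pre-flight check can confirm that the required programs are on the PATH before starting a run. This is a sketch only: require_tools is a hypothetical helper, and the executable names usearch7 and bowtie2-build are assumptions about how the tools are installed on your system.

```shell
# Succeed only if every named program can be found on the PATH.
# (require_tools is a hypothetical helper, not part of PanPhlAn)
require_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            status=1
        fi
    done
    return "$status"
}

# Executable names below are assumptions; adjust them to your installation.
require_tools usearch7 bowtie2-build || echo "install the missing tools first" >&2
```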

FAQ

What about plasmids and contigs?

Each strain is represented by a single genome .fna FASTA file plus one additional file: either an .ffn file of gene sequences or a .gff file of gene annotations. All contigs and plasmids of a strain have to be in the same .fna multi-FASTA file. In the same way, all gene information of a strain has to be in a single .ffn or .gff file.
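
If a strain's chromosome and plasmids come as separate FASTA files, they can be concatenated into one multi-FASTA file. A minimal sketch, assuming uncompressed input files; merge_strain and all filenames in the example call are hypothetical.

```shell
# Concatenate several FASTA files of one strain into a single multi-FASTA
# file and print the number of sequences it contains.
# (merge_strain is a hypothetical helper; inputs must be uncompressed)
merge_strain() {
    out=$1   # combined output file, e.g. fna/strainA.fna
    shift    # remaining arguments: the input FASTA files
    cat -- "$@" > "$out"
    grep -c '^>' "$out"   # one header line per contig or plasmid
}

# Example call (hypothetical filenames):
# merge_strain fna/strainA.fna strainA_chromosome.fna strainA_plasmid1.fna
```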

See also:

How to find and download reference genomes from NCBI?
How to import Roary pangenome into PanPhlAn?

Next step

Screen your metagenomic samples for species-related genes by mapping them against the species database.
→ panphlan_map
