PanPhlAn / panphlan_pangenome_generation


Pangenome databases are available for more than 400 species → download page

How to generate a user-specific pangenome database?

Example of generating a PanPhlAn pangenome database of Eubacterium rectale based on five reference genomes available at NCBI.

1) Download all five genome (.fna) files and the corresponding gene annotation (.gff) files from NCBI

wget -P fna/
wget -P fna/
wget -P fna/
wget -P fna/
wget -P fna/

wget -P gff/
wget -P gff/
wget -P gff/
wget -P gff/
wget -P gff/

Genomes are located in the folder fna/; gene annotation files are located in gff/.
Rename the files so that each has a short filename, which will be used as the genome ID.

ls fna
GCF_000020605.fna.gz  GCF_001404855.fna.gz  GCF_001405295.fna.gz  GCF_001406375.fna.gz  GCF_001406835.fna.gz

ls gff
GCF_000020605.gff.gz  GCF_001404855.gff.gz  GCF_001405295.gff.gz  GCF_001406375.gff.gz  GCF_001406835.gff.gz
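
The renaming step can be scripted. The following is a minimal sketch, assuming the downloaded files carry NCBI-style names of the form <accession>.<version>_<assembly>_genomic.fna.gz (for example GCF_000020605.1_..._genomic.fna.gz); the helper name shorten_names is hypothetical and not part of PanPhlAn. It keeps only the text before the first dot as the genome ID.

```shell
# Hypothetical helper (not part of PanPhlAn): shorten NCBI-style filenames
# so that only the accession before the first dot remains as genome ID.
# Assumption: files are named <accession>.<version>_<assembly>_genomic.<ext>.gz
shorten_names() {
    dir=$1   # folder holding the downloaded files, e.g. fna
    ext=$2   # file extension without dots, e.g. fna or gff
    for f in "$dir"/*_genomic."$ext".gz; do
        [ -e "$f" ] || continue      # nothing matched: skip quietly
        base=$(basename "$f")
        id=${base%%.*}               # text before the first dot
        mv -- "$f" "$dir/$id.$ext.gz"
    done
}

shorten_names fna fna
shorten_names gff gff
```

Because the same rule is applied to both folders, a genome and its annotation file end up with the same genome ID.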

2) Download the PanPhlAn software

a) Download by using wget

mkdir panphlan
mv CibioCM-panphlan-*/panphlan_* panphlan/

b) or by using the hg command

hg clone

3) Run PanPhlAn to generate the pangenome database

./panphlan/ -c erectale17 --i_fna fna/ --i_gff gff/ -o database/ --verbose

Option -c specifies the species database name to use in PanPhlAn: erectale17 (Eubacterium rectale, version 2017).

The 8 generated database files are located in the -o output folder database/ and can be moved to the BOWTIE2_INDEXES directory, if it exists.

ls database/

mv database/panphlan_erectale17* $BOWTIE2_INDEXES
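
Since the BOWTIE2_INDEXES directory may not exist on every system, the move can be guarded. This is a minimal sketch; the function name move_database is hypothetical, and it only assumes that the database files share the prefix panphlan_ followed by the name given to -c, as in the mv command above.

```shell
# Hypothetical helper: move the database files only when the
# BOWTIE2_INDEXES variable is set and points to an existing directory.
move_database() {
    src=$1    # output folder that was given to -o
    name=$2   # database name that was given to -c
    if [ -n "${BOWTIE2_INDEXES:-}" ] && [ -d "$BOWTIE2_INDEXES" ]; then
        mv -- "$src"/panphlan_"$name"* "$BOWTIE2_INDEXES"/
    else
        echo "BOWTIE2_INDEXES is not set or not a directory; files stay in $src" >&2
    fi
}
```

For the example above the call would be: move_database database erectale17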


  • -c clade or species database name
  • --i_fna input folder for genome sequences
  • --i_gff input folder for gene annotation files (gene locations)
  • --tmp folder for saving temporary result files
  • -o output folder for the pangenome database
  • --uc keeps the additional output files of the usearch7 clustering
  • --verbose displays progress information

4) Check profiles of reference genomes

cd database/
../panphlan/ -c erectale17 --add_strains --o_dna genefamily_presence_absence.tsv

genefamily_presence_absence.tsv contains the gene-family profiles of the reference genomes. It can help to detect outlier reference genomes that are not related to the species.
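
One simple way to look at these profiles is to count the gene families marked as present in each genome. The sketch below rests on an assumption about the file layout that is not stated on this page: a tab-separated matrix with genome IDs in the header row, gene-family IDs in the first column, and 1 for presence. The function name count_families is hypothetical.

```shell
# Hypothetical helper: count gene families marked present (1) per genome.
# Assumption: tab-separated matrix, header row = genome IDs,
# first column = gene-family ID, cell value 1 = present.
count_families() {
    awk -F '\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= n; i++) if ($i == 1) count[i]++ }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}
```

Called as count_families genefamily_presence_absence.tsv, it prints one line per genome. A genome whose count differs strongly from the others may deserve a closer look, although the count alone does not prove that it is an outlier.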

Help -h

./panphlan/ -h
                        Folder containing the .ffn gene sequence files
                        Folder containing the .fna genome sequence files
                        Folder containing the .gff gene annotation files
                        Name of the species pangenome database, for example:
                        -c ecoli17
                        Result folder for all database files
                        Threshold of gene sequence similarity (in percentage),
                        default: 95.0 %.
  --tmp TEMP_FOLDER     Folder for temporary files, default: TMP_panphlan_db
  --uc                  Keep all usearch7 output files
  --verbose             Show progress information
  -v, --version         Prints the current PanPhlAn version and exits

requires Usearch 7


What about plasmids and contigs?

Each strain is represented by a single genome .fna FASTA file plus one gene file: either an .ffn file of gene sequences or a .gff file of gene annotations. All contigs and plasmids of a strain have to be in the same .fna multi-FASTA file. In the same way, all gene information of a strain has to be in a single .ffn or .gff file.
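
The two requirements above can be handled with a few lines of shell. This is a minimal sketch with hypothetical helper names: merge_strain concatenates the uncompressed sequence files of one strain (chromosome, plasmids, contigs) into one multi-FASTA file, and check_pairs reports genomes in the .fna folder that have no .gff file with the same genome ID. The check covers the .gff case only.

```shell
# Hypothetical helper: write all sequence files of one strain into a
# single multi-FASTA file. Assumption: inputs are uncompressed FASTA.
merge_strain() {
    out=$1; shift
    cat -- "$@" > "$out"
}

# Hypothetical helper: verify that every genome file has an annotation
# file with the same genome ID. Returns 1 if any annotation is missing.
check_pairs() {
    fna_dir=$1; gff_dir=$2; missing=0
    for f in "$fna_dir"/*.fna*; do
        [ -e "$f" ] || continue
        base=$(basename "$f")
        id=${base%%.fna*}            # genome ID = name before .fna
        if ! ls "$gff_dir/$id".gff* >/dev/null 2>&1; then
            echo "no annotation file for genome $id" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, merge_strain fna/strainA.fna chromosome.fna plasmid1.fna would build the genome file of a strain named strainA (file names here are placeholders), and check_pairs fna gff would then confirm the pairing before the database is generated.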

See also:

How to find and download reference genomes from NCBI?
How to import Roary pangenome into PanPhlAn?

Next step

Screen your metagenomic samples for species-related genes by mapping them against the species database.
→ panphlan_map