Pangenome databases are available for more than 400 species

How to generate a user-specific pangenome database?

Example of generating a PanPhlAn pangenome database of Eubacterium rectale based on five reference genomes available at NCBI.

1) Download the five genome (.fna) files and the corresponding gene annotation (.gff) files from NCBI

wget -P fna/
wget -P fna/
wget -P fna/
wget -P fna/
wget -P fna/

wget -P gff/
wget -P gff/
wget -P gff/
wget -P gff/
wget -P gff/

Genomes are located in the folder fna/; gene annotation files are located in gff/.
Rename the files so that each has a short filename, which will be used as the genome-ID.

ls fna
GCF_000020605.fna.gz  GCF_001404855.fna.gz  GCF_001405295.fna.gz  GCF_001406375.fna.gz  GCF_001406835.fna.gz

ls gff
GCF_000020605.gff.gz  GCF_001404855.gff.gz  GCF_001405295.gff.gz  GCF_001406375.gff.gz  GCF_001406835.gff.gz
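
The renaming step can be scripted. The following is a minimal sketch, not part of PanPhlAn: it assumes the downloaded files still carry NCBI-style names in which the accession is followed by a dot (the name GCF_000020605.1_ASM2060v1_genomic.fna.gz in the comment is only an illustration), and the helper shorten_names is a hypothetical name introduced here.

```shell
# Shorten long download names to a genome-ID, for example
#   GCF_000020605.1_ASM2060v1_genomic.fna.gz -> GCF_000020605.fna.gz
# Assumption: the genome-ID is the part of the filename before the first dot.
shorten_names() {
    dir=$1   # folder holding the files, e.g. fna
    ext=$2   # extension to keep, e.g. fna.gz
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue          # folder empty or missing: nothing to do
        base=$(basename "$f")
        id=${base%%.*}                   # text before the first dot
        target="$dir/$id.$ext"
        # Skip files that already have the short name
        [ "$f" = "$target" ] || mv -- "$f" "$target"
    done
}

shorten_names fna fna.gz
shorten_names gff gff.gz
```

Because both folders are renamed by the same rule, each .fna file and its .gff file end up with the same genome-ID.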

2) Download the PanPhlAn software

a) Download by using wget

mkdir panphlan
mv CibioCM-panphlan-*/panphlan_* panphlan/

b) or by using the hg command

hg clone

3) Run PanPhlAn to generate the pangenome database

./panphlan/ -c erectale17 --i_fna fna/ --i_gff gff/ -o database/ --verbose

Option -c specifies the species database name to use in PanPhlAn: erectale17 (Eubacterium rectale, version 2017).

The 8 generated database files are located in the -o output folder database/ and can be moved to the BOWTIE2_INDEXES directory, if it exists.

ls database/

mv database/panphlan_erectale17* $BOWTIE2_INDEXES
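
The move only makes sense when BOWTIE2_INDEXES points to an existing directory. A small guard along the following lines could be used; install_db is a hypothetical helper name, not a PanPhlAn command.

```shell
# Move the database files only if the destination directory exists.
# Arguments: source folder, database name, destination directory.
install_db() {
    src=$1
    name=$2
    dest=$3
    if [ -n "$dest" ] && [ -d "$dest" ]; then
        mv "$src"/panphlan_"$name"* "$dest"/
    else
        echo "destination not set or not a directory; files stay in $src/" >&2
        return 1
    fi
}
```

For the example above, the call would be: install_db database erectale17 "$BOWTIE2_INDEXES"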


  • -c to specify the clade or species database name
  • --i_fna input folder for genome sequences
  • --i_gff input folder for gene annotation files (gene location)
  • --tmp folder for saving temporary result files
  • -o output folder for the pangenome database
  • --uc to keep the additional output files of the usearch7 clustering
  • --verbose to display progress information

4) Check profiles of reference genomes

cd database/
../panphlan/ -c erectale17 --add_strains --o_dna genefamily_presence_absence.tsv

genefamily_presence_absence.tsv contains the gene-family profiles of the reference genomes. It can be used to detect outlier reference genomes that are not related to the species.
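
One simple way to look for outliers is to count how many gene families are present in each reference genome. The sketch below assumes a layout that is not confirmed by this page: a tab-separated file whose header row lists the genome-IDs from column 2 onwards, with one gene family per row and 1 marking presence. The function name count_families is hypothetical.

```shell
# Print "genome-ID <tab> number of gene families present" for each genome.
# Assumed layout: header row with genome-IDs in columns 2..N,
# one gene family per following row, cell value 1 = present.
count_families() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= n; i++) if ($i == 1) count[i]++ }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}
```

Calling count_families genefamily_presence_absence.tsv prints one line per genome. A genome whose count is far below or above the others is a candidate for closer inspection; the count alone does not prove that it belongs to another species.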

Help -h

./panphlan/ -h
                        Folder containing the .ffn gene sequence files
                        Folder containing the .fna genome sequence files
                        Folder containing the .gff gene annotation files
                        Name of the species pangenome database, for example:
                        -c ecoli17
                        Result folder for all database files
                        Threshold of gene sequence similarity (in percentage),
                        default: 95.0 %.
  --tmp TEMP_FOLDER     Folder for temporary files, default: TMP_panphlan_db
  --uc                  Keep all usearch7 output files
  --verbose             Show progress information
  -v, --version         Prints the current PanPhlAn version and exits

Requires Usearch 7.


What about plasmids and contigs?

Each strain is represented by a single genome .fna fasta file and an additional file with gene information, either an .ffn file of gene sequences or a .gff file of gene annotations. All contigs and plasmids of a strain have to be in the same .fna multi-fasta file. In the same way, all gene information of a strain has to be in a single .ffn or .gff file.
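
If the chromosome and plasmids of a strain were downloaded as separate uncompressed fasta files, they can be combined into one multi-fasta file before building the database. This is a minimal sketch; merge_fasta is a hypothetical helper, and the file names in the usage line are placeholders.

```shell
# Combine several fasta files into one multi-fasta file.
# First argument: output file; remaining arguments: input files.
# "awk 1" copies every line and adds a missing final newline, so the
# header of the next file never ends up glued to a sequence line.
merge_fasta() {
    out=$1
    shift
    awk 1 "$@" > "$out"
}

# Usage (placeholder names):
#   merge_fasta fna/strainA.fna chromosome.fna plasmid1.fna plasmid2.fna
#   grep -c '^>' fna/strainA.fna    # number of sequences in the merged file
```

The same approach applies to the gene files: all .ffn or .gff content of one strain goes into a single file that carries the same genome-ID as the .fna file.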

