PanPhlAn FAQ

How to get the gene sequences and functional annotation

PanPhlAn works with gene-family clusters. The centroid sequence of each gene-family cluster is stored in the pangenome database file panphlan_species_centroids.ffn. The gene-family name is included in the gene ID of each FASTA header:

>speciesID:geneFamilyID:originalGeneID 
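
As a small illustration of this format, a header can be split into its three fields in Python. This is a minimal sketch and not part of PanPhlAn; the function name parse_gene_id is hypothetical. Because the original gene ID can itself contain colons (as in the NCBI-style IDs of the example further down), only the first two colons are treated as separators.

```python
def parse_gene_id(header: str) -> dict:
    """Split a centroid FASTA header into species, gene family and original gene ID."""
    # Drop surrounding whitespace and the leading '>' of the FASTA header.
    gene_id = header.strip().lstrip(">")
    # Split on the first two colons only: the original gene ID may contain colons.
    # A header with fewer than three fields raises ValueError here.
    species_id, family_id, original_id = gene_id.split(":", 2)
    return {"species": species_id, "family": family_id, "original": original_id}


info = parse_gene_id(">ecoli14:g01310:gi|16445223|ref|NC_002655.2|:1353261-1353530")
print(info["species"], info["family"], info["original"])
```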

Example

In our → German 2011 E. coli outbreak analysis, gene-family g01310 was found to be present in the detected E. coli outbreak strains, but not in other samples. How can we get the sequence of gene-family g01310, and how can we find out the function of this gene-family cluster?

a) get sequence

The sequence of gene-family g01310 from the pangenome database ecoli14 (E. coli database, version 2014) can be extracted from the file panphlan_ecoli14_centroids.ffn using grep:

grep -A 10 g01310 panphlan_ecoli14_centroids.ffn 
>ecoli14:g01310:gi|16445223|ref|NC_002655.2|:1353261-1353530
ATGAAGAAGATGTTTATGGCGGTTTTATTTGCATTAGCTTCTGTTAATGCAATGGCGGCGGATTGTGCTAAAGGTAAAAT
TGAGTTTTCCAAGTATAATGAGGATGACACATTTACAGTGAAGGTTGACGGGAAAGAATACTGGACCAGTCGCTGGAATC
TGCAACCGTTACTGCAAAGTGCTCAGTTGACAGGAATGACTGTCACAATCAAATCCAGTACCTGTGAATCAGGCTCCGGA
TTTGCTGAAGTGCAGTTTAATAATGACTGA
>ecoli14:g32172:gi|545268310|ref|NZ_KE701673.1|:23240-23509
ATGGCTACTTTTGATTTTACacacCTCAATGGATTAACACAAATCAAAGCCTTGTTTCCAGAACTTACAGAGAAACAATT
TAGGGTTACGTTAAGTTGGGTTtttGGAAGTGAAATCATTGATATAGCGAGCGAGCATGAGTGCTCGATTGAAGCGGTAA
aaaaaaCATTGCAGAGAAGTAAGCTAGCTCTTGGTTCGGAGCGGCTTGAGGCTGTAAGAGTAATCTTCTTGTGCAGGATA
ATGGCTGATCTATGGACTAGGGTAAGATAA
>ecoli14:g01356:gi|16445223|ref|NC_002655.2|:1927369-1927638

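The first five lines (the header and four sequence lines) show the requested sequence of gene-family g01310. The remaining lines belong to the following entries, because grep -A 10 prints the ten lines after the match regardless of where the record ends.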
For older centroid.ffn files that do not include the gene-family-ID, see → PanPhlAn FAQ
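
For sequences longer than a few lines, a fixed grep -A count may cut the record short or include neighbouring entries. The following Python function is a minimal sketch of an alternative, assuming the header format >speciesID:geneFamilyID:originalGeneID described above; extract_family is a hypothetical helper, not a PanPhlAn tool. It matches the gene-family field exactly and collects sequence lines until the next header. The demo input is the grep output shown above, shortened after the second header.

```python
def extract_family(lines, family_id):
    """Return (header, sequence) for the first record of family_id, or None."""
    header, seq_parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # next record starts, so the requested one is complete
            fields = line[1:].split(":", 2)
            # Compare the second field exactly instead of searching a substring.
            if len(fields) >= 2 and fields[1] == family_id:
                header = line
        elif header is not None and line:
            seq_parts.append(line)
    if header is None:
        return None
    return header, "".join(seq_parts)


demo = [
    ">ecoli14:g01310:gi|16445223|ref|NC_002655.2|:1353261-1353530",
    "ATGAAGAAGATGTTTATGGCGGTTTTATTTGCATTAGCTTCTGTTAATGCAATGGCGGCGGATTGTGCTAAAGGTAAAAT",
    "TGAGTTTTCCAAGTATAATGAGGATGACACATTTACAGTGAAGGTTGACGGGAAAGAATACTGGACCAGTCGCTGGAATC",
    "TGCAACCGTTACTGCAAAGTGCTCAGTTGACAGGAATGACTGTCACAATCAAATCCAGTACCTGTGAATCAGGCTCCGGA",
    "TTTGCTGAAGTGCAGTTTAATAATGACTGA",
    ">ecoli14:g32172:gi|545268310|ref|NZ_KE701673.1|:23240-23509",
    "ATGGCTACTTTTGATTTTACacacCTCAATGGATTAACACAAATCAAAGCCTTGTTTCCAGAACTTACAGAGAAACAATT",
]
header, sequence = extract_family(demo, "g01310")
print(header)
print(sequence)
```

To run this on the real database, pass an open file handle instead of the demo list, for example extract_family(open("panphlan_ecoli14_centroids.ffn"), "g01310"). The sequence is returned as a single unwrapped string, which can be pasted directly into a BLAST form.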

b) get gene-family function using KEGG or UniProt

Copy the sequence of gene-family g01310

ATGAAGAAGATGTTTATGGCGGTTTTATTTGCATTAGCTTCTGTTAATGCAATGGCGGCGGATTGTGCTAAAGGTAAAAT
TGAGTTTTCCAAGTATAATGAGGATGACACATTTACAGTGAAGGTTGACGGGAAAGAATACTGGACCAGTCGCTGGAATC
TGCAACCGTTACTGCAAAGTGCTCAGTTGACAGGAATGACTGTCACAATCAAATCCAGTACCTGTGAATCAGGCTCCGGA
TTTGCTGAAGTGCAGTTTAATAATGACTGA

and BLAST it against the → KEGG database
(paste the sequence into the field "Sequence data", select BLASTN, and click Compute)

or BLAST it against the → UniProt database
(paste the sequence and click Run BLAST)

Resulting KEGG annotation of gene-family g01310:
stx2B Shiga toxin 2 subunit B

See also:

→ How to map the complete PanPhlAn pangenome database against KEGG?
