PanPhlAn profile

panphlan_profile merges and processes the panphlan_map results to produce the final gene-family presence/absence profiles of the strains detected in the samples, or to extract the transcriptional activity of individual strains from paired DNA and RNA-seq data of the same sample.

How to get gene-family presence/absence profiles

Before running panphlan_profile, all metagenomic samples need to be mapped against the species-specific database using → panphlan_map. panphlan_profile can then be applied to the folder containing the mapping results (map_results/ in the command below) to obtain the final table of strain-specific gene-family presence/absence profiles:

./panphlan_profile.py -c ecoli16 -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains

Main options

  • -c to specify the species database. Example: ecoli16 (Escherichia coli, version 2016) → download database
  • -i input directory of the panphlan_map results
  • --o_dna final result of gene-family presence/absence profiles of all detected strains
  • --add_strains to also include the gene-family presence/absence profiles of the reference genomes
  • --verbose to display progress information
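
When several species databases or input folders have to be profiled, the options above can be assembled from a script. The following is a minimal sketch, assuming panphlan_profile.py sits in the current directory; the helper name build_profile_command is hypothetical and not part of PanPhlAn, and it only uses the options listed above.

```python
import shlex


def build_profile_command(clade, input_dir, output_file,
                          add_strains=False, verbose=False,
                          script="./panphlan_profile.py"):
    """Return the argument list for one panphlan_profile run.

    Hypothetical convenience helper; it only combines the options
    documented above (-c, -i, --o_dna, --add_strains, --verbose).
    """
    cmd = [script, "-c", clade, "-i", input_dir, "--o_dna", output_file]
    if add_strains:
        cmd.append("--add_strains")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_profile_command("ecoli16", "map_results/",
                            "result_gene_presence_absence.csv",
                            add_strains=True)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one string) avoids shell quoting problems if a folder name contains spaces.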

Result: Strain-specific gene-family presence/absence profiles

The option --o_dna writes the final result: a binary profile matrix covering all samples in which a strain was detected. Gene-families are marked 1 when present and 0 when absent.

Example of a gene-family profile table in the result file result_gene_presence_absence.csv (output option --o_dna)

        sample01 sample04 sample05 sample08
g00001      1       0        1        0
g00002      0       1        1        1
g00003      0       0        0        1
g00004      1       1        1        1
  °°°

The presence/absence matrix can be loaded into mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains with heatmaps or PCoA plots, to investigate which gene-families are present in some strains but not in others, and to look for potential associations between diseases and the presence of specific genes.
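
As an illustration of such a downstream analysis, the sketch below reads a small table shaped like the example above, computes pairwise Jaccard distances between the sample strains, and lists the gene-families that are present in some samples but not in all. It uses only the Python standard library. The tab-separated layout and the inline table (including its gene-family IDs) are assumptions made for the example; adjust the delimiter to match your actual result file.

```python
import csv
import io
from itertools import combinations

# Small inline table for illustration; tab-separated layout is an assumption.
EXAMPLE = (
    "\tsample01\tsample04\tsample05\tsample08\n"
    "g00001\t1\t0\t1\t0\n"
    "g00002\t0\t1\t1\t1\n"
    "g00003\t0\t0\t0\t1\n"
    "g00004\t1\t1\t1\t1\n"
)


def read_profiles(handle, delimiter="\t"):
    """Return {sample: set of gene-families marked 1} from a profile table."""
    reader = csv.reader(handle, delimiter=delimiter)
    samples = next(reader)[1:]          # header row, first cell is empty
    present = {sample: set() for sample in samples}
    for row in reader:
        gene, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            if value == "1":
                present[sample].add(gene)
    return present


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


profiles = read_profiles(io.StringIO(EXAMPLE))

# Pairwise distances between the strains found in each sample.
for s1, s2 in combinations(profiles, 2):
    print(s1, s2, round(jaccard_distance(profiles[s1], profiles[s2]), 2))

# Gene-families present in every sample vs. only in some of them.
core = set.intersection(*profiles.values())
variable = sorted(set().union(*profiles.values()) - core)
print("variable gene-families:", variable)
```

For a real result file, replace io.StringIO(EXAMPLE) with open("result_gene_presence_absence.csv", newline=""). The resulting distance values can be passed to any clustering or ordination tool to draw heatmaps or PCoA plots.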

Help -h

./panphlan_profile.py -h
  -h, --help            show this help message and exit
  -i INPUT_DNA_FOLDER, --i_dna INPUT_DNA_FOLDER
                        Input directory of panphlan_map.py results, containing
                        SAMPLE.csv.bz2 files
  --i_bowtie2_indexes INPUT_BOWTIE2_INDEXES
                        Input directory of bowtie2 indexes
  -c CLADE_NAME, --clade CLADE_NAME
                        Panphlan species/clade database (e.g.: ecoli16)
  -o OUTPUT_FILE, --o_dna OUTPUT_FILE
                        Write gene family presence/absence matrix:
                        gene_presence_absence.csv
  --i_rna INPUT_RNA_FOLDER
                        RNA-seq: input directory of RNA mapping results
                        SAMPLE_RNA.csv.bz2
  --sample_pairs DNA_RNA_MAPPING
                        RNA-seq: list of DNA-RNA sequencing pairs from the
                        same biological sample.
  --th_zero MINIMUM_THRESHOLD
                        Gene family presence/absence threshold: lower are non-
                        present gene families.
  --th_present MEDIUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        present gene families.
  --th_multicopy MAXIMUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        multicopy gene families.
  --min_coverage MIN_COVERAGE_MEDIAN
                        Minimum coverage threshold, default: 2X
  --left_max LEFT_MAX   Strain presence/absence plateau curve threshold: left
                        max [1.25]
  --right_min RIGHT_MIN
                        Strain presence/absence plateau curve threshold: right
                        min [0.75]
  --rna_max_zeros RNA_MAX_ZEROES
                        Max accepted percent of zero coveraged gene-families
                        (default: <10 %).
  --rna_norm_percentile RNA_NORM_PERCENTILE
                        Percentile for normalizing RNA/DNA ratios
  --strain_similarity_perc SIMILARITY_PERCENTAGE
                        Minimum threshold (percentage) for genome length to
                        add a reference genome to presence/absence matrix
                        (default: 50).
  --np NON_PRESENCE_TOKEN
                        User-defined string to mark non-present genes. [NP]
  --nan NOT_A_NUMBER_TOKEN
                        User-defined string to mark multicopy and undefined
                        genes. [NaN]
  --o_covplot COV_PLOT_NAME
                        Filename for gene-family coverage plot.
  --o_covplot_normed NOR_PLOT_NAME
                        Filename for normalized gene-family coverage plot.
  --o_cov PANCOVERAGE_FILE
                        Write raw gene-family coverage matrix.
  --o_idx DNA_INDEX_FILE
                        Write gene-family plateau definitions (1, -1, -2, -3)
  --o_rna RNA_EXPRS_FILE
                        Write normalized gene-family transcription values
                        (RNA-seq).
  --strain_hit_genes_perc GENEHIT_PERC_PER_STRAIN
                        Write overlap of gene-families between samples-strains
                        and reference genomes.
  --i_cov INPUT_COV_MATRIX
                        Read coverage matrix (option --o_cov) for re-analysis
                        using other thresholds
  --num_genomes INPUT_COV_GENOMES
                        In addition to option --i_cov: number of reference
                        genomes
  --genome_avg_length INPUT_COV_LENGTH
                        In addition to option --i_cov: average number of gene-
                        families
  --add_strains         Add reference genomes to gene-family presence/absence
                        matrix.
  --interactive         Plot coverage curves to screen, and not to a file.
  --verbose             Display progress information.
  -v, --version         Prints the current PanPhlAn version and exits.
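
The three --th_* options in the help text split gene-families into coverage classes. The sketch below illustrates that idea only, assuming the thresholds are compared against a coverage value normalized so that single-copy gene-families sit near 1. The function name and the numeric threshold values are illustrative placeholders, not PanPhlAn's code or its documented defaults.

```python
def classify_gene_family(norm_cov, th_zero=0.25, th_present=0.5,
                         th_multicopy=1.5):
    """Assign a coverage class to one gene-family (illustrative only).

    norm_cov is assumed to be a normalized coverage value; the default
    thresholds are placeholders chosen for this example.
    """
    if norm_cov < th_zero:
        return "non-present"    # below --th_zero
    if norm_cov > th_multicopy:
        return "multicopy"      # above --th_multicopy
    if norm_cov > th_present:
        return "present"        # above --th_present
    return "undefined"          # between --th_zero and --th_present


for value in (0.05, 0.4, 1.0, 2.3):
    print(value, classify_gene_family(value))
```

Re-running the classification with different thresholds on a saved coverage matrix corresponds to the re-analysis workflow described for --o_cov and --i_cov in the help text.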

Read more at the PanPhlAn tutorial

See also:

RNA-seq: How to extract strain specific transcriptional activities?

Functional analysis: How to get gene-family sequences and annotations?

PanPhlAn profile options

→ How to adapt strain detection thresholds?

→ How to get gene-family presence/absence profiles of all reference genomes, without sample profiles?
