PPANINI Tutorial

PPANINI (Prioritization and Prediction of functional Annotation for Novel and Important genes via automated data Network Integration) is a computational pipeline that ranks genes using a combination of community parameters such as prevalence and abundance across samples. The resulting prioritized list of gene candidates can then be further analyzed using our visualization tools. PPANINI is available as a Bitbucket repository.

We provide support for PPANINI users via our Google group. Please feel free to send any questions to the group by posting directly or emailing ppanini-users@googlegroups.com.




1. Setup

1.2 Installation

The easiest way to install PPANINI is with pip:

$ pip install ppanini

Note: if you install PPANINI with the --user option, you need to add the user-level scripts directory to your PATH, for example: export PATH=$PATH:/Users/username/Library/Python/2.7/bin/
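The exact directory in the note above depends on the platform and Python version. The following is a minimal sketch for locating it, assuming a POSIX-style layout in which user-installed scripts live in a bin directory under the user base. The helper name user_scripts_dir is hypothetical and not part of PPANINI.

```python
import os
import site


def user_scripts_dir():
    """Return the directory where `pip install --user` is assumed to put scripts.

    Assumption: POSIX layout (<user base>/bin). Windows uses a different folder.
    """
    return os.path.join(site.getuserbase(), "bin")


if __name__ == "__main__":
    # Print a line that can be pasted into a shell profile.
    print("export PATH=$PATH:" + user_scripts_dir())
```

If the printed directory does not contain the ppanini executable, the installation likely used a different scheme and the path should be checked by hand.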

After installation from pip, you may optionally test your local PPANINI environment:

$ ppanini_test

Which yields (abbreviated):

test_annotate_genes (basic_tests_annotate_genes.TestAnnotateGenesBasicFunctions) ... ok
test_get_clusters (basic_tests_ppanini.TestPPANINIBasicFunctions)
Tests the function get_clusters ... ok
test_read_gene_table (basic_tests_ppanini.TestPPANINIBasicFunctions)
Tests the function read_gene_table ... Gene Table contains 1000 genes.
ok
...

1.3 Input

PPANINI prioritizes important genes, which may include uncharacterised genes, according to their properties in microbial communities. The input file is a gene abundance table containing annotated genes and their abundance values in counts per million (CPM).

  • -i or --input-table

Such tables can be obtained using the ppanini_abundance_table script described in Section 3 (Recipe to generate abundance table).


2. Quick Demo

2.1 Input file

The input file is a table of annotated gene abundances across samples. You can obtain a copy of demo input by right-clicking this link and selecting "save link as":

Note: The first line in this input is optional; it indicates which niche each sample comes from. When it is present, the ranking of important genes is calculated differently (please refer to the manuscript for details). The second line, the header, is required. The first column contains the annotated gene names and the other columns correspond to the samples, with the sample names given in the header.
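The layout described in the note can be illustrated with a small parser. This is a sketch only: it assumes the table is tab-delimited and that the optional first line starts with "#Niche" (as in the example table in Section 3). The function name parse_gene_table is hypothetical and is not PPANINI's own reader.

```python
import io


def parse_gene_table(handle):
    """Parse a tab-delimited gene table with an optional '#Niche' first line.

    Returns (niches, samples, abundances). niches is None when the niche
    line is absent; abundances maps gene name -> list of floats.
    """
    niches, samples, abundances = None, None, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        if samples is None and fields[0].lower().startswith("#niche"):
            niches = fields[1:]    # optional line: one niche label per sample
        elif samples is None:
            samples = fields[1:]   # required header line: sample names
        else:
            abundances[fields[0]] = [float(v) for v in fields[1:]]
    return niches, samples, abundances


# Tiny made-up table, for illustration only.
demo = io.StringIO(
    "#Niche\tSOIL\tSTOOL\n"
    "#Genes\tS1\tS2\n"
    "geneA|UniRef90_unknown\t0.5\t0.0\n"
)
niches, samples, abundances = parse_gene_table(demo)
print(niches, samples, abundances)
```

When the niche line is missing, the first line is treated as the header and niches stays None, which mirrors the "optional first line, required second line" rule above.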

2.2 Running PPANINI

PPANINI can be used with gene clustering or without gene clustering. Gene clustering provides a way to group similar unannotated sequences (based on 97% homology in translated sequences), and calculate the importance of the gene clusters rather than the individual genes. This method increases the metagenomic prevalence of unannotated genes and thus their associated probability to be ranked higher in the PPANINI "importance" scale.

2.2.1 Without gene clustering

To execute PPANINI without gene clustering, you can use the demo input file described above and run the following:

$ ppanini -i genetable.txt -o ppanini_output --bypass-clustering

Which yields:

Reading the gene table...
Gene Table contains 24621 genes.
DONE
Getting centroids...
DONE
Getting centroids table...
DONE
Getting prevalence abundance...
DONE
Prioritize centroids...
DONE

2.2.2 With gene clustering

To execute PPANINI with gene clustering, you will need to provide a gene catalog (a FASTA file containing sequences for all of the unannotated genes in the input table). A demo gene catalog has been provided for your use and you can obtain a copy by right-clicking the link and selecting "save link as":

Run the command below. When --gene-catalog is used, at least one of --usearch or --vsearch must be provided with a path:

ppanini -i genetable.txt -o ppanini_output --gene-catalog samples.fasta --usearch path/to/usearch

which yields:

Seqs  24461 (24.5k)
Clusters  21998 (22.0k)
Max size  5
Avg size  1.1
Min size  1
Singletons  19564 (19.6k), 80.0% of seqs, 88.9% of clusters
Max mem  149Mb
Time  5.00s
Throughput  4892.2 seqs/sec.
DONE
Getting centroids table...
DONE
Getting prevalence abundance...
DONE
Prioritize centroids...
DONE

2.3 Sample Output

The output of PPANINI is a list of important genes (centroids) with their prevalence, abundance, and PPANINI score. Several files are generated at the end of the analysis. The file containing the important genes is named XX_imp_centroids_prev_abund.txt, where XX is the name provided with -o.

A demo output should look similar to the figure below.

  • Centroids: Gene cluster name for the important genes. This can be the UniRef ID the genes have been annotated with or the gene centroid name generated if gene clustering was used (refer to Running PPANINI for more information on gene clustering).
  • beta_prevalence: Prevalence for the gene cluster across the different niches. If niche information was not provided, this value is the same as the alpha prevalence.
  • mean_abundance: Mean abundance for the gene cluster across the samples (where the gene was present). If niche information was provided, this is the maximum mean abundance across the different niches.
  • alpha_prevalence_X: Prevalence of the gene cluster in Niche 'X'.
  • ppanini_score_X: Ranking score generated by PPANINI to score the gene cluster importance.

An example output file:

#Centroids      ppanini_score   mean_abundance  prevalence
UniRef90_F1W948 14.338243219    20730.640328    0.2
UniRef90_Q49091 14.676476003    23003.8583951   0.2
UniRef90_D5V832 14.5005410805   22046.7922455   0.2
UniRef90_D5VC37 14.1736513947   19900.4962421   0.2
UniRef90_D5VC36 14.1736513947   19900.4962421   0.2
UniRef90_Q538B5 14.5005410805   22046.7922455   0.2
UniRef90_Q59514 14.676476003    23003.8583951   0.2
UniRef90_K4HMX4 14.9895934702   464679.220252   0.2
UniRef90_D5V8D7 14.676476003    23003.8583951   0.2
UniRef90_D5V8F0 14.676476003    23003.8583951   0.2
UniRef90_D5V950 14.958666704    25029.1451322   0.2
UniRef90_L0WH87 14.8184441082   23394.280222    0.2
UniRef90_D5VAP9 14.8965589789   24713.5507971   0.2
UniRef90_Q4FUF2 14.676476003    23003.8583951   0.2
UniRef90_L0WH82 14.8184441082   23394.280222    0.2
UniRef90_D5V831 14.8965589789   24713.5507971   0.2
UniRef90_D5V8X7 14.676476003    23003.8583951   0.2
UniRef90_D5VAE6 14.4196758873   21449.2559823   0.2
UniRef90_B0UUG7 14.2232726958   20467.1978976   0.2
UniRef90_D5V9T2 14.1238192383   19543.3055248   0.2
UniRef90_F1WFV3 14.0904799942   19115.2718003   0.2
UniRef90_D5V8E4 14.676476003    23003.8583951   0.2
UniRef90_L0WJB0 14.338243219    20730.640328    0.2
UniRef90_Q4FQ37 14.4196758873   21449.2559823   0.2
UniRef90_D5VBA5 14.2726844769   20657.995656    0.2
UniRef90_D5V8R5 14.8965589789   24713.5507971   0.2
UniRef90_T1R5W4 15.0204359673   535320.779748   0.2
UniRef90_D5VDF5 14.5487902709   22158.9274428   0.2
UniRef90_D5VAU1 14.2726844769   20657.995656    0.2
UniRef90_D5VAF4 14.4196758873   21449.2559823   0.2
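To make the prevalence and mean abundance columns described above concrete, here is a minimal sketch of how they could be computed for one gene. It assumes a gene counts as "present" in a sample when its abundance is greater than zero, and it does not reproduce the PPANINI score itself (see the manuscript for that). All function names and values are hypothetical.

```python
def alpha_prevalence(values):
    """Fraction of samples in which the gene is present (abundance > 0)."""
    return sum(1 for v in values if v > 0) / float(len(values))


def mean_abundance_present(values):
    """Mean abundance over the samples where the gene is present; 0.0 if none."""
    present = [v for v in values if v > 0]
    return sum(present) / len(present) if present else 0.0


def per_niche(values, niches, func):
    """Apply func separately to the values belonging to each niche."""
    groups = {}
    for value, niche in zip(values, niches):
        groups.setdefault(niche, []).append(value)
    return {niche: func(vals) for niche, vals in groups.items()}


# Made-up abundances for one gene across five samples.
row = [0.0, 4.0, 2.0, 0.0, 6.0]
niches = ["SOIL", "SOIL", "SOIL", "STOOL", "STOOL"]

print(alpha_prevalence(row))                      # present in 3 of 5 samples
print(per_niche(row, niches, alpha_prevalence))   # one value per niche
# With niche information: the maximum of the per-niche mean abundances.
print(max(per_niche(row, niches, mean_abundance_present).values()))
```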

3. Recipe to generate abundance table

The ppanini_abundance_table script in the package creates the input gene abundance table for PPANINI. Its options are listed below, followed by the steps to run it:

usage: ppanini_abundance_table [-h] -m MAPPER_FILE [--basename BASENAME]
                           [--bypass-abundance] [--bypass-annotation]
                           [--bypass-clust] [--bypass-write-table]
                           [--usearch USEARCH] [--vsearch VSEARCH]
                           [--diamond DIAMOND] [--rapsearch RAPSEARCH]
                           [--threads THREADS] [--uniref90 UNIREF90]
                           [--to-normalize] [--log-level LOG_LEVEL]

optional arguments:
-h, --help            show this help message and exit
-m MAPPER_FILE, --mapper-file MAPPER_FILE
                    Mapper file containing paths to data
--basename BASENAME   BASENAME for all the output files
--bypass-abundance    Bypass quantifying abundance
--bypass-annotation   Bypass annotating genes
--bypass-clust        Bypass annotating genes
--bypass-write-table  Bypass writing table
--usearch USEARCH     Path to USEARCH
--vsearch VSEARCH     Path to VSEARCH
--diamond DIAMOND     Path to DIAMOND
--rapsearch RAPSEARCH
                    Path to RAPSEARCH
--threads THREADS     Number of threads
--uniref90 UNIREF90   UniRef90 INDEX file
--to-normalize        Default HUMAnN2 table; if sam-idxstats table; enable
--log-level LOG_LEVEL
                    Choices: [DEBUG, INFO, WARNING, ERROR, CRITICAL]

Step 1: Mapper File

A PPANINI input table consists of annotated genes with their abundances across samples. You need two crucial pieces for the input table to be constructed:

  • Gene abundances
  • Gene annotations

If you have both these pieces, then you can provide a mapper file that links sample names to their corresponding gene abundance tables and annotation tables. A sample mapper file looks like:

#SAMPLE     ABUNDANCE_TABLES        ANNOTATION
SAMPLE_ID1  <PATH_TO_SAMPLE_ID1_ABUNDANCE_TABLE>    <PATH_TO_SAMPLE_ID1_ANNOTATION_TABLE>
SAMPLE_ID2  <PATH_TO_SAMPLE_ID2_ABUNDANCE_TABLE>    <PATH_TO_SAMPLE_ID2_ANNOTATION_TABLE>

Alternatively, if you have a SAM file or BAM file instead of gene abundance table, you can use the SAMS or BAMS header instead of ABUNDANCE_TABLES.

If you have gene annotations, but don't have gene abundances:

  • Gene sequences (FAAS for translated sequences; if the sequences are in nucleotide format, replace FAAS with FNAS) and reads

#SAMPLE     FAAS    READS   ANNOTATION
SAMPLE_ID1  <PATH_TO_SAMPLE_ID1_FAAS>    <PATH_TO_SAMPLE_ID1_READS>      <PATH_TO_SAMPLE_ID1_ANNOTATION_TABLE>
SAMPLE_ID2  <PATH_TO_SAMPLE_ID2_FAAS>    <PATH_TO_SAMPLE_ID2_READS>      <PATH_TO_SAMPLE_ID2_ANNOTATION_TABLE>

If you don't have gene annotations or gene abundances:

  • Contig assemblies, gff3 files, and reads

#SAMPLE     CONTIG_ASSEMBLIES       READS   GFF3S   ANNOTATION
SAMPLE_ID1  <PATH_TO_SAMPLE_ID1_CONTIGS>    <PATH_TO_SAMPLE_ID1_READS>      <PATH_TO_SAMPLE_ID1_GFF3>       <PATH_TO_SAMPLE_ID1_ANNOTATIONS>
SAMPLE_ID2  <PATH_TO_SAMPLE_ID2_CONTIGS>    <PATH_TO_SAMPLE_ID2_READS>      <PATH_TO_SAMPLE_ID2_GFF3>       <PATH_TO_SAMPLE_ID2_ANNOTATIONS>

  • Gene sequences (FAAs for translated sequences. If the sequences are in the nucleotide format, please replace FAAs with FNAS), and reads

#SAMPLE     FAAS    READS
SAMPLE_ID1  <PATH_TO_SAMPLE_ID1_FAAS>       <PATH_TO_SAMPLE_ID1_READS>
SAMPLE_ID2  <PATH_TO_SAMPLE_ID2_FAAS>       <PATH_TO_SAMPLE_ID2_READS>

Additionally, you may add information about which NICHE each sample originates from by adding the NICHE header as shown below:

#SAMPLE     FAAS    READS   NICHE
SAMPLE_ID1  <PATH_TO_SAMPLE_ID1_FAAS>       <PATH_TO_SAMPLE_ID1_READS>      SOIL
SAMPLE_ID2  <PATH_TO_SAMPLE_ID2_FAAS>       <PATH_TO_SAMPLE_ID2_READS>      STOOL
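Mapper files like the ones above can be written by hand or generated from a script. The sketch below shows one way to generate them, assuming the columns are tab-separated (the tutorial does not state the delimiter). The function name write_mapper and the file names are placeholders for illustration.

```python
import csv
import io


def write_mapper(handle, columns, rows):
    """Write a mapper file: a '#SAMPLE' header line, then one row per sample.

    Assumption: columns are separated by tabs.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#SAMPLE"] + list(columns))
    for sample_id, values in rows:
        if len(values) != len(columns):
            raise ValueError("row for %s does not match the header" % sample_id)
        writer.writerow([sample_id] + list(values))


# Placeholder paths; replace them with the real locations of your files.
buffer = io.StringIO()
write_mapper(
    buffer,
    ["FAAS", "READS", "NICHE"],
    [
        ("SAMPLE_ID1", ["sample1.faa", "sample1.fastq", "SOIL"]),
        ("SAMPLE_ID2", ["sample2.faa", "sample2.fastq", "STOOL"]),
    ],
)
print(buffer.getvalue())
```

Checking that every row has as many values as the header has columns catches the most common mistake in hand-edited mapper files, a missing or extra field.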

Step 2: Running ppanini_abundance_table

Once the mapper file has been successfully created, you can create the annotated gene abundance table using the command below:

ppanini_abundance_table -m <path_to_mapper_file>

This will create the PPANINI input table. Unless a name is specified with the --basename flag, the file is named <name of the input file>*_ppanini.txt.

The above command will only work if the genes have already been annotated. Otherwise, you will need to create and specify the database index for UniRef90, as well as the program that will be used for alignment (i.e. DIAMOND or RAPSearch2), as below:

ppanini_abundance_table -m <path_to_mapper_file> --uniref90 <path_to_uniref90_db> --diamond <path_to_diamond>

OR

ppanini_abundance_table -m <path_to_mapper_file> --uniref90 <path_to_uniref90_db> --rapsearch <path_to_rapsearch>

As the gene catalog can be extensive, the program by default uses USEARCH or VSEARCH to cluster genes based on 97% homology in sequences. This may require you to specify where the programs are located. Please note that you may bypass this step by using the --bypass-clust option.

ppanini_abundance_table -m <path_to_mapper_file> --usearch <path_to_usearch> --uniref90 <path_to_uniref90_db> --rapsearch <path_to_rapsearch>

OR

ppanini_abundance_table -m <path_to_mapper_file> --vsearch <path_to_vsearch> --uniref90 <path_to_uniref90_db> --rapsearch <path_to_rapsearch>
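When the script is called from a larger workflow, the command variants above can be assembled programmatically. The following is a sketch under stated assumptions: it uses only flags listed in the usage message, the helper name abundance_table_command is hypothetical, and the check that an aligner is accompanied by a UniRef90 index reflects the text above rather than any documented behaviour of the tool.

```python
def abundance_table_command(mapper, uniref90=None, diamond=None, rapsearch=None,
                            usearch=None, vsearch=None, bypass_clust=False):
    """Build the argument list for ppanini_abundance_table."""
    if (diamond or rapsearch) and not uniref90:
        # Assumption taken from the text: annotation needs a UniRef90 index.
        raise ValueError("an aligner was given without a UniRef90 index")
    cmd = ["ppanini_abundance_table", "-m", mapper]
    optional = [
        ("--uniref90", uniref90),
        ("--diamond", diamond),
        ("--rapsearch", rapsearch),
        ("--usearch", usearch),
        ("--vsearch", vsearch),
    ]
    for flag, value in optional:
        if value:
            cmd += [flag, value]
    if bypass_clust:
        cmd.append("--bypass-clust")
    return cmd


# Placeholder paths, for illustration only.
cmd = abundance_table_command(
    "mapper.txt",
    uniref90="path/to/uniref90_db",
    diamond="path/to/diamond",
    bypass_clust=True,
)
print(" ".join(cmd))
# To execute it (requires PPANINI to be installed):
# import subprocess; subprocess.check_call(cmd)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.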

Step 3: Output: PPANINI Input Table

The generated output should look like:

#Niche      SOIL    SOIL    SOIL    STOOL   STOOL
#Genes      SRS056210       SRS011132       SRS019597       SRS015996       SRS016033
SRS056210.50|UniRef90_unknown       0.0670926517572 0.0     0.0     0.0     0.0
SRS056210.16|UniRef90_unknown       0.0657276995305 0.0     0.0     0.0     0.0
SRS056210.87|UniRef90_unknown       0.104700854701  0.0     0.0     0.0     0.0
SRS056210.41|UniRef90_unknown       0.0942028985507 0.0     0.0     0.0     0.0
SRS056210.9|UniRef90_unknown        0.0594059405941 0.0     0.0     0.0     0.0
