PanPhlAn RNA-seq

In-vivo transcriptional activity of individual strains of a species

When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In the first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA samples and converted into log- and median-normalized RNA/DNA ratios.

a) Define DNA-RNA sample pairs

PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample IDs from the same biological sample. The pairs are defined in a tab-separated text file in which the first column specifies the DNA sample ID and the second column the corresponding RNA sample ID, both without file extension.

cat sample_pairs_DNA_RNA.txt

 # DNA   RNA
 sampleA   sampleA_RNA
 sampleB   sampleB_RNA
 sampleC   sampleC_RNA
 sampleD   sampleD_RNA
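
Before running the pipeline, it can help to check that the pairs file really has two tab-separated columns per line. The following is a minimal sketch, not part of PanPhlAn: the helper name `read_sample_pairs` is hypothetical, and it assumes that blank lines and lines starting with `#` (such as the header shown above) carry no sample pair.

```python
import io


def read_sample_pairs(handle):
    """Return a list of (dna_id, rna_id) tuples from a tab-separated pairs file."""
    pairs = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        # Assumption: blank lines and "#" lines (such as the header) carry no pair.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 2 or not all(fields):
            raise ValueError(
                f"line {line_number}: expected two tab-separated sample IDs, got {line!r}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


# Small in-memory example standing in for sample_pairs_DNA_RNA.txt.
example = io.StringIO("# DNA\tRNA\nsampleA\tsampleA_RNA\nsampleB\tsampleB_RNA\n")
pairs = read_sample_pairs(example)
print(pairs)
```

For a real file, the same helper could be called as `read_sample_pairs(open("sample_pairs_DNA_RNA.txt"))`.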

b) Download the species pangenome database

For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
ecoli16 Escherichia coli (version 2016)

c) Mapping

Both DNA and RNA samples are mapped in the same way, but the results are saved in different folders.

./ -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA

./ -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA

We now have mapping results for both DNA and RNA samples, which are required in the next step for extracting gene-family transcript profiles.

ls DNA/

ls RNA/ 
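
Since every DNA sample needs a mapped RNA partner, a quick consistency check of the two folders against the sample pairs may save a failed run later. The sketch below is an illustration only: the helper `missing_mapping_results` is hypothetical, and it assumes that each mapping result file name begins with its sample ID. The `.out` file names in the demonstration are made up, so adjust the matching rule to the file names you actually see in `DNA/` and `RNA/`.

```python
import os
import tempfile


def missing_mapping_results(pairs, dna_dir, rna_dir):
    """Return (folder_label, sample_id) for every sample without a result file."""
    def has_result(folder, sample_id):
        # Assumption: a result file name begins with the sample ID.
        return any(name.startswith(sample_id) for name in os.listdir(folder))

    missing = []
    for dna_id, rna_id in pairs:
        if not has_result(dna_dir, dna_id):
            missing.append(("DNA", dna_id))
        if not has_result(rna_dir, rna_id):
            missing.append(("RNA", rna_id))
    return missing


# Demonstration in a temporary directory; the ".out" file names are made up.
with tempfile.TemporaryDirectory() as root:
    dna_dir = os.path.join(root, "DNA")
    rna_dir = os.path.join(root, "RNA")
    os.mkdir(dna_dir)
    os.mkdir(rna_dir)
    for path in (os.path.join(dna_dir, "sampleA.out"),
                 os.path.join(dna_dir, "sampleB.out"),
                 os.path.join(rna_dir, "sampleA_RNA.out")):
        open(path, "w").close()
    pairs = [("sampleA", "sampleA_RNA"), ("sampleB", "sampleB_RNA")]
    missing = missing_mapping_results(pairs, dna_dir, rna_dir)

print(missing)
```

Note that a plain prefix match would also accept a file of a different sample whose ID merely starts with the same characters, so a stricter rule may be preferable once the real naming scheme is known.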

d) Get transcription rates

Now, the mapping results can be used to extract the transcription rates of strain-specific genes.

./ -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv

As input, we provide the mapping results of the DNA and RNA samples and the sample-pair text file that defines which DNA and RNA samples belong together.
As a result, we get a table of transcription rates for each pangenome gene-family in each sample. Gene-families not present in the sample-specific strain are marked as NP (not present). Gene-families that could not be clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN includes multi-copy genes as well as genes at the borderline between presence and absence. Normalized transcription rates are positive values centered at 1. A value smaller than 1 indicates low transcriptional activity, and a value larger than 1 indicates high transcriptional activity, relative to all other gene-family activities (median normalized).
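
To make the idea of "centered at 1" concrete, here is a minimal sketch of median normalization of RNA/DNA ratios. It is not PanPhlAn's implementation: the function name, the coverage dictionaries, and the numbers are invented for illustration, and the sketch covers only the median step, not any log transformation.

```python
from statistics import median


def normalized_transcription_rates(dna_coverage, rna_coverage):
    """Divide each gene-family's RNA/DNA ratio by the median ratio of the sample."""
    ratios = {}
    for gene, dna in dna_coverage.items():
        # Gene-families without DNA coverage have no defined RNA/DNA ratio.
        if dna > 0:
            ratios[gene] = rna_coverage.get(gene, 0.0) / dna
    center = median(ratios.values())
    if center == 0:
        raise ValueError("median RNA/DNA ratio is zero; cannot normalize")
    return {gene: ratio / center for gene, ratio in ratios.items()}


# Invented coverage values for three gene-families of one sample.
dna = {"g000002": 10.0, "g000003": 10.0, "g000004": 20.0}
rna = {"g000002": 10.0, "g000003": 20.0, "g000004": 80.0}
rates = normalized_transcription_rates(dna, rna)
print(rates)
```

In this toy example the raw ratios are 1, 2 and 4. Dividing by their median (2) gives one gene-family below 1, one at exactly 1, and one above 1, which is how values in the result table can be read.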

Example result of transcription rates:

cat results_transcription_rates.csv

         sample_A  sample_B  sample_C  sample_D
g000002   0.814     0.779     0.000     NaN
g000003   0.982     0.770     1.183     0.000
g000004   0.863     0.917     1.219     NaN
g000005   NP        0.000     0.000     NaN
g000006   0.773     0.000     NP        0.000
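
For downstream analysis, the three kinds of entries (numbers, NP, NaN) need to be kept apart. The sketch below parses a table laid out like the example above. It assumes whitespace-separated columns, which may differ from the delimiter in your actual output file, and the helper names `parse_cell`, `parse_rates_table` and `count_by_status` are hypothetical.

```python
import io
import math


def parse_cell(text):
    """Map a table entry to None (NP), float('nan') (NaN), or a float rate."""
    if text == "NP":
        return None
    return float(text)  # float("NaN") yields a floating-point NaN


def parse_rates_table(handle):
    """Return {gene_family: {sample: value}} from a whitespace-separated table."""
    rows = [line.split() for line in handle if line.strip()]
    samples = rows[0]  # header row holds sample names only
    table = {}
    for row in rows[1:]:
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(cells)}")
        table[gene] = {s: parse_cell(c) for s, c in zip(samples, cells)}
    return table


def count_by_status(table, sample):
    """Count NP, NaN and numeric entries for one sample."""
    counts = {"NP": 0, "NaN": 0, "value": 0}
    for values in table.values():
        entry = values[sample]
        if entry is None:
            counts["NP"] += 1
        elif math.isnan(entry):
            counts["NaN"] += 1
        else:
            counts["value"] += 1
    return counts


# The example table from above, held in memory instead of read from disk.
text = """\
         sample_A  sample_B  sample_C  sample_D
g000002   0.814     0.779     0.000     NaN
g000003   0.982     0.770     1.183     0.000
g000004   0.863     0.917     1.219     NaN
g000005   NP        0.000     0.000     NaN
g000006   0.773     0.000     NP        0.000
"""
table = parse_rates_table(io.StringIO(text))
print(count_by_status(table, "sample_D"))
```

Keeping NP as `None` and NaN as a floating-point NaN preserves the distinction between "gene-family absent from the strain" and "presence could not be decided", which would be lost if both were treated as missing values.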

See also:

→ Error: RuntimeWarning: invalid value encountered in double_scalars