In-vivo transcriptional activity of individual strains of a species
When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In the first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA sample and converted into log- and median-normalized RNA/DNA ratios.
a) Define DNA-RNA sample pairs
PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample IDs from the same biological sample. The pairs are defined in a tab-separated text file in which the first column gives the DNA sample ID and the second column the corresponding RNA sample ID, both without file extension.
cat sample_pairs_DNA_RNA.txt
# DNA    RNA
sampleA  sampleA_RNA
sampleB  sampleB_RNA
sampleC  sampleC_RNA
sampleD  sampleD_RNA
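Before starting the mapping, it can help to check that the pairs file is well formed. The following is a minimal sketch, not part of PanPhlAn: the function name `read_sample_pairs` is hypothetical, and it assumes that lines starting with `#` (such as the `# DNA RNA` line in the example) are comments to be skipped.

```python
import csv
import io


def read_sample_pairs(handle):
    """Return a list of (dna_id, rna_id) tuples from a tab-separated pairs file.

    Assumption: blank lines and lines starting with '#' are skipped.
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        dna_id, rna_id = (field.strip() for field in row)
        if not dna_id or not rna_id:
            raise ValueError(f"empty sample ID in {row!r}")
        pairs.append((dna_id, rna_id))
    return pairs


# In-memory example mirroring the first two pairs of the file shown above.
example = "# DNA\tRNA\nsampleA\tsampleA_RNA\nsampleB\tsampleB_RNA\n"
pairs = read_sample_pairs(io.StringIO(example))
print(pairs)
```

For a real file, the same function can be called with `open("sample_pairs_DNA_RNA.txt", newline="")` instead of the in-memory example.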
b) Download the species pangenome database
For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
ecoli16 Escherichia coli (version 2016)
c) Map DNA and RNA samples
Both DNA and RNA samples are processed in the same way by panphlan_map.py, but the results are saved in different folders.
./panphlan_map.py -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA
./panphlan_map.py -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA
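The two commands above map sampleA only; the same pattern has to be repeated for every DNA and RNA sample in the pairs file. A small sketch of how those commands could be generated is shown below. It uses only the `-c`, `-i` and `-o` options from the example, assumes that all inputs live in `Samples/` with a `.tar.gz` extension as above, and the helper name `mapping_commands` is hypothetical. The sketch only prints the commands and does not run them.

```python
import shlex


def mapping_commands(pairs, clade="ecoli16", sample_dir="Samples", ext=".tar.gz"):
    """Build one panphlan_map.py command per DNA sample and per RNA sample."""
    commands = []
    for dna_id, rna_id in pairs:
        # DNA results go to DNA/, RNA results to RNA/, as in the example above.
        for sample_id, out_dir in ((dna_id, "DNA"), (rna_id, "RNA")):
            commands.append([
                "./panphlan_map.py",
                "-c", clade,
                "-i", f"{sample_dir}/{sample_id}{ext}",
                "-o", f"{out_dir}/{sample_id}",
            ])
    return commands


pairs = [("sampleA", "sampleA_RNA"), ("sampleB", "sampleB_RNA")]
for cmd in mapping_commands(pairs):
    print(shlex.join(cmd))
```

To execute a command instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)`.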
We now have mapping results for both DNA and RNA samples, which are required in the next step for extracting gene-family transcript profiles.
ls DNA/
sampleA_ecoli16.csv.bz2
sampleB_ecoli16.csv.bz2
sampleC_ecoli16.csv.bz2
sampleD_ecoli16.csv.bz2

ls RNA/
sampleA_RNA_ecoli16.csv.bz2
sampleB_RNA_ecoli16.csv.bz2
sampleC_RNA_ecoli16.csv.bz2
sampleD_RNA_ecoli16.csv.bz2
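Since the profiling step needs a mapping result for both members of every pair, a quick check for missing files can save a failed run. The sketch below assumes the file naming seen in the listing above, that is `<sample-ID>_<database>.csv.bz2`; this pattern is inferred from the example, and the helper names are hypothetical.

```python
from pathlib import Path


def expected_mapping_files(pairs, clade="ecoli16", dna_dir="DNA", rna_dir="RNA"):
    """List the mapping result files expected for each DNA/RNA sample pair."""
    files = []
    for dna_id, rna_id in pairs:
        files.append(Path(dna_dir) / f"{dna_id}_{clade}.csv.bz2")
        files.append(Path(rna_dir) / f"{rna_id}_{clade}.csv.bz2")
    return files


def missing_mapping_files(pairs, **kwargs):
    """Return the expected mapping files that do not exist on disk."""
    return [p for p in expected_mapping_files(pairs, **kwargs) if not p.is_file()]


pairs = [("sampleA", "sampleA_RNA"), ("sampleB", "sampleB_RNA")]
for path in missing_mapping_files(pairs):
    print(f"missing: {path}")
```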
d) Get transcription rates
Now, panphlan_profile.py can be used to extract the transcription rates of strain-specific genes.
./panphlan_profile.py -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv
As input, we provide the mapping results of the DNA and RNA samples and the sample-pair text file that defines which DNA and RNA samples belong together.
As a result, we get a table of transcription rates for each pangenome gene-family in each sample. Gene-families not present in the sample-specific strain are marked as NP (not present). Gene-families that could not be clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN covers multi-copy genes as well as genes at the borderline between presence and absence. Normalized transcription rates are positive values centered at 1. A value smaller than 1 indicates low transcriptional activity, and a value larger than 1 indicates high transcriptional activity, relative to all other gene-family activities (median normalized).
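To illustrate what "centered at 1" means, the sketch below divides per-gene RNA/DNA coverage ratios by their median, so that the median gene-family gets the value 1. This is not PanPhlAn's implementation: the coverage dictionaries and the function name are hypothetical, and the log transform mentioned in the introduction is left out because its exact form is not described here.

```python
from statistics import median


def median_normalized_ratios(dna_cov, rna_cov):
    """Return RNA/DNA ratios divided by their median, keyed by gene-family.

    Gene-families without DNA coverage are skipped, since no ratio can be
    formed for them. Gene-families without RNA coverage get a ratio of 0.
    """
    ratios = {
        gene: rna_cov.get(gene, 0.0) / dna
        for gene, dna in dna_cov.items()
        if dna > 0
    }
    if not ratios:
        raise ValueError("no gene-family with DNA coverage")
    center = median(ratios.values())
    if center <= 0:
        raise ValueError("median ratio is not positive; cannot normalize")
    return {gene: ratio / center for gene, ratio in ratios.items()}


# Made-up coverage values, for illustration only.
dna = {"g1": 10.0, "g2": 10.0, "g3": 20.0}
rna = {"g1": 10.0, "g2": 20.0, "g3": 80.0}
print(median_normalized_ratios(dna, rna))
```

In this made-up example the raw ratios are 1, 2 and 4, so after dividing by the median (2) the values are 0.5, 1 and 2: g1 is transcribed below and g3 above the typical gene-family.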
Example result of transcription rates:
cat results_transcription_rates.csv
         sample_A  sample_B  sample_C  sample_D
g000002  0.814     0.779     0.000     NaN
g000003  0.982     0.770     1.183     0.000
g000004  0.863     0.917     1.219     NaN
g000005  NP        0.000     0.000     NaN
g000006  0.773     0.000     NP        0.000
°°°
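When loading this table for downstream analysis, NP and NaN need to be kept apart, since they mean different things. The following sketch is one way to do that, assuming the file is whitespace- or tab-separated with sample names in the first line, as in the example above. The function names are hypothetical; NP is mapped to `None` and NaN to a float NaN.

```python
import io
import math


def parse_rates(handle):
    """Parse a transcription-rate table into {gene: {sample: value}}.

    'NP' becomes None (gene-family not present in the strain);
    'NaN' becomes float('nan') (presence could not be decided).
    """
    header = handle.readline().split()
    table = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and continuation markers
        gene, values = fields[0], fields[1:]
        samples = header[-len(values):]  # tolerate a label above the gene column
        table[gene] = {
            sample: None if value == "NP" else float(value)
            for sample, value in zip(samples, values)
        }
    return table


def usable_values(table, sample):
    """Return the rates of one sample that are neither NP nor NaN."""
    return {
        gene: row[sample]
        for gene, row in table.items()
        if row.get(sample) is not None and not math.isnan(row[sample])
    }


# Two rows copied from the example result above.
example = """\tsample_A\tsample_B\tsample_C\tsample_D
g000002\t0.814\t0.779\t0.000\tNaN
g000005\tNP\t0.000\t0.000\tNaN
"""
table = parse_rates(io.StringIO(example))
print(usable_values(table, "sample_B"))
```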