PHYLOmap /

Filename Size Date modified Message
18.3 KB
1.1 MB
2.1 KB
21.1 KB
4.2 KB

Introduction

PHYLOmap utilises interactive Tree of Life (iTOL) API for drawing heatmaps with phylogenetic trees for abundance tables generated from the frequently used taxonomic profiling software (such as CREST, RDP, MEGAN, MetaPhlan, USEARCH, VSEARCH, QIIME, and AmpliconNoise) in metagenomic surveys.

The iTOL's website allows functionality for the display and manipulation of phylogenetic trees in the Newick format. It offers standard and circular tree representation, several functions on the website that allow datasets to be uploaded (using http://itol.embl.de/batch_uploader.cgi with a standard POST request) with their corresponding trees, and the ability to customise the tree displays in different ways. Each tree display can then be exported (using http://itol.embl.de/batch_downloader.cgi with POST or GET request) to several graphical formats, both bitmap and vector based. Moreover, by parametrisation of the web request, the exported tree displays can be modified to exclude user-selected leaf nodes and to collapse internal nodes. An important feature of the website is a tree generator based on NCBI taxonomy (http://itol.embl.de/other_trees.shtml) through which generation of complete sub-trees for user-provided list of NCBI’s taxonomy IDs or scientific names is made possible. Thus, PHYLOmap takes an abundance table with an optional tree, processes the data on iTOL, downloads the visualisation to the local computer, and all this is accomplished using one command-line instruction.

Citing PHYLOmap

A manuscript is currently in preparation. For now, please cite as PHYLOmap: https://bitbucket.org/umerijaz/phylomap/src

Acknowledgements

This work is funded by TSB-funded collaboration with Unilever, Development of instrumental and bioinformatic pipelines to accelerate commercial applications of metagenomics approaches. Umer's work is supported by NERC IRF NE/L011956/1, Undestanding microbial communities through in situ environmental 'omic data synthesis.

Usage Information

To use PHYLOmap, place the software along with the accompanying utilities in the user's home directory at ~/bin location. The usage information for PHYLOmap is as follows:

$ python ~/bin/PHYLOmap.py
Usage:
    python PHYLOmap.py -i <input_file> -o <output_folder> [OPTIONS]

Options:
     -s (--sqrt_transformation)     Apply square-root transformation to the data before generating the heatmap (Default: False)
     -w (--heatmap_width) NUM       Width of heatmap in pixels (Default: 20)
     -l (--label) STR               Label of the figure (Default: "PHYLOmap_label")
     -t (--title_tree) STR          Title of the generated tree (Default: "PHYLOmap_tree")
     -x (--min_color) STR           Minimum color in the heatmap (Default: "#FFFFFF")
     -y (--mid_color) STR           Mid color in the heatmap (Default: "#99CCFF")
     -z (--max_color) STR           Maximum color in the heatmap (Default: "#007FFF")
     -r (--tree_file) STR           Newick tree file incase you dont want to use NCBI's taxonomy (Default:'')
     --display_mode STR             Tree display mode; "circular" or "normal" (Default: "circular")
     --font_size NUM                Font size to be used for leaf labels (Default: 10)
     --line_width NUM               Line width in pixels (Default: 1)
     --scale_factor NUM             Default horizontal tree scale will be multiplied with this value (Default: 0.5)

Abundance tables with NCBI taxonomy

The program accepts the abundance table as a comma-delimited file with sample names as column labels and each row contains the abundance count of a single taxa for different samples, with NCBI scientific names as row labels. The content of an example abundance table in the CSV format is given below:

$ cat table.csv
Samples,S10,S11,S12,S13,S14,S15,S16,S17,S18
Anaplasma,17778,5217,15057,5915,10954,8287,9081,6681,7732
Anoxybacillus,8501,2522,6661,3061,8212,5183,4931,4153,4238
Aquifex,3166,1043,3122,1314,2488,1736,2081,1757,1855
Arcanobacterium,10834,4200,9202,4146,8338,7136,6588,8326,7119
Arcobacter,40047,11125,32711,14318,32274,19156,21298,14374,17843
Aromatoleum,80681,33713,96236,45869,74400,43648,80538,52901,69999
Arthrobacter,284594,158455,347003,154873,187129,203726,260902,441160,350738
Asticcacaulis,153936,51937,142155,54750,74398,55042,85625,67303,80820
Candidatus Azobacteroides,6416,1742,5273,1716,7051,2597,2010,1412,1903
Cardiobacterium,37985,14719,41404,19232,35920,19748,31369,21350,28477
Carnobacterium,12900,3749,9710,4240,13996,8205,5980,5253,6272
Catenulispora,81442,38432,84218,40507,40604,32762,52401,76014,56052
Caulobacter,564129,198191,532423,208191,267620,185704,313987,257678,301914
Dehalogenimonas,4947,2035,5484,2472,3781,2734,3803,3237,3316
Deinococcus,131919,54170,141610,64032,91286,62082,94371,82067,89127
Delftia,442227,170984,496215,193110,324066,166280,313520,201613,281050
Denitrovibrio,6894,2362,6361,2548,4966,3746,4064,3122,3314
Desmospora,6796,2505,6687,3143,5408,3883,4802,4263,4648
Desulfarculus,26624,11289,29870,13776,19292,12192,20433,16735,18862
Desulfatibacillum,14464,5880,15052,6901,11128,7566,10404,7827,9380

Having generated/obtained the abundance table as table.csv, one can then use it with PHYLOmap as follows:

$ python ~/bin/PHYLOmap.py -i table.csv -o test -s --font_size 100 -w 100
Generated the data file: test/data.csv

Creating the upload params
ncbiIDs : Anaplasma Anoxybacillus Aquifex Arcanobacterium Arcobacter Aromatoleum Arthrobacter Asticcacaulis Candidatus_Azobacteroides Cardiobacterium Carnobacterium Catenulispora Caulobacter Dehalogenimonas Deinococcus Delftia Denitrovibrio Desmospora Desulfarculus Desulfatibacillum

Uploading NCBI's scientific names to http://itol.embl.de/ncbi_tree_generator.cgi. This may take some time depending on how many scientific names are there and how much load there is on the itol server
Generated the tree in newick format: test/data.nwk

Creating the upload params
treeName: PHYLOmap_tree
dataset1MaxPointColor: #007FFF
treeFormat: newick
dataset1MinPointColor: #FFFFFF
dataset1File: test/data.csv
dataset1Label: PHYLOmap_label
dataset1Type: heatmap
dataset1MidPointValue: 391.0
treeFile: test/data.nwk
dataset1Separator: comma
dataset1HeatmapBoxWidth: 100
dataset1MinPointValue: 32.0
dataset1MidPointColor: #99CCFF
dataset1MaxPointValue: 752.0

Uploading the tree to http://itol.embl.de/batch_uploader.cgi. This may take some time depending on how large the tree is and how much load there is on the itol server
Tree ID: 1302096432628013960722840
iTOL output: SUCCESS: 1302096432628013960722840

Tree Web Page URL: http://itol.embl.de/external.cgi?tree=1302096432628013960722840&restore_saved=1
SUCCESS: 1302096432628013960722840

Downloading data from http://itol.embl.de/batch_downloader.cgi. This may take some time depending on how large the tree is and how much load there is on the itol server

Creating export params
datasetList: dataset1
format: pdf
tree: 1302096432628013960722840
omitDashedLines: 1
displayMode: circular
fontSize: 100
lineWidth: 1
scaleFactor: 0.5

Exported tree to test/data.pdf

The program displays the parameters used with iTOL’s API and generates a PDF document of the phylogenetic tree with the heatmap on it: NCBI_normal.png

By pasting the URL (http://itol.embl.de/external.cgi?tree=1302096432628013960722840&restore_saved=1) generated by PHYLOmap in the web browser, the tree can be manipulated interactively on iTOL's website. For example, the interface given below is the typical iTOL's user interface displaying the tree we just generated. One can move/pan the tree, zoom in to get a closer view at a specific part of the heatmap, to display the abundance count and the sample it is from on a mouse-over event, show/hide leaf nodes, prune/collapse/rotate the branches/clades, and change tree display options. Furthermore, the saved views can be shared with other users by simply passing on the generated URL which is kept on the website for a limited period of time. iTOL.png

Abundance tables with any other taxonomy

There are some software that do not support NCBI taxonomy, for example, CREST and RDP classifiers. These are widely used bioinformatic programs that perform taxonomic classification of 16S/18S rRNA gene sequences. Instead of using NCBI taxonomy, they use a more controlled and error-fixed taxonomy (SilvaMod, Greengenes), sometimes with groups that are not in NCBI. We can still use them with PHYLOmap by generating the tree in the Newick format from taxonomic paths stored in the assignment files returned from such software, and then providing the tree as an input to the program using -r switch.

Abundance table from CREST

The following two commands will produce an assignment fle named input_file_Assignments.txt for input_file.fa:

$ megablast -i input_file.fa -b 50 -v 50 -m 7 -d $PATH_TO_CREST_DB/silvamod.fasta -a 10 -o input_file.xml
$ classify -i input_file.fa -a -p input_file.xml

To extract an abundance table and the corresponding tree from CREST, the data should first be organized to follow the folder structure shown below: folder_structure.png

This organisation makes it easier to run repetitive functionality on the subfolders, each containing an assignment file for a sample with sample names as folder names or what will appear in the abundance table. Moreover, with this structure, we can use the collateResults.pl utility that: searches subfolders for files with names matching a certain pattern; extracts the quantitative information stored in a delimited format in them; and then merges the information from all of the subfolders together. The following bash one-liner will produce the abundance count of taxa (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:

$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk

The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:

$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv

The bash one-liner generates a test.csv file in each subfolder, which is then used in the second command to collate the results together to produce collated.csv.

Abundance table from RDP

The following command will produce an output file named input_file_Assignments.txt for input_file.fa:

$ java -Xmx1g -jar classifier.jar classify -f filterbyconf -o input_file_Assignments.txt input_file.fa 

We follow the same folder structure as mentioned before. The bash one-liners for RDP are slightly different due to the different format of the assignment files, however they follow the same philosophy. The following bash one-liner will produce the abundance count of taxa (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:

$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk

The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:

$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv

For producing trees at other taxonomic levels, in the first awk statement in the above bash one-liners, use: l=4 for Class, l=5 for Order, l=6 for Family, and l=7 for Genus level breakdown, respectively. These bash one-liners will work even if we run the classifiers on the dereplicated reads in the USEARCH header format i.e. containing the string size=.* in the FASTA headers. If the original dereplicated sequences are in a different format, we first convert them to the USEARCH header format before running the classifiers and using the above bash one-liners. For example, if we have generated the denoised and dereplicated reads from AmpliconNoise, then the following bash one-liner is suggested to convert the FASTA headers:

$ awk '/>/{$0=gensub("(.*?)_(.*?)$","\\1;size=\\2;","g",$0)}1' dereplicated.fa > dereplicated_usearch_format.fa

Please Note: Running the bash one-liners for multiple samples for a different value of l) requires you to deletetest.csv file from individual folders. This can be done by running the command in the main folder: for i in $(ls -d *); do rm $i/test.csv; done

How is the tree generated then?

The workflow for RDP and CREST relies on the Path2Newick.java utility to convert extracted paths to tree in the Newick format. For this purpose, the extracted paths are piped to PATH2Newick.java to generate the Newick tree: Path2Newick.png

OTU tables from USEARCH, VSEARCH and AmpliconNoise

If the Amplicon sequences are processed by USEARCH, VSEARCH, or AmpliconNoise to obtain an OTU table (otu_table.csv) and the corresponding representative sequences (otus.fa), one can either use RDP or CREST classifier as before, but we can also generate a phylogenetic tree by multiple sequence alignment of the OTUs using any of the following two approaches:

Approach 1: QIIME

We'll use align_seq.py to align the OTU sequences against a reference alignment. In QIIME, this reference alignment is core_set_aligned.fasta.imputed and QIIME already knows where it is. The following is done using PyNAST, though the multiple sequence alignment can also be obtained with MUSCLE and Infernal using the same program.

$ align_seqs.py -i otus.fa -o alignment/

This alignment contains lots of gaps, and it includes hypervariable regions that make it difficult to build an accurate tree. So, we'll filter it using filter_alignment.py. Filtering an alignment of 16S rRNA gene sequences can involve a Lane mask. In QIIME, this Lane mask for the GG core is lanemask_in_1s_and_0s

$ filter_alignment.py -i alignment/otus_aligned.fasta -o alignment

The make_phylogeny.py script uses the FastTree, an approximately maximum likelihood program and a good model of evolution for 16S rRNA gene sequences. We will use it as follows:

$ make_phylogeny.py -i alignment/otus_aligned_filtered.fasta -o otus.nwk

Approach 2: MAAFT and FastTree

Use the following commands to generate the phylogenetic tree in newick format:

$ mafft-ginsi otus.fa > otus.gfa
$ FastTree -nt -gtr < otus.gfa > otus.nwk

Output for example datasets along with the parameters used with the program

Once you have generated the trees based on the one-liners given above, you can test the program as follows:

NCBI circular

Command:

$ python ~/bin/PHYLOmap.py -i table.csv -o test -s -w 15 -l NCBI -t NCBI_tree --font_size 20 --line_width 1 --scale_factor 0.2 -x "#00A849" -y "#008A3C" -z "#005223"

Ouput:

NCBI_circular.jpg

RDP normal (Family level)

Command:

$ python ~/bin/PHYLOmap.py -i collated.csv -o collated collated.nwk --font_size 20 --line_width 1 -s -w 15 -l RDP -t RDP_tree -r collated.nwk --display_mode normal --font_size 20 --line_width 1 --scale_factor 0.1 -x "#0078FF" -y "#003E85" -z "#001C3B"

Output:

RDP_normal.jpg

RDP circular (Family level)

Command:

$ python ~/bin/PHYLOmap.py -i collated.csv -o collated -s -w 15 -l RDP -t RDP_tree -r collated.nwk --font_size 20 --line_width 1 --scale_factor 0.2 -x "#0078FF" -y "#003E85" -z "#001C3B"

Output:

RDP_circular.jpg

CREST normal (Family level)

Command:

$ python ~/bin/PHYLOmap.py -i collated.csv -o collated -s -w 15 -l CREST -t CREST_tree -r collated.nwk --display_mode normal --font_size 20 --line_width 1 --scale_factor 0.1 -x "#FAAB00" -y "#916300" -z "#452F00"

Output:

CREST_normal.jpg

CREST circular (Family level)

Command:

$ python ~/bin/PHYLOmap.py -i collated.csv -o collated -s -w 15 -l CREST -t CREST_tree -r collated.nwk --font_size 20 --line_width 1 --scale_factor 0.2 -x "#FAAB00" -y "#916300" -z "#452F00"

Output:

CREST_circular.jpg

OTUs normal

Command:

$ python ~/bin/PHYLOmap.py -s -i otu_table.csv -o OTUS -w 20 -l OTUS -t OTUS_tree -r otus.nwk --display_mode normal --font_size 20 --line_width 1 --scale_factor 0.1 -x "#FAAB00" -y "#916300" -z "#452F00"

Output:

OTUS_normal.jpg