export2graphlan is an automatic conversion script from LEfSe, MetaPhlAn2, and HUMAnN input and/or output files, to GraPhlAn. Input file can be also given in BIOM 2.0 format.
The aim of this tool is to support biologists, helping them by automatically write the tree and the annotation file for GraPhlAn.
A graphical representation of how export2graphlan can be used in the analysis pipeline:
export2graphlan is available in Bitbucket (repository) and can be installed using Bioconda by typing:
$ conda install export2graphlan
A taxonomic tree that shows three different classes of oxygen (low, medium, and high), highlighting biomarker clades for each class. Data are taken from the HMP project.
# download the data $ wget http://huttenhower.sph.harvard.edu/webfm_send/129 -O hmp_aerobiosis_small.txt # convert the file in LEfSe format, and run LEfSe $ format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000 $ run_lefse.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res # convert it! $ export2graphlan.py -i hmp_aerobiosis_small.txt -o hmp_aerobiosis_small.res -t tree.txt -a annot.txt --title "HMP aerobiosis" --annotations 2,3 --external_annotations 4,5,6 --fname_row 0 --skip_rows 1,2 --ftop 200 # attach annotation to the tree $ graphlan_annotate.py --annot annot.txt tree.txt outtree.txt # generate the beautiful image $ graphlan.py --dpi 300 --size 7.0 outtree.txt outimg.png --external_legends
The input file is downloaded from The Huttenhower Lab and given to LEfSe for biomarkers discovery. The two file (the LEfSe input and the LEfSe output files) are then passed to export2graphlan. In this case the levels 2 (phylum) and 3 (class) are annotated on the circular tree, while levels 4 (order), 5 (family), and 6 (genus) are put on the external legend.
A taxonomic tree that highlight abundant bodysites present in HMP and MetaHIT studies processed with MetaPhlAn2.
# filter stool samples $ ./get-stool.py hmp_bodysites.txt metahit_sticks.txt mp2/ new-stool/ # cut-out taxon rows in .profile files $ ./clean-taxon.sh new-stool/ # merge tables $ ./merge_metaphlan_tables.py new-stool/* > merge.txt # fix taxonomy $ sed 's/.__//g' merge.txt > merge-good.txt # put the dataset information (e.g., HMP or MetaHIT) $ sed 's/ID/dataset/' merge-good.txt > merge-good-1.txt $ sed 's/MH00[0-9]*.profile/MetaHIT/g' merge-good-1.txt > merge-good-2.txt $ sed 's/V1.[A-Z]*-[0-9]*.profile/MetaHIT/g' merge-good-2.txt > merge-good-3.txt $ sed 's/SRS[0-9]*.profile/HMP/g' merge-good-3.txt > merge-very-good.txt
Starting from the data processed with MetaPhlAn2 we used some ad-hoc scripts (get-stool.py, clean-taxon.sh, and merge_metaphlan_tables.py) to perform some filtering, cleaning, and keeping the dataset belongingness information. In particular: get-stool.py filter out all bodysites that are not stool; clean-taxon.sh removes the rows that contains the t__ in the taxonomy, from the profile files; merge_metaphlan_tables.py merge all the profiles in one single file. Finally, the last thing left is to put the dataset information (HMP or MetaHIT). To accomplish this we used the sed bash command.
# execute LEfSe $ format_input.py merge-very-good.txt merge-very-good.txt.in -c 1 -o 1000000 $ run_lefse.py -l 3.0 merge-very-good.txt.in merge-very-good.txt.out # convert it! $ export2graphlan.py -i merge-very-good.txt -o merge-very-good.txt.out -t tree.txt -a annot.txt --title "MetaHIT vs. HMP (MetaPhlAn2)" --max_clade_size 250 --min_clade_size 40 --annotations 5 --external_annotations 6,7 --abundance_threshold 40.5 --fname_row 0 --ftop 200 --annotation_legend_font_size 11 # attach annotation to the tree $ graphlan_annotate.py --annot annot.txt tree.txt outtree.txt # generate the beautiful image $ graphlan.py --dpi 300 --size 7.0 outtree.txt outimg.svg --external_legends
We called LEfSe for the biomarkers discovery, changing the threshold value to 3.0 for the LDA test. The conversion step has been done by changing the interval from which scaled the size of the nodes (from 20.0 [default] to 250.0). We decided also to annotate on the tree just the 5th level (family) while the 6th and 7th levels (genus and species, respectively) are put on the external annotation legend. We decide to keep the first top 200 clades, highlighting only the ones which abundance level is above 40.5. Finally, we increased a little bit the font size of the annotation legend.
The lasts two steps simply invoke GraPhlAn, first annotating the tree and then generating the image.
After have generate the image, we manually put back some external annotations. This was quite simple to perform, since the only thing that we had to do was look at the image and select clades to move from the legend to the tree. Then, we open the annotation file (annot.txt), look for selected clades and changing *:* with *. We just re-run the last two commands (graphlan_annotate.py and graphlan.py) to get the updated image.
A functional tree that highlight some KEGG metabolic hierarchies, comparing HMP with MetaHIT data processed with HUMAnN.
# merge HUMAnN output files $ ./merge-humann.py hmp.txt 17 metahit.txt 10 metahit_map.txt merge.txt # execute LEfSe $ format_input.py merge.txt merge.txt.in -c 1 -o 1000000 $ run_lefse.py -l 3.0 merge.txt.in merge.txt.out # convert it! $ export2graphlan.py -i merge.txt -o merge.txt.out -t tree.txt -a annot.txt --title "Metabolic pathways" --abundance_threshold 50.0 --external_annotations 3 --background_clades "Metabolism.Metabolism_of_Cofactors_and_Vitamins, Metabolism.Carbohydrate_Metabolism, Metabolism.Amino_Acid_Metabolism, Metabolism.Metabolism_of_Terpenoids_and_Polyketides, Metabolism.Metabolism_of_Other_Amino_Acids, Genetic_Information_Processing.Replication_and_Repair, Environmental_Information_Processing.Membrane_Transport" --background_colors "(150.; 100.; 100.), (55.; 100.; 100.), (280.; 80.; 88.)" --ftop 125 # attach annotation to the tree $ graphlan_annotate.py --annot annot.txt tree.txt outtree.txt # generate the beautiful image $ graphlan.py --dpi 300 --size 7.0 outtree.txt image.svg --external_legends
Also for this image we had to perform a bit of pre-processing, before be able to convert and draw the circular tree.
We had several HUMAnN output file that needed to be merged, preserving the dataset information. To do this used an ad-hoc script: merge-humann.py. After the merging phase, we execute LEfSe (again, changing the threshold value to 3.0 for the LDA test) in order to find biomarkers for functional pathways.
The conversion step has been done by consider the top 125 abundant clades, and filter them to an abundance level of at least 50. We decide to externally annotate the 3rd level (XXX) and to annotate a specific list of KEGG functional systems. We then provide a set of three colours in HSV format to use for the background clades list.
You can find the complete list of parameters along with a brief description in the readme file present in the repository.