TaxMapper

TaxMapper is an analysis tool for reliable mapping of reads to a provided microeukaryotic reference database. It assigns taxonomic information to each NGS read by mapping it to the database and filtering out low-quality assignments. TaxMapper is also part of a metatranscriptome Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation, and (multivariate) statistical analysis including environmental data. The workflow is provided and can easily be adapted for metatranscriptome analysis of any environmental sample.

System Requirements and Installation

TaxMapper requires the following packages and tools:

Python packages

  • numpy
  • pandas
  • matplotlib
  • deepdish

Conda packages

  • rapsearch

To run the complete workflow, the following packages are required in addition:

Conda packages

  • cutadapt
  • snakemake
  • fastqc
  • cairo

R packages

  • r-base
  • r-data.table
  • bioconductor-edgeR
  • r-vegan
  • r-ggplot2
  • rpy2
  • r-reshape2
  • r-gridextra
  • bioconductor-gage
  • bioconductor-pathview

Installation

We recommend installation using conda with the following environment.yaml file:

name: taxmapper
channels:
 - conda
 - anaconda
 - bioconda
 - conda-forge
dependencies:
 - rapsearch
 - cutadapt
 - snakemake
 - fastqc
 - cairo
 - numpy
 - pandas
 - matplotlib
 - deepdish
 - taxmapper
 - r-base
 - r-data.table
 - bioconductor-edgeR
 - r-vegan
 - r-ggplot2
 - rpy2
 - r-reshape2
 - r-gridextra
 - bioconductor-gage
 - bioconductor-pathview
 - r-locfit
 - r-rocr

The conda environment can then be created with:

$ conda env create -f environment.yaml
$ source activate taxmapper
(taxmapper) $ taxmapper --help

Alternatively, you can download the source code from Bitbucket and install it using the setup script:

$ git clone https://bitbucket.org/dbeisser/taxmapper
$ cd taxmapper
/taxmapper$ python setup.py install

In this case, you have to install the requirements listed above yourself.

First Steps

TaxMapper

TaxMapper comes with a database, a workflow and a test dataset. It can be run as a stand-alone tool or as part of the workflow. The first time it is launched, the reference database is downloaded and indexed (at the path given with -d); this may take some time. For example, TaxMapper can be started with a read length of 100 and 4 threads on the files in the folder fastq with:

(taxmapper) $ taxmapper run -d ../databases/taxonomy/meta_database.db -m 100 -f fastq -t 4

Alternatively, all supplementary files (databases, test data, workflow) can be downloaded from Bitbucket, and the provided bash script can be used to create the database index and start TaxMapper on the test dataset:

(taxmapper) $ wget https://bitbucket.org/dbeisser/taxmapper_supplement/get/supplement.zip
(taxmapper) $ unzip supplement.zip
(taxmapper) $ mv dbeisser-taxmapper_supplement-854f5f60158a/* .
(taxmapper) $ cd testdata
(taxmapper) /testdata $ test.sh

The script test.sh runs TaxMapper in default mode, which returns a taxonomic assignment (taxonomic supergroup, group and lineage) for each read of each sample in the files <sample>_taxa_filtered.tsv. The results are combined into count and frequency matrices and visualized as barplots.
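
The step from per-read assignments to count and frequency values can be sketched in a few lines of Python. This is only an illustration and not TaxMapper's own code: the column name `supergroup`, the read IDs and the taxon labels in the example text are assumptions, so check the header of your own <sample>_taxa_filtered.tsv files before adapting it.

```python
import csv
import io
from collections import Counter


def count_taxa(tsv_text, column="supergroup"):
    """Count reads per taxon in one tab-separated table with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


def to_frequencies(counts):
    """Convert absolute counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical example content; real files may use different columns.
example = (
    "read\tsupergroup\n"
    "r1\tAlveolata\n"
    "r2\tAlveolata\n"
    "r3\tStramenopiles\n"
    "r4\tAlveolata\n"
)

counts = count_taxa(example)
freqs = to_frequencies(counts)
print(counts)
print(freqs)
```

For real data, the text of each sample file would be read from disk and the resulting per-sample counts collected into one table with samples as columns.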

Each TaxMapper module (search, map, filter, count, plot) can also be run separately. See:

(taxmapper) $ taxmapper --help
(taxmapper) $ taxmapper filter --help

for help.

Workflow

To start the metatranscriptome workflow, launch the Snakemake workflow snakefile.sm found in the folder snakemake/. Please note that all required databases (such as a UniProt snapshot) are downloaded automatically. This takes up about 10 GB of additional disk space, and the download may take some time.

(taxmapper) $ cd snakemake
(taxmapper) /snakemake $ snakemake -s snakefile.sm -j 4 --resources io=3

This example uses four processor cores and restricts how many resource-demanding jobs run at the same time. The computation can be sped up by using more cores and resources. However, please note that this also increases the memory footprint of TaxMapper (using 4 cores works with 16 GB RAM). Large datasets should therefore be analysed on a cluster or workstation with sufficient processors and memory.
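
In Snakemake, -j sets the number of cores, and --resources io=3 caps the sum of the io values declared by rules that run concurrently; io is a resource name chosen by the workflow, not a built-in. The following minimal sketch builds the call shown above with adjustable values. The function name snakemake_command is hypothetical and only meant as a starting point for scripting the workflow.

```python
import shlex


def snakemake_command(snakefile="snakefile.sm", cores=4, io=3):
    """Build the snakemake call from the text with adjustable cores and io limit."""
    if cores < 1 or io < 1:
        raise ValueError("cores and io must be positive integers")
    return [
        "snakemake",
        "-s", snakefile,             # workflow file
        "-j", str(cores),            # number of cores to use
        "--resources", f"io={io}",   # upper limit for the 'io' resource
    ]


cmd = snakemake_command(cores=8, io=3)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from within the snakemake directory.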

The main outputs include:

  1. the above-mentioned TaxMapper results for the taxonomic assignment: folder taxmapper
  2. FastQC reports and cleaned FASTQ files produced during preprocessing: folders fastqc and cleaned
  3. KO (KEGG Orthology) counts and gene/KO/pathway annotations from the functional annotation: folder annotation
  4. a final filtered table with the taxonomic and functional annotation for each sample: <sample>complete_filtered<evalue>.tsv
  5. RDA and PCA plots for the taxonomic and functional level
  6. differential gene expression, GAGE pathway and KEGG pathway enrichment results. In the case of the test dataset the last two results are empty, since no significant results exist due to the small subset of reads. For completeness and as an example use case the rules are still executed in the workflow.

For your own data the workflow can be customized by editing the config.yaml file in the snakemake directory and/or editing the snakefile.

Running TaxMapper on a cluster

The example workflow shipped with TaxMapper is intended to be run on either a compute server or a powerful workstation. Snakemake, however, also supports execution on clusters and on cloud computing services.

Conda installation

In some environments, installing conda might be complicated due to a lack of privileges.

To avoid this, conda can be installed to an arbitrary path where you have write access, for example to /tmp. If you cannot add this installation folder to the PATH variable, you can call the installed programs directly:

~ $ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
~ $ chmod +x Miniconda3-latest-Linux-x86_64.sh
~ $ ./Miniconda3-latest-Linux-x86_64.sh

[...]
Do you approve the license terms? [yes|no]
>>> yes

Miniconda3 will now be installed into this location:
/home/me/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/me/miniconda3] >>> /tmp/miniconda3
PREFIX=/tmp/miniconda3
[...]
installation finished.
Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/me/.bashrc ? [yes|no]
[no] >>> no

~ $ cd taxmapper
~/taxmapper $ /tmp/miniconda3/bin/conda env create -f environment.yaml
~/taxmapper $ /tmp/miniconda3/envs/taxmapper/bin/snakemake

Snakemake cluster configuration

Details of the cluster environment, such as the account name and cluster submission parameters, can be saved in the Snakemake cluster configuration file.
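
As a rough sketch, a cluster configuration file is a JSON (or YAML) file in which the key `__default__` holds values that apply to all rules. The small script below renders such a file; the account name, time limit and memory value are placeholders and must be replaced by the settings of your own cluster.

```python
import json

# Placeholder values: replace them with the settings of your cluster.
cluster_config = {
    "__default__": {
        "account": "my_account",
        "time": "02:00:00",
        "mem": "16G",
    },
}


def render_cluster_config(config):
    """Return the cluster configuration as formatted JSON text."""
    return json.dumps(config, indent=4) + "\n"


rendered = render_cluster_config(cluster_config)
print(rendered)
```

Saved as cluster.json, the values can be referenced in the submission command, for example on a SLURM cluster with snakemake -s snakefile.sm -j 10 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -t {cluster.time} --mem {cluster.mem}". This assumes a Snakemake version that still provides the --cluster-config option; newer releases replace it with profiles.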

Troubleshooting

R library path

Problems with R in conda frequently occur when R or RStudio is already installed on the system. By default, the R installation in conda does not give first priority to the conda R library path. Steps of the workflow may thus fail because older package installations are picked up. You can test which libraries are used with:

(taxmapper) $ R
> .libPaths()

If the conda R library path is at the first position (like "/vol/home/beisser/miniconda3/envs/taxmapper/lib/R/library"), everything is fine. Otherwise, you might want to set it in ~/miniconda3/envs/taxmapper/etc/conda/activate.d/env_vars.sh. If the file does not exist yet, create it with the above name and path and add:

#!/bin/sh

export R_LIBS=~/miniconda3/envs/taxmapper/lib/R/library

Afterwards, deactivate and activate the taxmapper environment again to set the environment variable.
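
The check on the library order can also be scripted. The following sketch assumes you pass in the paths printed by .libPaths() and the prefix of the conda environment; the function name conda_library_first and the example paths are hypothetical.

```python
import os


def conda_library_first(lib_paths, conda_prefix):
    """Return True if the first R library path lies inside the conda environment."""
    if not lib_paths:
        return False
    first = os.path.normpath(os.path.expanduser(lib_paths[0]))
    prefix = os.path.normpath(os.path.expanduser(conda_prefix))
    try:
        return os.path.commonpath([first, prefix]) == prefix
    except ValueError:
        # Raised when absolute and relative paths are mixed.
        return False


# Hypothetical paths: a user library listed before the conda library.
paths = [
    "/home/me/R/x86_64-pc-linux-gnu-library/3.4",
    "/home/me/miniconda3/envs/taxmapper/lib/R/library",
]
wrong_order = conda_library_first(paths, "/home/me/miniconda3/envs/taxmapper")
right_order = conda_library_first(paths[::-1], "/home/me/miniconda3/envs/taxmapper")
print(wrong_order, right_order)
```

In an activated environment, the prefix is available in the CONDA_PREFIX environment variable, and the library paths can be printed one per line with Rscript -e 'cat(.libPaths(), sep="\n")'.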

FAQ

  1. If forward and reverse read overlap, can they be merged for the taxonomic assignment?
    • Note that the alignment step is quite time-consuming and error-prone for short overlaps. A possible solution would be to include an additional rule in the workflow that uses a tool such as PANDAseq (https://github.com/neufeld/pandaseq) to align the reads, and then to provide the assembled sequences as a FASTA file to TaxMapper:
rule merge_pairs:
    input:
        fwd = "sample_R1.fastq",
        rev = "sample_R2.fastq"
    output:
        "sample_combined.fasta"
    shell:
        "pandaseq -f {input.fwd} -r {input.rev} -w {output}"

How to Cite

The manuscript is currently under review; a preprint is available at bioRxiv.