LAZYPIPE User Guide

Table of Contents

  1. About Lazypipe
  2. Running on CSC
  3. Installing
  4. Running Lazypipe with lazypipe.pl
  5. Running Lazypipe with Snakemake
  6. Retrieving reads for a contig or taxid
  7. Citing Lazypipe
  8. Contact

About Lazypipe

Lazypipe is a bioinformatics pipeline for analyzing viral and bacterial metagenomes from next-generation sequencing (NGS) data.

Figure 1. Lazypipe workflow

Lazypipe supports:

  • fastq preprocessing
  • de novo assembly
  • taxonomic binning
  • taxonomic profiling
  • reporting
    • mapped contigs sorted by taxa
    • virus contigs
    • unmapped contigs
    • contig annotations (tsv and excel)
    • taxon abundances (tsv, excel and Krona graph)
  • quality control plots

Running Lazypipe on CSC

Lazypipe can be tried out quickly using a preinstalled module at CSC, the Finnish IT Center for Science.

Installing Lazypipe

Setting up directories

Create a root directory for storing reference and taxonomy databases. Change /my/data/path/ according to your preferences:

mkdir -p /my/data/path/taxonomy

For convenience, add an environment variable $data that refers to /my/data/path. To do so, open the .bashrc file in your home directory and add this line:

export data=/my/data/path

Load your variables (will autoload on the next login):

source ~/.bashrc
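
Before installing databases, it can be useful to confirm that $data points to a usable directory. The function below is a minimal sketch and not part of Lazypipe; the name check_data_root is hypothetical, and it only looks for the taxonomy subdirectory created above.

```shell
# Report whether a database root directory is usable: the path must be
# non-empty, contain a taxonomy subdirectory, and be writable.
# check_data_root is a hypothetical helper, not a Lazypipe command.
check_data_root() {
    root=$1
    [ -n "$root" ] || { echo "data variable is not set"; return 1; }
    [ -d "$root/taxonomy" ] || { echo "missing directory: $root/taxonomy"; return 1; }
    [ -w "$root" ] || { echo "not writable: $root"; return 1; }
    echo "ok: $root"
}

# Example call:
# check_data_root "$data"
```

The function prints a single status line and returns non-zero on the first problem it finds, so it can also be used as a guard at the top of an installation script.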

Cloning the repository

git clone https://bitbucket.org/plyusnin/lazypipe.git
cd lazypipe

Installing dependencies

Installing dependencies with Conda

conda create -n blast -c bioconda blast
conda create -n lazypipe -c bioconda -c eclarke bwa centrifuge csvtk fastp krona megahit mga minimap2 samtools seqkit spades snakemake-minimal taxonkit trimmomatic numpy scipy fastcluster requests

Or from conda yaml files:

conda env create -f blast.yml
conda env create -f lazypipe.yml

This will create a separate conda environment for blast. All other tools are installed under lazypipe. To activate all installed binaries, type:

conda activate blast
conda activate --stack lazypipe

Set the taxonomy database location for Krona (replace $CONDA_PREFIX and $data according to your settings):

rm -rf $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy
ln -s $data/taxonomy $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy

Set the environment variable $TM to point to the Trimmomatic directory:

export TM=$CONDA_PREFIX/share/trimmomatic

Download PANNZER (version 02/2022 or later), add an interpreter line to runsanspanz.py, and link the script into a directory on your path:

wget http://ekhidna2.biocenter.helsinki.fi/sanspanz/SANSPANZ.3.tar.gz
tar -zxvf SANSPANZ.3.tar.gz
sed -i "1 i #!$(which python)" SANSPANZ.3/runsanspanz.py
ln -sf  $(pwd)/SANSPANZ.3/runsanspanz.py ~/bin/runsanspanz.py

Installing dependencies manually

Download and unpack dependencies listed in Table 1. Then copy or link these executables to your ~/bin folder. For example:

wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
tar -xjvf minimap2-2.24_x64-linux.tar.bz2
cp minimap2-2.24_x64-linux/minimap2 ~/bin/

Note that Snakemake requires conda for installation (for details see https://snakemake.readthedocs.io/):

conda create -c bioconda -n snakemake snakemake-minimal
conda activate snakemake

| Tool | Website | Download binaries | Original article | CSC environment |
|------|---------|-------------------|------------------|-----------------|
| [blast] | https://blast.ncbi.nlm.nih.gov/ | blast+/LATEST/ | https://doi.org/10.1186/1471-2105-10-421 | biokit module |
| bwa-mem | https://github.com/lh3/bwa | bio-bwa/files | https://arxiv.org/abs/1303.3997 | biokit module |
| [Centrifuge] | https://ccb.jhu.edu/software/centrifuge/ | centrifuge-1.0.3-beta-Linux_x86_64.zip | https://doi.org/10.1101/gr.210641.116 | NA |
| csvtk | https://bioinf.shenwei.me/csvtk/ | csvtk/download | | NA |
| fastp | https://github.com/OpenGene/fastp | http://opengene.org/fastp/fastp | https://doi.org/10.1093/bioinformatics/bty560 | NA |
| KronaTools | https://github.com/marbl/Krona/wiki/KronaTools | NA | https://doi.org/10.1186/1471-2105-12-385 | biokit module |
| MEGAHIT | https://github.com/voutcn/megahit/ | MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz | https://doi.org/10.1016/j.ymeth.2016.02.020 | biokit module |
| MGA | http://metagene.nig.ac.jp/metagene/ | http://metagene.nig.ac.jp/metagene/download_mga.html | https://doi.org/10.1093/nar/gkl723 | NA |
| minimap2 | https://github.com/lh3/minimap2 | minimap2-2.24_x64-linux.tar.bz2 | https://doi.org/10.1093/bioinformatics/bty191 | biokit module |
| PANNZER/SANS | http://ekhidna2.biocenter.helsinki.fi/sanspanz/ | SANSPANZ.3.tar.gz | https://doi.org/10.1002/pro.4193 | biokit module |
| TaxonKit | https://bioinf.shenwei.me/taxonkit/ | taxonkit/releases/tag/v0.9.0 | https://doi.org/10.1016/j.jgg.2021.03.006 | NA |
| [Trimmomatic] | https://github.com/usadellab/Trimmomatic | v0.39.tar.gz | https://doi.org/10.1093/bioinformatics/btu170 | biokit module |
| Samtools | http://www.htslib.org/ | samtools-1.14.tar.bz2 | https://doi.org/10.1093/gigascience/giab008 | biokit module |
| SeqKit | https://bioinf.shenwei.me/seqkit/ | seqkit_linux_amd64.tar.gz | https://doi.org/10.1371/journal.pone.0163962 | NA |
| Snakemake | https://snakemake.readthedocs.io/ | NA | https://doi.org/10.12688/f1000research.29032.2 | NA |
| [SPAdes] | https://github.com/ablab/spades | SPAdes-3.15.3-Linux.tar.gz | https://doi.org/10.1002/cpbi.102 | biokit module |

Table 1. Lazypipe dependencies. Tools in square brackets are not required for basic Lazypipe runs. When installed, they provide additional options and functionality.
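
After either installation route, it can help to confirm that the binaries are actually reachable from your shell. The sketch below is not part of Lazypipe: check_tools is a hypothetical helper, and the binary names in the example call are assumptions based on the package names above, so adjust them to your installation.

```shell
# Print the name of every tool that cannot be found on PATH.
# Returns 0 if all tools are found, 1 otherwise.
# check_tools is a hypothetical helper, not a Lazypipe command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (binary names are assumptions, adjust as needed):
# check_tools bwa csvtk fastp megahit mga minimap2 samtools seqkit taxonkit snakemake runsanspanz.py
```

Running the check inside the activated conda environments shows at a glance which tools still need to be installed or linked into ~/bin.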

Installing Perl modules

Install modules to the local-lib directory ~/perl5:

cpan --local-lib=~/perl5 File::Basename File::Temp Getopt::Long YAML::Tiny
export PERL5LIB=~/perl5/lib/perl5:${PERL5LIB}
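
To confirm that Perl can load the modules with the current PERL5LIB setting, each module can be loaded in a one-line Perl call. This is a minimal sketch; check_perl_modules is a hypothetical helper name, not something shipped with Lazypipe.

```shell
# Try to load each Perl module and report whether it is available.
# Returns 0 if all modules load, 1 otherwise.
# check_perl_modules is a hypothetical helper, not a Lazypipe command.
check_perl_modules() {
    status=0
    for mod in "$@"; do
        if perl -M"$mod" -e 1 2>/dev/null; then
            echo "ok: $mod"
        else
            echo "missing: $mod"
            status=1
        fi
    done
    return "$status"
}

# Example call with the modules installed above:
# check_perl_modules File::Basename File::Temp Getopt::Long YAML::Tiny
```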

Installing R libraries

Open an R console and type:

install.packages( c("reshape","openxlsx") );

Installing reference databases

Download and install reference databases using Table 2 and the following instructions.

Start by installing NCBI Taxonomy. In config.yaml set local path to taxonomy. Then install by running:

perl perl/install_db.pl --db taxonomy

Running 1st round annotations with SANS or Minimap2 and 2nd round annotations with BLASTN (recommended):

  • SANS: no databases required

  • Minimap2: in config.yaml set local path to minimap_db. Then download and unpack the latest NCBI NT abv minimap database to that location.

  • BLASTN: in config.yaml set local path to blastn_vi_db. Then install by running:

    perl perl/install_db.pl --db blastn_vi
    

Running 1st round annotations with BLASTP or Centrifuge:

  • BLASTP: in config.yaml set local path to blastp_db. Then download your preferred BLAST database to that location.

  • Centrifuge: in your config.yaml set local path to centrifuge_db. Then download and unpack NCBI NT habv centrifuge index to that location.

Running 2nd round annotations for bacteriophages:

  • in config.yaml set local paths to minimap_db_phages and minimap_db_phages_metadata. Then download the Gut Phage Database (GPD_sequences.fa and GPD_metadata.tsv) to these locations.
  • index the database by running:

    minimap2 -t 4 -x asm20 -d GPD_sequences.fa.mmi GPD_sequences.fa
    
| URL | Local path (config.yaml) | Installation | Description |
|-----|--------------------------|--------------|-------------|
| blast/db/ | blastp_db | See NCBI manual | NCBI BLAST nr or similar |
| ref_viruses_rep_genomes.tar.gz | blastn_vi_db | perl/install_db.pl --db blastn_vi | RefSeq viruses representative genomes |
| blast_gb_vi_2023_01_01.tar.gz | blastn_vi_db | perl/install_db.pl --db blastn_vi | NCBI GenBank viruses, complete genomes |
| centrifuge_db_url | centrifuge_db | download and unpack data/nt_2021_12_habv_cent.tar.gz | Centrifuge index with Hsapiens_GRCh38p13 assembly + bacteria + archaea + virus sequences from NCBI nt database |
| minimap_db_url | minimap_db | download and unpack data/YYYY_MM_DD.nt_abv.tar.gz | Archaeal, bacterial and virus sequences from NCBI nt database |
| taxdump.tar.gz | taxonomy | perl/install_db.pl --db taxonomy | NCBI Taxonomy database dump files |
| GPD_sequences.fa.gz | minimap_db_phages | download and index | Gut Phage Database |
| GPD_metadata.tsv | minimap_db_phages_metadata | download | Gut Phage Database |

Table 2. Databases used by Lazypipe.

Test Perl and Snakemake interfaces

Perl interface

perl lazypipe.pl

Snakemake interface

snakemake -np all

Running Lazypipe with lazypipe.pl

lazypipe.pl runs your metagenomic analysis step by step. For example, to run preprocessing and assembly, type:

perl lazypipe.pl -1 data/sample_R1.fq.gz --pipe pre,ass -v

lazypipe.pl command-line options:

| Short | Long | Value | Default | Description |
|-------|------|-------|---------|-------------|
| -1 | --read1 | file | | Paired-end reads, fastq with forward reads (can be gzipped) |
| -2 | --read2 | file | guess from --read1 | Paired-end reads, fastq with reverse reads |
| | --se | | false | Input reads are single-end (SE) reads. Any --read2 file will be ignored |
| -r | --res | dir | results | Results will be printed to res-dir/sample-dir/ |
| -s | --sample | str | --read1 prefix | Results will be printed to res-dir/sample-dir/ |
| | --logs | dir | logs | Logs will be printed to logs-dir/sample-dir/ |
| -t | --numth | int | 8 | Number of threads |
| | --pre | str | fastp | Read preprocessing: fastp, trimm or none |
| | --ass | str | megahit | Assembler: megahit or spades |
| | --ann | str | sans | Homology search used for contig annotation: blastp, sans, centrifuge or minimap |
| | --hostgen | file | | *.fna file containing host genome. Filtering is turned on by --hostgen file -p flt |
| | --hgtaxid | taxid | | NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon |
| -w | --weights | str | bitscore | Model for abundance estimation: taxacount, bitscore or bitscore2 |
| | --config | file | config.yaml | Configuration file for default options |
| -v | | | false | Verbose mode |
| | --clean | | false | Delete intermediate files after each step |
| -p | --pipe | str | main | Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack |

Steps accepted by -p/--pipe:

| Step | Description |
|------|-------------|
| pre or preprocess | Preprocess reads, i.e. filter low quality reads |
| flt or filter | Filter reads mapping to host genome using --hostgen file |
| ass or assemble | Assemble reads to contigs |
| rea or realign | Realign reads to contigs |
| ann or annotate | Annotate contigs with blastp/sans/centrifuge/minimap2 against blastp_db/UniProt/centrifuge_db/minimap_db |
| blastv | Annotate viral contigs with blastn against custom virus database (blastn_vi_db) |
| blastu | Annotate unmapped contigs with blastn against custom virus database (blastn_vi_db) |
| annph | Annotate unmapped contigs with minimap against local bacteriophage database (minimap_db_phages) |
| rep or report | Create reports: abundance/annotation tables + Krona graph + sort contigs by taxa |
| sta or stats | Create assembly stats + QC plots |
| pack | Pack results into a tarball. The tarball will be created in the root directory of --res dir |
| clean | Clean up all intermediate and temporary files |
| main | Run main steps: pre,flt,ass,rea,ann,rep,sta,pack,clean [default] |
| all | Run all steps |

Default options and additional settings are defined in the config.yaml file. Note that command-line options take precedence over options in config.yaml.

Additional options in config.yaml:

| Option | Value | Description |
|--------|-------|-------------|
| GENERAL PARAMETERS | | |
| R_call | str | Rscript or similar for calling R |
| hostgen | file | Path to host genome in fasta/fasta.gz format. Set to 0 to switch off hostgen filtering |
| hostgen_taxid | num | NCBI taxon id for the host genome taxon. When defined, hostgen filtered reads will be assigned to this taxon |
| hostgen_flt_th | num | Minimum alignment score for filtering host genome reads |
| min_gene_length | num | Minimum ORF sequence length for reporting/mapping |
| min_sans_bits | num | Minimum alignment score for mapping ORFs with SANS |
| min_blastp_bits | num | Minimum alignment score for mapping ORFs with BLASTP |
| min_cent_bits | num | Minimum alignment score for mapping contigs with Centrifuge |
| min_minimap_DPpeak_score | num | Minimum DP alignment score for mapping contigs with Minimap2 |
| realign_read_th | num | Minimum alignment score for mapping reads to contigs with BWA MEM |
| tail | percent | Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions |
| cont_score_tail | percent | Remove taxa from contig that correspond to this percentile. Reduces noise in abundance estimation |
| trimm_par | str | Trimmomatic parameters. NOTE: please ensure that the $TM environment variable points to the Trimmomatic installation root |
| fastp_par | str | Fastp parameters |
| res | dir | Results directory root. Results will be printed to res-dir/sample-dir/ |
| logs | dir | Logs directory root. Logs will be printed to logs-dir/sample-dir/ |
| tmpdir | dir | Temporary directory root. Each run will create a designated temporary directory at this location |
| keep_tmpdir | 0/1 | Set to 1 to keep temporary directories |
| DATABASES | | |
| blastp_db | $BLASTDB/nr | Path to local NCBI blastp nr database |
| blastn_vi_db | path | Path to local blastn nucleotide virus database |
| blastn_vi_db_url | url | URL to blastn_vi_db resource |
| centrifuge_db | path | Path to local Centrifuge nucleotide database, e.g. h+a+b+v |
| minimap_db | path | Path to local minimap2 nucleotide database |
| minimap_db_phage | path | Path to local minimap2 bacteriophage database |
| minimap_db_phage_meta | path | Path to minimap_db_phage metainfo. Expecting a TSV file with a header line and a first column with minimap_db_phage sequence ids |
| taxonomy | dir | Path to local NCBI taxonomy database. Database will be installed on demand |
| taxonomy_update | 0/1 | Set to 1 to update NCBI taxonomy db |
| taxonomy_update_time | num | NCBI taxonomy update frequency in days |
| taxonomy_url | str | URL to NCBI taxonomy (taxdump.tar.gz) |
| SNAKEMAKE PARAMETERS | | |
| datain | input fastq files | List of input fastq libraries ordered by sample name, e.g. "M15: M15_R1.fastq". Note: only list forward reads; reverse reads are guessed by substituting the _R1 suffix with _R2 |
| blastn_contigs_vi | 0/1 | Similar to --pipe blastv. Annotate viral contigs with blastn against blastn_vi_db |
| blastn_contigs_un | 0/1 | Similar to --pipe blastu. Annotate unmapped contigs with blastn against blastn_vi_db |
| threads_max | num | Maximum number of threads to use |

Example 1: analyzing sample data with lazypipe.pl

For this example, we will use data/samples/M15small_R*.fastq (paired-end Illumina reads from a mink feces environmental sample).

Run main steps with default options

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main -t 8

Run preprocessing with Trimmomatic

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pre --pre trimm -t 8 -v

Filter host reads with Neovison vison genome

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
mkdir -p $data/hostgen
mv GCA_900108605.1_NNQGG.v01_genomic.fna.gz $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v

Run assembly with SPAdes and realign reads to the assembly

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ass,rea --ass spades -t 8

Run annotation with minimap2 + update reports

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ann,rep --ann minimap -t 8 -v

Confirm virus contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastv -t 8 -v

Search for viruses in unmapped contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastu -t 8 -v

Pack results to *.tar.gz

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pack -v
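
To process several libraries with the same options, the calls above can be generated in a loop. The sketch below only prints the commands so that they can be inspected before anything runs. It assumes that forward-read files are named <sample>_R1.fastq and that the sample id is the file name prefix; print_lazypipe_cmds is a hypothetical helper, not part of Lazypipe.

```shell
# Print one lazypipe.pl command per forward-read file in a directory.
# Assumes forward reads are named <sample>_R1.fastq (assumed layout).
# print_lazypipe_cmds is a hypothetical helper, not a Lazypipe command.
print_lazypipe_cmds() {
    dir=$1
    threads=${2:-8}
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue          # skip if the glob matched nothing
        sample=$(basename "$r1" _R1.fastq)
        echo "perl lazypipe.pl -1 $r1 -s $sample -p main -t $threads"
    done
}

# Inspect the commands first, then pipe them to a shell to run them:
# print_lazypipe_cmds data/samples 8
# print_lazypipe_cmds data/samples 8 | sh
```

Printing rather than executing keeps the loop safe to experiment with. Note that file names containing spaces would need extra quoting before the output is piped to a shell.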

Example 1: generated reports

By default, all results are printed to ./res-dir/sample-dir/, in this case to ./results/M15/:

Assembled contigs and predicted ORFs

| file/dir | description |
|----------|-------------|
| contigs | contigs sorted by taxa |
| contigs.fa | contigs in a single fasta file |
| contigs_un.fa | contigs with no annotation by the main homology search |
| contigs_vi.fa | contigs annotated as virus sequences by the main homology search |
| ORFs.gtf | predicted ORFs in GTF2.2 format |
| ORFs.aa.fa | predicted ORFs as aa sequences |
| ORFs.nt.fa | predicted ORFs as nt sequences |
| scaffolds.fa | scaffolds, if available |

Abundance tables

Figure 2. abund_table.xlsx

Spreadsheets with taxon abundances are printed to abund_table.xlsx. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.

For raw abundance data see abund_table.tsv.

Columns in abund_table.xlsx

| column | description |
|--------|-------------|
| readn | read pairs assigned to this taxon |
| readn_pc | percentage of read pairs assigned to this taxon |
| csum | cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa) |
| csumq | confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable) |
| contign | contigs assigned to this taxon |
| species | species name (NCBI taxonomy) |
| species_id | species taxid (NCBI taxonomy) |
| genus | genus name |
| genus_id | genus taxid |
| family | family name |
| family_id | family taxid |
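
The raw table can also be filtered on the command line. The following is a minimal sketch that assumes abund_table.tsv is tab-separated and has a header row using the column names listed above; that layout is an assumption, so check the header of your own file first. filter_max is a hypothetical helper, not part of Lazypipe.

```shell
# Print the header and all rows of a tab-separated table whose value in the
# named column is less than or equal to a threshold. The column is located
# by its header name. filter_max is a hypothetical helper.
filter_max() {
    awk -F '\t' -v col="$2" -v max="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        $c + 0 <= max + 0
    ' "$1"
}

# Keep only taxa with the most reliable confidence score (csumq of 1),
# assuming the header layout described in the lead-in:
# filter_max results/M15/abund_table.tsv csumq 1
```

Looking the column up by name rather than by position keeps the filter working if the column order differs from the spreadsheet.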

Annotation tables

Figure 3. contigs.annot.xlsx

Spreadsheets with contig annotations are printed to contigs.annot.xlsx. Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. The columns displayed depend on the applied homology search (sans/blastp/minimap2).

Running -p blastv will also print blastn annotation for contigs_vi.fa to contigs_vi.annot.xlsx.

Running -p blastu will also print blastn annotation for contigs_un.fa to contigs_un.annot.xlsx.

For raw annotation data see contigs[_un|_vi].annot.tsv.

Key columns in contigs[_un|_vi].annot.xlsx:

| column | description |
|--------|-------------|
| contig | contig id |
| coverage | contig coverage |
| length | contig length |
| ORF | ORF description in start-end:strand format |
| sseqid | subject sequence id |
| bitscore | alignment score |
| qcov | query coverage |
| scov | subject coverage |
| qlen | query sequence length |
| slen | subject sequence length |
| pide | percent identity |
| lali | alignment length |
| desc | subject description |
| staxid | assigned taxid |
| species | assigned species |
| genus | assigned genus |
| family | assigned family |

Krona graph

Figure 4. krona_graph.html

Estimated taxon abundances are also displayed as an interactive Krona graph: krona_graph.html.

Quality control plots

Figure 5. Read survival plots (QC plots for a number of samples)

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.

| file | description |
|------|-------------|
| qc.read1.jpeg | length histogram for forward reads |
| qc.read2.jpeg | length histogram for reverse reads |
| qc.contigs.jpeg | length histogram for contigs |
| qc.readsurv.jpeg | read survival plots |

Running Lazypipe with Snakemake

Example 2: analyzing sample data

Snakemake works by declaring the end file you wish to produce.

Start by listing your input fastq files under the datain key in the config.yaml file. Prepend each file with its sample id.

For this example, we will use data/samples/M15small. In your config.yaml type:

datain:
    M15: data/samples/M15small_R1.fastq
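
Only forward reads are listed; as noted for the datain option, the reverse-read file is guessed by substituting the _R1 suffix with _R2. The sketch below illustrates that naming convention so that you can check your files before a run. It is an illustration, not Lazypipe's own code, and guess_read2 is a hypothetical helper name.

```shell
# Derive the expected reverse-read file name from a forward-read file name
# by replacing the last "_R1" with "_R2". Returns 1 if the name has no "_R1".
# guess_read2 is a hypothetical helper, not a Lazypipe command.
guess_read2() {
    case "$1" in
        *_R1*) printf '%s\n' "${1%_R1*}_R2${1##*_R1}" ;;
        *)     return 1 ;;
    esac
}

# Example: warn if the guessed reverse-read file does not exist.
# r1=data/samples/M15small_R1.fastq
# r2=$(guess_read2 "$r1") && [ -f "$r2" ] || echo "reverse reads not found for $r1"
```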

Run main steps with default options

snakemake --cores 8 results/M15.tar.gz -p

Run preprocessing with Trimmomatic. Overwrite any trimmed reads produced by previous runs with --force:

snakemake --config pre="trimm" --cores 8 results/M15/trimmed_paired1.fq.gz --force -p

Run assembling with SPAdes. Overwrite any contigs produced by previous runs with --force:

snakemake --config ass="spades" --cores 8 results/M15/contigs.fa --force -p

Redo annotation with minimap2:

snakemake --config ann="minimap" --cores 16 results/M15.tar.gz --force -p

Confirm viral contigs with local blastn:

snakemake --config blastv=1 --cores 16 results/M15/contigs_vi.annot.xlsx -p

Search for viruses in unmapped contigs with local blastn

snakemake --config blastu=1 --cores 16 results/M15/contigs_un.annot.xlsx -p
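
Snakemake accepts several target files in one call, so results for multiple samples listed under datain can be requested together. The sketch below builds the target names following the results/<sample>.tar.gz pattern used above, assuming the default results directory. snakemake_targets is a hypothetical helper and M16 is a hypothetical second sample id.

```shell
# Print the packed-results target for each sample id given as an argument,
# one per line, following the results/<sample>.tar.gz pattern.
# snakemake_targets is a hypothetical helper, not a Lazypipe command.
snakemake_targets() {
    for sample in "$@"; do
        echo "results/$sample.tar.gz"
    done
}

# Example: request both samples in a single Snakemake call.
# snakemake --cores 8 -p $(snakemake_targets M15 M16)
```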

Retrieving reads for a contig or taxid

Start by unzipping your source fastq files:

gunzip -k results/M15/trimmed_paired*_hostflt.fq.gz

Retrieve all reads mapped to contig k141.100 in sample M15

bin/retrieve_reads -c k141.100 -r results/M15 -v

Retrieve all reads mapped to Mamastrovirus (taxid 1239574) in sample M15:

bin/retrieve_reads -t 1239574 -r results/M15 -v

Citing Lazypipe

  1. Plyusnin Ilya, Olli Vapalahti, Tarja Sironen, Ravi Kant, and Teemu Smura. “Enhanced Viral Metagenomics with Lazypipe 2.” Viruses 15, no. 2 (February 4, 2023): 431. https://doi.org/10.3390/v15020431

  2. Ilya Plyusnin, Ravi Kant, Anne J. Jaaskelainen, Tarja Sironen, Liisa Holm, Olli Vapalahti, Teemu Smura. (2020) Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types. Virus Evolution, veaa091, https://doi.org/10.1093/ve/veaa091

Contact

Project website: https://www.helsinki.fi/en/projects/lazypipe

Contact email: grp-lazypipe@helsinki.fi
