LAZYPIPE User Guide

Table of Contents

About Lazypipe
Running on CSC
Installing
Running Lazypipe with lazypipe.pl
Running Lazypipe with Snakemake
- Example 2: analyzing sample data
Retrieving reads for a contig or taxid
Citing Lazypipe
Contact

About Lazypipe

Lazypipe is a bioinformatic pipeline for analyzing virus and bacteria metagenomics from NGS data.

Figure 1. Lazypipe workflow

Lazypipe supports:

fastq preprocessing
de novo assembling
taxonomic binning
taxonomic profiling
reporting
- mapped contigs sorted by taxa
- virus contigs
- unmapped contigs
- contig annotations (tsv and excel)
- taxon abundances (tsv, excel and KronaGraph)
quality control plots

Running Lazypipe on CSC

Lazypipe can be quickly assessed using a preinstalled module at the Finnish Center of Scientific Computing.

Installing Lazypipe

Setting up directories

Create a root directory for storing reference and taxonomy databases. Change /my/data/path/ according to your preferences (the -p flag also creates any missing parent directories):

mkdir -p /my/data/path/taxonomy

For convenience, add an environment variable $data that refers to your /my/data/path. To do so, open the .bashrc file in your home directory and add this line:

export data=/my/data/path

Load your variables (will autoload on the next login):

source ~/.bashrc
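
Before continuing, it can help to confirm that the variable is set and points to an existing directory. The following is a minimal sketch of such a check; the helper name check_dir_var is hypothetical and not part of Lazypipe.

```shell
# check_dir_var NAME VALUE
# Hypothetical helper (not part of Lazypipe): succeeds only if VALUE is
# non-empty and names an existing directory.
check_dir_var() {
    if [ -z "$2" ]; then
        echo "ERROR: \$$1 is not set" >&2
        return 1
    fi
    if [ ! -d "$2" ]; then
        echo "ERROR: \$$1 points to a missing directory: $2" >&2
        return 1
    fi
    echo "OK: \$$1 = $2"
}

# Usage: report on $data without aborting the shell if the check fails.
check_dir_var data "${data:-}" || echo "Fix \$data in ~/.bashrc and run 'source ~/.bashrc' again"
```

The same helper can be reused for other variables that must point to a directory, for example check_dir_var TM "${TM:-}".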

Cloning the repository

git clone https://plyusnin@bitbucket.org/plyusnin/lazypipe.git
cd lazypipe

Installing dependencies

Installing dependencies with Conda

conda create -n blast -c bioconda blast
conda create -n lazypipe -c bioconda -c eclarke bwa centrifuge csvtk fastp krona megahit mga minimap2 samtools seqkit spades snakemake-minimal taxonkit trimmomatic numpy scipy fastcluster requests

Or from conda yaml files:

conda env create -f blast.yml
conda env create -f lazypipe.yml

This will create a separate conda environment for blast. All other tools are installed under lazypipe. To activate all installed binaries, type:

conda activate blast
conda activate --stack lazypipe

Set taxonomy database location for KronaGraph (replace $CONDA_PREFIX and $data according to your settings):

rm -rf $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy
ln -s $data/taxonomy $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy

Set the environment variable $TM to point to the Trimmomatic directory:

export TM=$CONDA_PREFIX/share/trimmomatic

Download PANNZER (version 02/2022 or later), add a Python shebang line to runsanspanz.py, and link the script into a directory on your PATH (here ~/bin):

wget http://ekhidna2.biocenter.helsinki.fi/sanspanz/SANSPANZ.3.tar.gz
tar -zxvf SANSPANZ.3.tar.gz
sed -i "1 i #!$(which python)" SANSPANZ.3/runsanspanz.py
ln -sf  $(pwd)/SANSPANZ.3/runsanspanz.py ~/bin/runsanspanz.py

Installing dependencies manually

Download and unpack dependencies listed in Table 1. Then copy or link these executables to your ~/bin folder. For example:

wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
tar -xjvf minimap2-2.24_x64-linux.tar.bz2
cp minimap2-2.24_x64-linux/minimap2 ~/bin/

Note that Snakemake requires conda for installation (for details see https://snakemake.readthedocs.io/):

conda create -c bioconda -n snakemake snakemake-minimal
conda activate snakemake

Tool	Website	Download binaries	Original article	CSC environment
[blast]	https://blast.ncbi.nlm.nih.gov/	blast+/LATEST/	https://doi.org/10.1186/1471-2105-10-421	biokit module
bwa-mem	https://github.com/lh3/bwa	bio-bwa/files	https://arxiv.org/abs/1303.3997	biokit module
[Centrifuge]	https://ccb.jhu.edu/software/centrifuge/	centrifuge-1.0.3-beta-Linux_x86_64.zip	https://doi.org/10.1101/gr.210641.116	NA
csvtk	https://bioinf.shenwei.me/csvtk/	csvtk/download		NA
fastp	https://github.com/OpenGene/fastp	http://opengene.org/fastp/fastp	https://doi.org/10.1093/bioinformatics/bty560	NA
KronaTools	https://github.com/marbl/Krona/wiki/KronaTools	NA	https://doi.org/10.1186/1471-2105-12-385	biokit module
MEGAHIT	https://github.com/voutcn/megahit/	MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz	https://doi.org/10.1016/j.ymeth.2016.02.020	biokit module
MGA	http://metagene.nig.ac.jp/metagene/	http://metagene.nig.ac.jp/metagene/download_mga.html	https://doi.org/10.1093/nar/gkl723	NA
minimap2	https://github.com/lh3/minimap2	minimap2-2.24_x64-linux.tar.bz2	https://doi.org/10.1093/bioinformatics/bty191	biokit module
PANNZER/SANS	http://ekhidna2.biocenter.helsinki.fi/sanspanz/	SANSPANZ.3.tar.gz	https://doi.org/10.1002/pro.4193	biokit module
TaxonKit	https://bioinf.shenwei.me/taxonkit/	taxonkit/releases/tag/v0.9.0	https://doi.org/10.1016/j.jgg.2021.03.006	NA
[Trimmomatic]	https://github.com/usadellab/Trimmomatic	v0.39.tar.gz	https://doi.org/10.1093/bioinformatics/btu170	biokit module
Samtools	http://www.htslib.org/	samtools-1.14.tar.bz2	https://doi.org/10.1093/gigascience/giab008	biokit module
SeqKit	https://bioinf.shenwei.me/seqkit/	seqkit_linux_amd64.tar.gz	https://doi.org/10.1371/journal.pone.0163962	NA
Snakemake	https://snakemake.readthedocs.io/	NA	https://doi.org/10.12688/f1000research.29032.2	NA
[SPAdes]	https://github.com/ablab/spades	SPAdes-3.15.3-Linux.tar.gz	https://doi.org/10.1002/cpbi.102	biokit module

Table 1. Lazypipe dependencies. Tools in square brackets are not required for basic Lazypipe runs. When installed, they provide additional options and functionality.
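
After a manual or conda installation, a quick check of which executables are visible on your PATH can save a failed run later. The sketch below is illustrative only: the helper missing_tools is hypothetical, and the executable names are assumed from the package names in the conda command above, so adjust the list to your installation.

```shell
# missing_tools TOOL...
# Hypothetical helper: prints each named executable that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions based on the conda package names;
# the list is illustrative and not an official requirement list.
required="bwa csvtk fastp megahit mga minimap2 samtools seqkit taxonkit runsanspanz.py"

# $required is left unquoted on purpose so that it splits into one word per tool.
missing=$(missing_tools $required)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All listed tools found"
fi
```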

Installing Perl modules

Install modules to local-lib ~/perl5

cpan --local-lib=~/perl5 File::Basename File::Temp Getopt::Long YAML::Tiny
export PERL5LIB=~/perl5/lib/perl5:${PERL5LIB}

Installing R libraries

Open an R console and type:

install.packages( c("reshape","openxlsx") );

Installing reference databases

Download and install reference databases using Table 2 and the following instructions.

Start by installing NCBI Taxonomy. In config.yaml set local path to taxonomy. Then install by running:

perl perl/install_db.pl --db taxonomy

Running 1st round annotations with SANS or Minimap2 and 2nd round annotations with BLASTN (recommended):

SANS: no databases required
Minimap2: in config.yaml set local path to minimap_db. Then download and unpack the latest NCBI NT abv minimap database to that location.
BLASTN: in config.yaml set local path to blastn_vi_db. Then install by running:
```
perl perl/install_db.pl --db blastn_vi
```

Running 1st round annotations with BLASTP or Centrifuge:

BLASTP: in config.yaml set local path to blastp_db. Then download your preferred BLAST database to that location.
Centrifuge: in your config.yaml set local path to centrifuge_db. Then download and unpack NCBI NT habv centrifuge index to that location.

Running 2nd round annotations for bacteriophages:

In config.yaml set local paths to minimap_db_phages and minimap_db_phages_metadata. Then download the Gut Phage Database (GPD_sequences.fa and GPD_metadata.tsv) to these locations.

Index the sequence file by running:

minimap2 -t 4 -x asm20 -d GPD_sequences.fa.mmi GPD_sequences.fa

URL	Local path (config.yaml)	Installation	Description
blast/db/	`blastp_db`	See NCBI manual	NCBI BLAST nr or similar
`ref_viruses_rep_genomes.tar.gz`	`blastn_vi_db`	`perl/install_db.pl --db blastn_vi`	RefSeq viruses representative genomes
`blast_gb_vi_2023_01_01.tar.gz`	`blastn_vi_db`	`perl/install_db.pl --db blastn_vi`	NCBI GenBank Viruses Complete genomes
`centrifuge_db_url`	`centrifuge_db`	download and unpack `data/nt_2021_12_habv_cent.tar.gz`	centrifuge index with Hsapiens_GRCh38p13 assembly + bacteria + archaea + virus sequences from NCBI nt database
`minimap_db_url`	`minimap_db`	download and unpack `data/YYYY_MM_DD.nt_abv.tar.gz`	Archaeal, bacterial and virus sequences from NCBI nt database
taxdump.tar.gz	`taxonomy`	`perl/install_db.pl --db taxonomy`	NCBI Taxonomy database dump files
GPD_sequences.fa.gz	`minimap_db_phages`	download and index	Gut phage database
GPD_metadata.tsv	`minimap_db_phages_metadata`	download	Gut phage database

Table 2. Databases used by Lazypipe.

Test Perl and Snakemake interfaces

Perl interface

perl lazypipe.pl

Snakemake interface

snakemake -np all

Running Lazypipe with lazypipe.pl

lazypipe.pl runs your metagenomic analysis step by step. For example, to run preprocessing and assembly, type:

perl lazypipe.pl -1 data/sample_R1.fq.gz --pipe pre,ass -v

lazypipe.pl command-line options:

Short	Long	Value	Default	Description
-1	--read1	file		Paired-end reads, fastq with forward reads (can be gzipped)
-2	--read2	file	guess from --read1	Paired-end reads, fastq with reverse reads
	--se		false	Input reads are SE-reads. Any --read2 file will be ignored
-r	--res	dir	results	Results will be printed to res-dir/sample-dir/
-s	--sample	str	--read1 prefix	Results will be printed to res-dir/sample-dir/
	--logs	dir	logs	Logs will be printed to logs-dir/sample-dir/
-t	--numth	int	8	Number of threads
	--pre	str	fastp	Use fastp|trimm|none to preprocess reads
	--ass	str	megahit	Assembler: megahit|spades
	--ann	str	sans	Homology search used for contig annotation: blastp|sans|centrifuge|minimap
	--hostgen	file		*.fna file containing host genome. Filtering is turned on by --hostgen file -p flt
	--hgtaxid	taxid		NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon
-w	--weights	str	bitscore	Model for abundance estimation: taxacount|bitscore|bitscore2
	--config	file	config.yaml	Configuration file for default options
-v			false	Verbose mode
	--clean		false	Delete intermediate files after each step
-p	--pipe	str	main	Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack
		pre|preprocess		Preprocess reads, i.e. filter low quality reads
		flt|filter		Filter reads mapping to host genome using --hostgen file
		ass|assemble		Assemble reads to contigs
		rea|realign		Realign reads to contigs
		ann|annotate		Annotate contigs with blastp/sans/centrifuge/minimap2 against blastp_db/UniProt/centrifuge_db/minimap_db.
		blastv		Annotate viral contigs with blastn against custom virus database (blastn_vi_db).
		blastu		Annotate unmapped contigs with blastn against custom virus database (blastn_vi_db).
		annph		Annotate unmapped contigs with minimap against local bacteriophage database (minimap_db_phages)
		rep|report		Create reports: abundance/annotation tables + Krona graph + sort contigs by taxa
		sta|stats		Create assembly stats + QC plots
		pack		Pack results into a tarball. Tarball will be created to the root directory of --res dir.
		clean		Clean up all intermediate and temporary files.
		main		Run main steps: pre,flt,ass,rea,ann,rep,sta,pack,clean [default]
		all		Run all steps

Default options and additional settings are defined in config.yaml file. Note that command line options take precedence over options in config.yaml file.
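
The precedence rule can be pictured with a small sketch. It is not Lazypipe's actual option parsing: the helpers config_get and effective_option are hypothetical, and the config file is assumed to contain flat "key: value" lines without quotes or trailing comments.

```shell
# config_get KEY FILE
# Print the value of the first flat "KEY: value" line in FILE
# (assumed layout; quoted values or trailing comments are not handled).
config_get() {
    awk -v key="$1" 'index($0, key ":") == 1 { sub(/^[^:]*: */, ""); print; exit }' "$2"
}

# effective_option CLI_VALUE KEY FILE
# Use the command-line value when one was given, otherwise fall back to FILE.
effective_option() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        config_get "$2" "$3"
    fi
}

# Demo with a throwaway config file holding two of the documented keys.
cfg=$(mktemp)
printf 'res: results\nlogs: logs\n' > "$cfg"
effective_option "" res "$cfg"            # no --res given: value comes from the config
effective_option "myresults" res "$cfg"   # --res myresults given: config value is ignored
rm -f "$cfg"
```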

Additional options in config.yaml:

Option	Value	Description
GENERAL PARAMETERS
`R_call`	str	Rscript or similar for calling R
`hostgen`	file	Path to host genome in `fasta/fasta.gz` format. Set to `0` to switch off hostgen filtering.
`hostgen_taxid`	num	NCBI taxon id for the host genome taxon. When defined, hostgen filtered reads will be assigned to this taxon
`hostgen_flt_th`	num	Minimum alignment score for filtering host genome reads
`min_gene_length`	num	Minimum ORF sequence length for reporting/mapping
`min_sans_bits`	num	Minimum alignment score for mapping ORFs with SANS
`min_blastp_bits`	num	Minimum alignment score for mapping ORFs with BLASTP
`min_cent_bits`	num	Minimum alignment score for mapping contigs with Centrifuge
`min_minimap_DPpeak_score`	num	Minimum DP alignment score for mapping contigs with Minimap2
`realign_read_th`	num	Minimum alignment score for mapping reads to contigs with BWA MEM
`tail`	percent	Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions
`cont_score_tail`	percent	Remove taxa from contig that correspond to this percentile. Reduces noise in abundance estimation.
`trimm_par`	str	Trimmomatic parameters. NOTE: please ensure that the `$TM` environment variable points to the Trimmomatic installation root
`fastp_par`	str	Fastp parameters
`res`	dir	Results directory root. Results will be printed to `res-dir/sample-dir/`
`logs`	dir	Logs directory root. Logs will be printed to `logs-dir/sample-dir/`
`tmpdir`	dir	Temporary directory root. Each run will create a designated temporary directory at this location
`keep_tmpdir`	0/1	Set to 1 to keep temporary directories
DATABASES
`blastp_db`	`$BLASTDB/nr`	Path to local NCBI blastp nr database.
`blastn_vi_db`	path	Path to local blastn nucleotide virus database.
`blastn_vi_db_url`	url	URL to `blastn_vi_db` resource
`centrifuge_db`	path	Path to local Centrifuge nucleotide database: e.g. h+a+b+v.
`minimap_db`	path	Path to local minimap2 nucleotide database.
`minimap_db_phage`	path	Path to local minimap2 bacteriophage database.
`minimap_db_phage_meta`	path	Path to `minimap_db_phage` metainfo. Expecting TSV file with header line and first column with `minimap_db_phage` sequence ids.
`taxonomy`	dir	Path to local NCBI taxonomy database. Database will be installed on demand
`taxonomy_update`	0/1	Set to 1 to update NCBI taxonomy db
`taxonomy_update_time`	num	NCBI taxonomy update frequency in days
`taxonomy_url`	str	URL to NCBI taxonomy (taxdump.tar.gz)
SNAKEMAKE PARAMETERS
`datain`	input fastq files	List of input fastq libraries ordered by sample name, e.g. "M15: M15_R1.fastq". Note: only list forward reads, reverse reads are guessed by substituting _R1 suffix with _R2.
`blastn_contigs_vi`	0/1	Similar to `--pipe blastv`. Annotate viral contigs with blastn against `blastn_vi_db`.
`blastn_contigs_un`	0/1	Similar to `--pipe blastu`. Annotate unmapped contigs with blastn against `blastn_vi_db`.
`threads_max`	num	Maximum number of threads to use
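
The _R1 to _R2 substitution described for datain can be checked on your own file names before a run. The function below is a sketch of that naming rule only; guess_read2 is a hypothetical helper, and the exact matching logic inside Lazypipe may differ (here the last "_R1" in the name is replaced).

```shell
# guess_read2 READ1
# Print the reverse-read file name obtained by replacing the last "_R1" in
# READ1 with "_R2"; fail if READ1 contains no "_R1".
guess_read2() {
    case "$1" in
        *_R1*) printf '%s\n' "${1%_R1*}_R2${1##*_R1}" ;;
        *)     echo "ERROR: no _R1 in file name: $1" >&2; return 1 ;;
    esac
}

# Prints data/samples/M15small_R2.fastq
guess_read2 data/samples/M15small_R1.fastq
```

If the printed name does not match your reverse-read file, pass the file explicitly with -2/--read2 when using lazypipe.pl.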

Example 1: analyzing sample data with lazypipe.pl

For this example, we will use data/samples/M15small_R*.fastq (paired-end Illumina reads from a mink feces environmental sample).

Run main steps with default options

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main -t 8

Run preprocessing with Trimmomatic

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pre --pre trimm -t 8 -v

Filter host reads with Neovison vison genome

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
mkdir -p $data/hostgen
mv GCA_900108605.1_NNQGG.v01_genomic.fna.gz $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v

Run assembling with SPAdes + realign reads to assembly

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ass,rea --ass spades -t 8

Run annotation with minimap2 + update reports

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ann,rep --ann minimap -t 8 -v

Confirm virus contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastv -t 8 -v

Search for viruses in unmapped contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastu -t 8 -v

Pack results to *.tar.gz

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pack -v
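
When several libraries need the same treatment, the commands above can be generated in a loop. The following sketch only prints the commands so they can be reviewed first. The helper lazypipe_cmds is hypothetical, and it assumes that the sample name is the file name up to its last "_R1" and that file names contain no spaces.

```shell
# lazypipe_cmds READ1...
# Print (do not run) one lazypipe.pl command per forward-read file.
# Assumption: the sample name is the file name up to its last "_R1".
lazypipe_cmds() {
    for r1 in "$@"; do
        sample=$(basename "$r1")
        sample=${sample%_R1*}
        echo "perl lazypipe.pl -1 $r1 -s $sample -p main -t 8"
    done
}

# Prints the command for the sample library used in this example.
lazypipe_cmds data/samples/M15small_R1.fastq
```

Once the printed commands look right, they can be executed by piping the output to a shell, for example lazypipe_cmds data/samples/*_R1.fastq | sh.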

Example 1: generated reports

By default, all results are printed to ./res-dir/sample-dir/, in this case to ./results/M15/:

Assembled contigs and predicted ORFs

file/dir	description
contigs	contigs sorted by taxa
contigs.fa	contigs in a single fasta file
contigs_un.fa	contigs with no annotation by the main homology search
contigs_vi.fa	contigs annotated as virus sequences by the main homology search
ORFs.gtf	predicted ORFs in GTF2.2 format
ORFs.aa.fa	predicted ORFs as aa sequences
ORFs.nt.fa	predicted ORFs as nt sequences
scaffolds.fa	scaffolds, if available

Abundance tables

Figure 2. abund_table.xlsx

Spreadsheets with taxon abundances are printed to abund_table.xlsx. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.

For raw abundance data see abund_table.tsv.

Columns in abund_table.xlsx

column	description
readn	read pairs assigned to this taxon
readn_pc	percentage of read pairs assigned to this taxon
csum	cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa)
csumq	confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable)
contign	contigs assigned to this taxon
species	species name (NCBI taxonomy)
species_id	species taxid (NCBI taxonomy)
genus	genus name
genus_id	genus taxid
family	family name
family_id	family taxid
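
For quick command-line filtering, a tab-separated table with these columns can be screened with awk. This is a sketch under a stated assumption: the input file must have a header row containing the column names readn and species as listed above. Whether abund_table.tsv has exactly this layout is not documented here, so check the file header first or export a sheet from abund_table.xlsx as TSV. The helper taxa_with_min_reads is hypothetical.

```shell
# taxa_with_min_reads FILE MIN_READN
# Print "species<TAB>readn" for rows whose readn is at least MIN_READN.
# Columns are located by header name, so their order does not matter.
taxa_with_min_reads() {
    awk -F'\t' -v min="$2" '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("readn" in col) || !("species" in col)) {
                print "ERROR: readn/species column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        $col["readn"] + 0 >= min + 0 { print $col["species"] "\t" $col["readn"] }
    ' "$1"
}

# Example call (path and threshold are illustrative):
# taxa_with_min_reads results/M15/abund_table.tsv 10
```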

Annotation tables

Figure 3. contigs.annot.xlsx

Spreadsheets with contig annotations are printed to contigs.annot.xlsx. Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. Columns displayed depend on the applied homology search (sans/blastp/minimap2).

Running -p blastv will also print blastn annotation for contigs_vi.fa to contigs_vi.annot.xlsx.

Running -p blastu will also print blastn annotation for contigs_un.fa to contigs_un.annot.xlsx.

For raw annotation data see contigs[_un|_vi].annot.tsv.

Key columns in contigs[_un|_vi].annot.xlsx:

column	description
contig	contig id
coverage	contig coverage
length	contig length
ORF	orf description in start-end:strand format
sseqid	subject sequence id
bitscore	alignment score
qcov	query coverage
scov	subject coverage
qlen	query sequence length
slen	subject sequence length
pide	percent identity
lali	alignment length
desc	subject description
staxid	assigned taxid
species	assigned species
genus	assigned genus
family	assigned family

Krona graph

Figure 4. krona_graph.html

Estimated taxon abundances are also displayed as an interactive Krona graph: krona_graph.html.

Quality control plots

Figure 5. Read survival plots for a number of samples

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.

file	description
qc.read1.jpeg	length hist for forward reads
qc.read2.jpeg	length hist for reverse reads
qc.contigs.jpeg	length hist for contigs
qc.readsurv.jpeg	read survival plots

Running Lazypipe with Snakemake

Example 2: analyzing sample data

Snakemake works by declaring the end file you wish to produce.

Start by listing your input fastq files under the datain key in the config.yaml file. Prepend each file with its sample id.

For this example, we will use data/samples/M15small. In your config.yaml type:

datain:
    M15: data/samples/M15small_R1.fastq
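
For many libraries, the datain block can be generated instead of typed by hand. The sketch below prints a block in the format shown above. The helper make_datain is hypothetical, and it assumes that a usable sample id is the file name up to its last "_R1"; sample ids are free-form, so edit the output if you prefer shorter ids such as M15.

```shell
# make_datain READ1...
# Print a datain block for config.yaml from forward-read files.
# Assumption: the sample id is the file name up to its last "_R1".
make_datain() {
    echo "datain:"
    for r1 in "$@"; do
        id=$(basename "$r1")
        id=${id%_R1*}
        printf '    %s: %s\n' "$id" "$r1"
    done
}

# Prints a block with the single entry "M15small: data/samples/M15small_R1.fastq".
make_datain data/samples/M15small_R1.fastq
```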

Run main steps with default options

snakemake --cores 8 results/M15.tar.gz -p

Run preprocessing with Trimmomatic. Overwrite any trimmed reads produced by previous runs with --force:

snakemake --config pre="trimm" --cores 8 results/M15/trimmed_paired1.fq.gz --force -p

Run assembling with SPAdes. Overwrite any contigs produced by previous runs with --force:

snakemake --config ass="spades" --cores 8 results/M15/contigs.fa --force -p

Redo annotation with minimap2:

snakemake --config ann="minimap" --cores 16 results/M15.tar.gz --force -p

Confirm viral contigs with local blastn:

snakemake --config blastv=1 --cores 16 results/M15/contigs_vi.annot.xlsx -p

Search for viruses in unmapped contigs with local blastn

snakemake --config blastu=1 --cores 16 results/M15/contigs_un.annot.xlsx -p
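
Because targets follow the pattern results/<sample id>.tar.gz, several samples can be requested in one Snakemake call. The sketch below only prints the command. The helper snakemake_cmd is hypothetical, and it assumes the default results directory and that every sample id is listed under datain in config.yaml.

```shell
# snakemake_cmd CORES SAMPLE...
# Print (do not run) a snakemake call requesting the result tarball of each sample.
snakemake_cmd() {
    cores=$1
    shift
    targets=""
    for s in "$@"; do
        targets="$targets results/$s.tar.gz"
    done
    echo "snakemake --cores $cores$targets -p"
}

# Prints: snakemake --cores 8 results/M15.tar.gz -p
snakemake_cmd 8 M15
```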

Retrieving reads for a contig or taxid

Start by unzipping your source fastq files:

gunzip -k results/M15/trimmed_paired*_hostflt.fq.gz

Retrieve all reads mapped to contig k141.100 in sample M15

bin/retrieve_reads -c k141.100 -r results/M15 -v

Retrieve all reads mapped to Mamastrovirus (taxid 1239574) in sample M15:

bin/retrieve_reads -t 1239574 -r results/M15 -v

Citing Lazypipe

Plyusnin, Ilya, Olli Vapalahti, Tarja Sironen, Ravi Kant, and Teemu Smura. “Enhanced Viral Metagenomics with Lazypipe 2.” Viruses 15, no. 2 (February 4, 2023): 431. https://doi.org/10.3390/v15020431
Ilya Plyusnin, Ravi Kant, Anne J. Jaaskelainen, Tarja Sironen, Liisa Holm, Olli Vapalahti, Teemu Smura. (2020) Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types. Virus Evolution, veaa091, https://doi.org/10.1093/ve/veaa091

Contact

Project website: https://www.helsinki.fi/en/projects/lazypipe

Contact email: grp-lazypipe@helsinki.fi