LAZYPIPE User Guide
Table of Contents
- About Lazypipe
- Running on CSC
- Installing
- Running Lazypipe with lazypipe.pl
- Running Lazypipe with Snakemake
- Retrieving reads for a contig or taxid
- Citing Lazypipe
- Contact
About Lazypipe
Lazypipe is a bioinformatics pipeline for analyzing viral and bacterial metagenomes from NGS data.
Figure 1. Lazypipe workflow
Lazypipe supports:
- fastq preprocessing
- de novo assembling
- taxonomic binning
- taxonomic profiling
- reporting
- mapped contigs sorted by taxa
- virus contigs
- unmapped contigs
- contig annotations (tsv and Excel)
- taxon abundances (tsv, Excel and Krona graph)
- quality control plots
Running Lazypipe on CSC
Lazypipe can be quickly tried out using a preinstalled module at the Finnish Center of Scientific Computing (CSC).
Installing Lazypipe
Setting up directories
Create a root directory for storing reference and taxonomy databases. Change /my/data/path/ according to your preferences:
mkdir -p /my/data/path/taxonomy
For convenience, add an environment variable $data referring to your /my/data/path. To add the variable, locate the .bashrc file in your home directory and add this line to the file:
export data=/my/data/path
Load your variables (will autoload on the next login):
source ~/.bashrc
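Before moving on, it can help to confirm that the variable and directory are in place. The following is a minimal sketch; check_data_dir is a hypothetical helper and not part of Lazypipe:

```shell
# Sketch only: check_data_dir is a hypothetical helper, not part of Lazypipe.
# It reports whether the given data root contains a taxonomy directory.
check_data_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "data variable is not set"
        return 0
    fi
    if [ -d "$dir/taxonomy" ]; then
        echo "ok: $dir/taxonomy"
    else
        echo "missing: $dir/taxonomy"
    fi
}

# Usage: check_data_dir "$data"
```

If the helper prints "missing", rerun the mkdir command above with the correct path.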
Cloning the repository
git clone https://plyusnin@bitbucket.org/plyusnin/lazypipe.git
cd lazypipe
Installing dependencies
Installing dependencies with Conda
conda create -n blast -c bioconda blast
conda create -n lazypipe -c bioconda -c eclarke bwa centrifuge csvtk fastp krona megahit mga minimap2 samtools seqkit spades snakemake-minimal taxonkit trimmomatic numpy scipy fastcluster requests
Or from conda yaml files:
conda env create -f blast.yml
conda env create -f lazypipe.yml
This will create a separate conda environment for blast. All other tools are installed under lazypipe. To activate all installed binaries type:
conda activate blast
conda activate --stack lazypipe
Set taxonomy database location for KronaGraph (replace $CONDA_PREFIX and $data according to your settings):
rm -rf $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy
ln -s $data/taxonomy $CONDA_PREFIX/conda/env/lazypipe/opt/krona/taxonomy
Set env variable $TM to point to trimmomatic directory:
export TM=$CONDA_PREFIX/share/trimmomatic
Download PANNZER (version 02/2022 or later), make runsanspanz.py executable and link it to a directory on your path:
wget http://ekhidna2.biocenter.helsinki.fi/sanspanz/SANSPANZ.3.tar.gz
tar -zxvf SANSPANZ.3.tar.gz
sed -i "1 i #!$(which python)" SANSPANZ.3/runsanspanz.py
ln -sf $(pwd)/SANSPANZ.3/runsanspanz.py ~/bin/runsanspanz.py
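Once the environments are activated, a quick check that the expected binaries are visible on your path can save a failed run later. The sketch below uses a hypothetical helper, missing_tools, that prints every name it cannot find; the tool names in the usage comment are only a subset of the packages listed above:

```shell
# Sketch only: missing_tools is a hypothetical helper, not part of Lazypipe.
# It prints the name of every argument that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Usage (tool list is an example, adjust to your installation):
# missing_tools bwa fastp megahit minimap2 samtools seqkit taxonkit snakemake
```

An empty output means all listed tools were found.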
Installing dependencies manually
Download and unpack dependencies listed in Table 1. Then copy or link these executables to your ~/bin folder. For example:
wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
tar -xjvf minimap2-2.24_x64-linux.tar.bz2
cp minimap2-2.24_x64-linux/minimap2 ~/bin/
Note that Snakemake requires conda for installation (for details see https://snakemake.readthedocs.io/):
conda create -c bioconda -n snakemake snakemake-minimal
conda activate snakemake
Table 1: Lazypipe dependencies. Tools in square brackets are not required for basic Lazypipe runs. When installed, they provide additional options and functionality.
Installing Perl modules
Install modules to local-lib ~/perl5:
cpan --local-lib=~/perl5 File::Basename File::Temp Getopt::Long YAML::Tiny
export PERL5LIB=~/perl5/lib/perl5:${PERL5LIB}
Installing R libraries
Open the R console and type:
install.packages( c("reshape","openxlsx") );
Installing reference databases
Download and install reference databases using Table 2 and the following instructions.
Start by installing NCBI Taxonomy. In config.yaml set the local path to taxonomy. Then install by running:
perl perl/install_db.pl --db taxonomy
Running 1st round annotations with SANS or Minimap2 and 2nd round annotations with BLASTN (recommended):
- SANS: no databases required
- Minimap2: in config.yaml set local path to minimap_db. Then download and unpack the latest NCBI NT abv minimap database to that location.
- BLASTN: in config.yaml set local path to blastn_vi_db. Then install by running: perl perl/install_db.pl --db blastn_vi
Running 1st round annotations with BLASTP or Centrifuge:
- BLASTP: in config.yaml set local path to blastp_db. Then download your preferred BLAST database to that location.
- Centrifuge: in your config.yaml set local path to centrifuge_db. Then download and unpack NCBI NT habv centrifuge index to that location.
Running 2nd round annotations for bacteriophages:
- in config.yaml set local paths to minimap_db_phages and minimap_db_phages_metadata. Then download Gut Phage Database (GPD_sequences.fa and GPD_metadata.tsv) to these locations.
- index the sequence database by running:
minimap2 -t 4 -x asm20 -d GPD_sequences.fa.mmi GPD_sequences.fa
URL | Local path (config.yaml) | Installation | Description |
---|---|---|---|
blast/db/ | blastp_db | See NCBI manual | NCBI BLAST nr or similar |
ref_viruses_rep_genomes.tar.gz | blastn_vi_db | perl/install_db.pl --db blastn_vi | RefSeq viruses representative genomes |
blast_gb_vi_2023_01_01.tar.gz | blastn_vi_db | perl/install_db.pl --db blastn_vi | NCBI GenBank Viruses Complete genomes |
centrifuge_db_url | centrifuge_db | download and unpack data/nt_2021_12_habv_cent.tar.gz | centrifuge index with Hsapiens_GRCh38p13 assembly + bacteria + archaea + virus sequences from NCBI nt database |
minimap_db_url | minimap_db | download and unpack data/YYYY_MM_DD.nt_abv.tar.gz | Archaeal, bacterial and virus sequences from NCBI nt database |
taxdump.tar.gz | taxonomy | perl/install_db.pl --db taxonomy | NCBI Taxonomy database dump files |
GPD_sequences.fa.gz | minimap_db_phages | download and index | Gut phage database |
GPD_metadata.tsv | minimap_db_phages_metadata | download | Gut phage database |
Table 2. Databases used by Lazypipe.
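After editing config.yaml, you may want to confirm that the configured paths actually exist. The sketch below assumes that config.yaml stores each database path as a flat top-level "key: value" line; that layout is an assumption, and yaml_get and check_db_paths are hypothetical helpers, not part of Lazypipe. BLAST databases are usually referenced by a file prefix rather than a file, so the existence test is only meaningful for keys that point to a file or directory.

```shell
# Sketch only: both helpers are hypothetical and assume flat "key: value" lines.

# yaml_get KEY FILE: print the value of the first top-level line "KEY: value".
yaml_get() {
    awk -v key="$1" '
        $0 ~ ("^" key ":") {
            sub("^" key ":[ \t]*", "")   # drop the key and the separator
            sub(/[ \t]+$/, "")            # drop trailing whitespace
            print
            exit
        }' "$2"
}

# check_db_paths FILE KEY...: report whether each configured path exists.
check_db_paths() {
    cfg="$1"; shift
    for key in "$@"; do
        db_path=$(yaml_get "$key" "$cfg")
        if [ -n "$db_path" ] && [ -e "$db_path" ]; then
            echo "$key: found"
        else
            echo "$key: missing"
        fi
    done
}

# Usage: check_db_paths config.yaml taxonomy minimap_db
```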
Test Perl and Snakemake interfaces
Perl interface
perl lazypipe.pl
Snakemake interface
snakemake -np all
Running Lazypipe with lazypipe.pl
lazypipe.pl runs your metagenomic analysis step by step. For example, to run preprocessing and assembly, type:
perl lazypipe.pl -1 data/sample_R1.fq.gz --pipe pre,ass -v
lazypipe.pl command-line options:
Short | Long | Value | Default | Description |
---|---|---|---|---|
-1 | --read1 | file | | Paired-end reads, fastq with forward reads (can be gzipped) |
-2 | --read2 | file | guess from --read1 | Paired-end reads, fastq with reverse reads |
| --se | | false | Input reads are SE-reads. Any --read2 file will be ignored |
-r | --res | dir | results | Results will be printed to res-dir/sample-dir/ |
-s | --sample | str | --read1 prefix | Results will be printed to res-dir/sample-dir/ |
| --logs | dir | logs | Logs will be printed to logs-dir/sample-dir/ |
-t | --numth | int | 8 | Number of threads |
| --pre | str | fastp | Read preprocessing: fastp, trimm or none |
| --ass | str | megahit | Assembler: megahit or spades |
| --ann | str | sans | Homology search used for contig annotation: blastp, sans, centrifuge or minimap |
| --hostgen | file | | *.fna file containing host genome. Filtering is turned on by --hostgen file -p flt |
| --hgtaxid | taxid | | NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon |
-w | --weights | str | bitscore | Model for abundance estimation: taxacount, bitscore or bitscore2 |
| --config | file | config.yaml | Configuration file for default options |
-v | | | false | Verbose mode |
| --clean | | false | Delete intermediate files after each step |
-p | --pipe | str | main | Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack |

Steps accepted by --pipe (short and long names are interchangeable):
Step | Description |
---|---|
pre, preprocess | Preprocess reads, i.e. filter low quality reads |
flt, filter | Filter reads mapping to host genome using --hostgen file |
ass, assemble | Assemble reads to contigs |
rea, realign | Realign reads to contigs |
ann, annotate | Annotate contigs with blastp/sans/centrifuge/minimap2 against blastp_db/UniProt/centrifuge_db/minimap_db. |
blastv | Annotate viral contigs with blastn against custom virus database (blastn_vi_db). |
blastu | Annotate unmapped contigs with blastn against custom virus database (blastn_vi_db). |
annph | Annotate unmapped contigs with minimap against local bacteriophage database (minimap_db_phages) |
rep, report | Create reports: abundance/annotation tables + Krona graph + sort contigs by taxa |
sta, stats | Create assembly stats + QC plots |
pack | Pack results into a tarball. Tarball will be created in the root directory of --res dir. |
clean | Clean up all intermediate and temporary files. |
main | Run main steps: pre,flt,ass,rea,ann,rep,sta,pack,clean [default] |
all | Run all steps |
Default options and additional settings are defined in the config.yaml file.
Note that command-line options take precedence over options in the config.yaml file.
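The precedence rule can be illustrated with a short sketch. resolve_opt is a hypothetical helper that only mimics the rule stated above; it is not how lazypipe.pl is implemented:

```shell
# Sketch only: resolve_opt is hypothetical and merely mimics the stated rule.
# resolve_opt CLI_VALUE CONFIG_VALUE: a non-empty command-line value wins,
# otherwise the value from config.yaml is used.
resolve_opt() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Usage: resolve_opt "spades" "megahit"   # command line gave --ass spades
#        resolve_opt ""       "megahit"   # no --ass on the command line
```

So with ass set to megahit in config.yaml, passing --ass spades selects SPAdes for that run only.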
Additional options in config.yaml:
Option | Value | Description |
---|---|---|
GENERAL PARAMETERS | | |
R_call | str | Rscript or similar for calling R |
hostgen | file | Path to host genome in fasta/fasta.gz format. Set to 0 to switch off hostgen filtering. |
hostgen_taxid | num | NCBI taxon id for the host genome taxon. When defined, hostgen filtered reads will be assigned to this taxon |
hostgen_flt_th | num | Minimum alignment score for filtering host genome reads |
min_gene_length | num | Minimum ORF sequence length for reporting/mapping |
min_sans_bits | num | Minimum alignment score for mapping ORFs with SANS |
min_blastp_bits | num | Minimum alignment score for mapping ORFs with BLASTP |
min_cent_bits | num | Minimum alignment score for mapping contigs with Centrifuge |
min_minimap_DPpeak_score | num | Minimum DP alignment score for mapping contigs with Minimap2 |
realign_read_th | num | Minimum alignment score for mapping reads to contigs with BWA MEM |
tail | percent | Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions |
cont_score_tail | percent | Remove taxa from contig that correspond to this percentile. Reduces noise in abundance estimation. |
trimm_par | str | Trimmomatic parameters. NOTE: please ensure that the $TM environment variable is pointing to the Trimmomatic installation root |
fastp_par | str | Fastp parameters |
res | dir | Results directory root. Results will be printed to res-dir/sample-dir/ |
logs | dir | Logs directory root. Logs will be printed to logs-dir/sample-dir/ |
tmpdir | dir | Temporary directory root. Each run will create a designated temporary directory at this location |
keep_tmpdir | 0/1 | Set to 1 to keep temporary directories |
DATABASES | | |
blastp_db | $BLASTDB/nr | Path to local NCBI blastp nr database. |
blastn_vi_db | path | Path to local blastn nucleotide virus database. |
blastn_vi_db_url | url | URL to blastn_vi_db resource |
centrifuge_db | path | Path to local Centrifuge nucleotide database: e.g. h+a+b+v. |
minimap_db | path | Path to local minimap2 nucleotide database. |
minimap_db_phage | path | Path to local minimap2 bacteriophage database. |
minimap_db_phage_meta | path | Path to minimap_db_phage metainfo. Expecting TSV file with header line and first column with minimap_db_phage sequence ids. |
taxonomy | dir | Path to local NCBI taxonomy database. Database will be installed on demand |
taxonomy_update | 0/1 | Set to 1 to update NCBI taxonomy db |
taxonomy_update_time | num | NCBI taxonomy update frequency in days |
taxonomy_url | str | URL to NCBI taxonomy (taxdump.tar.gz) |
SNAKEMAKE PARAMETERS | | |
datain | input fastq files | List of input fastq libraries ordered by sample name, e.g. "M15: M15_R1.fastq". Note: only list forward reads, reverse reads are guessed by substituting _R1 suffix with _R2. |
blastn_contigs_vi | 0/1 | Similar to --pipe blastv. Annotate viral contigs with blastn against blastn_vi_db. |
blastn_contigs_un | 0/1 | Similar to --pipe blastu. Annotate unmapped contigs with blastn against blastn_vi_db. |
threads_max | num | Maximum number of threads to use |
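Because only forward reads are listed and the reverse-read file is guessed from the _R1 suffix, it is worth checking which file name that guess yields for your naming scheme. The sketch below approximates the substitution by replacing the last occurrence of _R1 in the path; guess_read2 is a hypothetical helper, and Lazypipe's own rule may differ in detail:

```shell
# Sketch only: guess_read2 is hypothetical and approximates the _R1 -> _R2 rule
# by replacing the LAST occurrence of "_R1" in the given path.
guess_read2() {
    printf '%s\n' "$1" | sed 's/\(.*\)_R1/\1_R2/'
}

# Usage: ls -l "$(guess_read2 data/samples/M15small_R1.fastq)"
```

If your reverse-read files do not follow this pattern, pass them explicitly with --read2 when using lazypipe.pl.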
Example 1: analyzing sample data with lazypipe.pl
For this example, we will use data/samples/M15small_R*.fastq (PE Illumina reads from a mink feces environmental sample).
Run main steps with default options
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main -t 8
Run preprocessing with Trimmomatic
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pre --pre trimm -t 8 -v
Filter host reads with Neovison vison genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
mkdir -p $data/hostgen
mv GCA_900108605.1_NNQGG.v01_genomic.fna.gz $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v
Run assembling with SPAdes + realign reads to assembly
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ass,rea --ass spades -t 8
Run annotation with minimap2 + update reports
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ann,rep --ann minimap -t 8 -v
Confirm virus contigs with local blastn
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastv -t 8 -v
Search for viruses in unmapped contigs with local blastn
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastu -t 8 -v
Pack results to *.tar.gz
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pack -v
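To process several samples in one go, the same commands can be generated in a loop. The sketch below only prints the commands so they can be reviewed before execution. It assumes forward-read files end in _R1.fastq, and it derives the sample name from the file name prefix (so M15small_R1.fastq gives M15small rather than the M15 label used above); sample_id and print_lazypipe_cmds are hypothetical helpers.

```shell
# Sketch only: both helpers are hypothetical, not part of Lazypipe.

# sample_id FILE: strip the directory and everything from "_R1" onward.
sample_id() {
    basename "$1" | sed 's/_R1.*$//'
}

# print_lazypipe_cmds DIR: print one lazypipe.pl command per *_R1.fastq file.
print_lazypipe_cmds() {
    for r1 in "$1"/*_R1.fastq; do
        [ -e "$r1" ] || continue   # skip the unexpanded pattern if no files match
        echo "perl lazypipe.pl -1 $r1 -s $(sample_id "$r1") -p main -t 8"
    done
}

# Usage: print_lazypipe_cmds data/samples          # review the commands
#        print_lazypipe_cmds data/samples | sh     # then run them
```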
Example 1: generated reports
By default, all results are printed to ./res-dir/sample-dir/, in this case to ./results/M15/:
Assembled contigs and predicted ORFs
file/dir | description |
---|---|
contigs | contigs sorted by taxa |
contigs.fa | contigs in a single fasta file |
contigs_un.fa | contigs with no annotation by the main homology search |
contigs_vi.fa | contigs annotated as virus sequences by the main homology search |
ORFs.gtf | predicted ORFs in GTF2.2 format |
ORFs.aa.fa | predicted ORFs as aa sequences |
ORFs.nt.fa | predicted ORFs as nt sequences |
scaffolds.fa | scaffolds, if available |
Abundance tables
Figure 2. abund_table.xlsx
Spreadsheets with taxon abundances are printed to abund_table.xlsx.
Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.
For each domain, abundances are displayed at three taxonomic levels: species, genus and family.
For raw abundance data see abund_table.tsv.
Columns in abund_table.xlsx
column | description |
---|---|
readn | read pairs assigned to this taxon |
readn_pc | percentage of read pairs assigned to this taxon |
csum | cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa) |
csumq | confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable) |
contign | contigs assigned to this taxon |
species | species name (NCBI taxonomy) |
species_id | species taxid (NCBI taxonomy) |
genus | genus name |
genus_id | genus taxid |
family | family name |
family_id | family taxid |
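For scripted post-processing, the raw table can be filtered on the confidence score. The sketch below assumes that abund_table.tsv is tab-separated with a header row that uses the same column names as listed above; this layout is an assumption, so check your own file first. reliable_taxa is a hypothetical helper that locates the csumq column by name and keeps rows with csumq equal to 1.

```shell
# Sketch only: reliable_taxa is hypothetical and ASSUMES a tab-separated file
# whose header row contains a column named "csumq".
reliable_taxa() {
    awk -F '\t' '
        NR == 1 {
            # find the csumq column by its header name
            for (i = 1; i <= NF; i++) if ($i == "csumq") col = i
            print
            next
        }
        col && $col == 1   # keep rows flagged as reliable
    ' "$1"
}

# Usage: reliable_taxa results/M15/abund_table.tsv
```

If no csumq column is found, only the header line is printed.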
Annotation tables
Figure 3. contigs.annot.xlsx
Spreadsheets with contig annotations are printed to contigs.annot.xlsx.
Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.
Columns displayed depend on the applied homology search (sans/blastp/minimap2).
Running -p blastv will also print blastn annotation for contigs_vi.fa to contigs_vi.annot.xlsx.
Running -p blastu will also print blastn annotation for contigs_un.fa to contigs_un.annot.xlsx.
For raw annotation data see contigs[_un|_vi].annot.tsv.
Key columns in contigs[_un|_vi].annot.xlsx:
column | description |
---|---|
contig | contig id |
coverage | contig coverage |
length | contig length |
ORF | orf description in start-end:strand format |
sseqid | subject sequence id |
bitscore | alignment score |
qcov | query coverage |
scov | subject coverage |
qlen | query sequence length |
slen | subject sequence length |
pide | percent identity |
lali | alignment length |
desc | subject description |
staxid | assigned taxid |
species | assigned species |
genus | assigned genus |
family | assigned family |
Krona graph
Figure 4. krona_graph.html
Estimated taxon abundances are also displayed as an interactive Krona graph: krona_graph.html.
Quality control plots
Figure 5. Read survival plots
Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.
file | description |
---|---|
qc.read1.jpeg | length hist for forward reads |
qc.read2.jpeg | length hist for reverse reads |
qc.contigs.jpeg | length hist for contigs |
qc.readsurv.jpeg | read survival plots |
Running Lazypipe with Snakemake
Example 2: analyzing sample data
Snakemake works by declaring the end file you wish to produce.
Start by listing your input fastq files under the datain key in the config.yaml file. Prepend each file with its sample id.
For this example, we will use data/samples/M15small. In your config.yaml type:
datain:
  M15: data/samples/M15small_R1.fastq
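With many samples, the datain block can be generated rather than typed. The sketch below prints a datain mapping for every *_R1.fastq file in a directory, using the file name prefix as the sample id. make_datain is a hypothetical helper, and the two-space indentation under datain is an assumption about the expected YAML layout.

```shell
# Sketch only: make_datain is hypothetical; the indentation under "datain:"
# is an assumption about how config.yaml nests sample entries.
make_datain() {
    echo "datain:"
    for r1 in "$1"/*_R1.fastq; do
        [ -e "$r1" ] || continue   # skip the unexpanded pattern if no files match
        id=$(basename "$r1" | sed 's/_R1.*$//')
        echo "  $id: $r1"
    done
}

# Usage: make_datain data/samples   # paste the output into config.yaml
```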
Run main steps with default options
snakemake --cores 8 results/M15.tar.gz -p
Run preprocessing with Trimmomatic. Overwrite any trimmed reads produced by previous runs with --force:
snakemake --config pre="trimm" --cores 8 results/M15/trimmed_paired1.fq.gz --force -p
Run assembly with SPAdes. Overwrite any contigs produced by previous runs with --force:
snakemake --config ass="spades" --cores 8 results/M15/contigs.fa --force -p
Redo annotation with minimap2:
snakemake --config ann="minimap" --cores 16 results/M15.tar.gz --force -p
Confirm viral contigs with local blastn:
snakemake --config blastv=1 --cores 16 results/M15/contigs_vi.annot.xlsx -p
Search for viruses in unmapped contigs with local blastn
snakemake --config blastu=1 --cores 16 results/M15/contigs_un.annot.xlsx -p
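Since every Snakemake call names the file to produce, the target paths used in the examples above can be collected in one place. The sketch below maps a sample id and a label to the corresponding target. snakemake_target and its labels (main, contigs, blastv, blastu) are hypothetical, and the paths assume the default results directory.

```shell
# Sketch only: snakemake_target and its labels are hypothetical; paths mirror
# the examples above and assume the default "results" directory.
snakemake_target() {
    sample="$1"
    case "$2" in
        main)    echo "results/$sample.tar.gz" ;;
        contigs) echo "results/$sample/contigs.fa" ;;
        blastv)  echo "results/$sample/contigs_vi.annot.xlsx" ;;
        blastu)  echo "results/$sample/contigs_un.annot.xlsx" ;;
        *)       echo "unknown target label: $2" >&2 ;;
    esac
}

# Usage: snakemake --cores 8 "$(snakemake_target M15 main)" -p
```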
Retrieving reads for a contig or taxid
Start by unzipping your source fastq files:
gunzip -k results/M15/trimmed_paired*_hostflt.fq.gz
Retrieve all reads mapped to contig k141.100 in sample M15
bin/retrieve_reads -c k141.100 -r results/M15 -v
Retrieve all reads mapped to Mamastrovirus (taxid 1239574) in sample M15:
bin/retrieve_reads -t 1239574 -r results/M15 -v
Citing Lazypipe
- Ilya Plyusnin, Olli Vapalahti, Tarja Sironen, Ravi Kant, Teemu Smura. (2023) Enhanced Viral Metagenomics with Lazypipe 2. Viruses 15(2), 431, https://doi.org/10.3390/v15020431
- Ilya Plyusnin, Ravi Kant, Anne J. Jaaskelainen, Tarja Sironen, Liisa Holm, Olli Vapalahti, Teemu Smura. (2020) Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types. Virus Evolution, veaa091, https://doi.org/10.1093/ve/veaa091
Contact
Project website: https://www.helsinki.fi/en/projects/lazypipe
Contact email: grp-lazypipe@helsinki.fi