Lazypipe User Guide
Running Lazypipe v3.0 on a Linux cluster
About Lazypipe
Lazypipe is a bioinformatic pipeline for analyzing virus and bacteria metagenomics from NGS data.
Figure 1. Lazypipe workflow
Lazypipe supports:
- fastq preprocessing
- de novo assembly
- taxonomic binning
- taxonomic profiling
- reporting
- mapped contigs sorted by taxa
- virus contigs
- unmapped contigs
- contig annotations (tsv and excel)
- taxon abundances (tsv and excel)
- quality control plots
Running Lazypipe on CSC
Lazypipe can be quickly assessed using a preinstalled module at the Finnish Center of Scientific Computing.
Installing Lazypipe
Setting up directories
Create a root directory $data and subdirectories for storing reference databases, NCBI taxonomy, host genomes and Lazypipe results (change /my/data/path according to your preferences):
data=/my/data/path
mkdir -p $data $data/databases $data/taxonomy $data/hostgen $data/results
For convenience, add an environment variable $data pointing to your root directory. To do so, locate the .bashrc file in your home directory and add this line to it:
export data=/my/data/path
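The directory layout above can also be created and checked with a small helper. This is a minimal sketch: the function name setup_lazypipe_dirs is hypothetical and not part of Lazypipe.

```shell
# Hypothetical helper (not part of Lazypipe): create the directory layout
# described above under a given root and check that it is usable.
setup_lazypipe_dirs() {
    root="$1"
    for sub in databases taxonomy hostgen results; do
        mkdir -p "$root/$sub" || return 1
    done
    # Verify that every subdirectory exists and is writable
    for sub in databases taxonomy hostgen results; do
        if [ ! -w "$root/$sub" ]; then
            echo "not writable: $root/$sub" >&2
            return 1
        fi
    done
}

# Example: setup_lazypipe_dirs "$data"
```

Running the helper twice is harmless, since mkdir -p leaves existing directories untouched.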
Cloning the repository
git clone https://plyusnin@bitbucket.org/plyusnin/lazypipe.git
cd lazypipe
Installing dependencies
Installing dependencies with Conda
We recommend installing BLAST in a separate Conda environment named blast:
conda create -n blast -c bioconda blast
All other dependencies can be installed in an environment named lazypipe:
conda create -n lazypipe -c bioconda -c eclarke bwa csvtk fastp krona megahit mga minimap2 samtools seqkit spades taxonkit trimmomatic numpy scipy requests
Mac users installing on M1/M2 ARM64 architecture: before installing the bio-packages, configure Conda with conda config --add subdirs osx-64. You may also need to install the MGA binary manually (see Table 1).
To activate all installed dependencies type:
conda activate blast
conda activate --stack lazypipe
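After activating both environments it can be useful to confirm that the tools are actually visible on the PATH. The sketch below uses a hypothetical helper, missing_tools, and the binary names in the example are assumptions about what the Conda packages install:

```shell
# Hypothetical helper: print every name from the argument list that is not
# found as an executable on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (binary names are assumptions, adjust to your installation):
# missing_tools blastn bwa csvtk fastp megahit minimap2 samtools seqkit taxonkit
```

An empty output means that every listed tool was found.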
Set taxonomy database location for KronaGraph:
rm -rf $CONDA_PREFIX/opt/krona/taxonomy
ln -s $data/taxonomy $CONDA_PREFIX/opt/krona/taxonomy
Set the environment variable $TM to point to the Trimmomatic directory:
export TM=$CONDA_PREFIX/share/trimmomatic
Download PANNZER (version 02/2022 or later) and link runsanspanz.py as an executable in ~/bin (which should be on your PATH):
wget http://ekhidna2.biocenter.helsinki.fi/sanspanz/SANSPANZ.3.tar.gz
tar -zxvf SANSPANZ.3.tar.gz
echo '#!'$(which python) 1> SANSPANZ.3/runsanspanz.ex.py
cat SANSPANZ.3/runsanspanz.py >> SANSPANZ.3/runsanspanz.ex.py
chmod 755 SANSPANZ.3/runsanspanz.ex.py
ln -sf $(pwd)/SANSPANZ.3/runsanspanz.ex.py ~/bin/runsanspanz.py
Installing dependencies manually
Download and unpack dependencies listed in Table 1. Then copy or link these executables to your ~/bin folder. For example:
wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
tar -xjvf minimap2-2.24_x64-linux.tar.bz2
cp minimap2-2.24_x64-linux/minimap2 ~/bin/
Table 1: Lazypipe dependencies. Tools in square brackets are not required for basic Lazypipe runs; when installed, they provide additional options and functionality.
Installing Perl modules
Install modules to local-lib ~/perl5:
cpan --local-lib=~/perl5 File::Basename File::Temp Getopt::Long YAML::Tiny
export PERL5LIB=~/perl5/lib/perl5:${PERL5LIB}
Installing R libraries
Open an R console and type:
install.packages( c("reshape","openxlsx") );
Installing reference databases
Install NCBI Taxonomy to the default location ($data/taxonomy) by running:
perl perl/install_db.pl --db taxonomy
Download and unpack reference databases for the 1st and 2nd round annotations. You can choose to install RefSeq/UniRef100 databases (Table 2), NT/UniRef100 databases (Table 3) or Viral databases (Table 4). RefSeq/UniRef100 databases are suited for annotating established taxa with small disk and time overhead. NT/UniRef100 databases have better coverage for novel taxa and may produce more accurate annotations, but their disk and time overhead is also higher. Viral databases are small databases intended for annotating only viral taxa with minimal disk and time overhead.
Use the install_db.pl script to install databases from the URLs listed in config.yaml.
To install RefSeq/UniRef100 databases to the default path ($data/databases) call:
perl perl/install_db.pl --db minimap.refseq.abv -v
perl perl/install_db.pl --db minimap.refseq.vi -v
perl perl/install_db.pl --db blastn.refseq.ab -v
perl perl/install_db.pl --db blastn.refseq.vi -v
perl perl/install_db.pl --db blastp.vi -v
URL | Size *.gz (GB) | Description |
---|---|---|
minimap.refseq.abv.release221.tar.gz | 5.7 | Minimap2 index for RefSeq archaea, bacteria and viruses |
minimap.refseq.vi.release221.tar.gz | 0.16 | Minimap2 index for RefSeq viruses |
blastn.refseq.ab.release221.tar.gz | 4.4 | BLASTN index for RefSeq archaea and bacteria |
blastn.refseq.vi.release221.tar.gz | 0.14 | BLASTN index for RefSeq viruses |
blastp.uniref100.vi.2024_01_24.tar.gz | 0.48 | BLASTP index for UniRef100 viruses |
Table 2: RefSeq/UniRef100 databases
URL | Size *.gz (GB) | Description |
---|---|---|
minimap.nt.abv.tar.gz | 77 | Minimap2 index for NCBI NT archaea, bacteria and viruses |
blastn.nt.ab.tar.gz | 44 | BLASTN index for NCBI NT archaea and bacteria |
blastn.nt.vi.tar.gz | 9.7 | BLASTN index for NCBI NT viruses |
blastp.uniref100.ab.tar.gz | 33 | BLASTP index for UniRef100 archaea and bacteria |
blastp.uniref100.vi.release.tar.gz | 0.48 | BLASTP index for UniRef100 viruses |
Table 3: NT/UniRef100 databases
URL | Size *.gz | Description |
---|---|---|
minimap.refseq.vi.tar.gz | 160 MB | Minimap2 index for RefSeq viruses |
blastn.refseq.vi.tar.gz | 130 MB | BLASTN index for RefSeq viruses |
blastp.uniref100.vi.tar.gz | 480 MB | BLASTP index for UniRef100 viruses |
Table 4: Viral databases
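To compare the download footprint of the three database sets, the compressed sizes from Tables 2-4 can be summed. The helper name sum_sizes is hypothetical; the numbers are copied from the tables, with Table 4 converted from MB to GB:

```shell
# Sum a list of sizes given as arguments and print the total with two decimals.
sum_sizes() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Compressed download sizes in GB, copied from Tables 2-4
echo "RefSeq/UniRef100: $(sum_sizes 5.7 0.16 4.4 0.14 0.48) GB"
echo "NT/UniRef100:     $(sum_sizes 77 44 9.7 33 0.48) GB"
echo "Viral:            $(sum_sizes 0.16 0.13 0.48) GB"
```

This gives roughly 10.9 GB, 164 GB and 0.8 GB of compressed archives respectively. The unpacked databases need additional space.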
Open config.yaml and check that the database paths match the location and version of the installed databases. Edit these lines in config.yaml:
ann1.databases:
  minimap.nt: "$data/databases/nt.abv.2024_01_01.fa"
  minimap.refseq: "$data/databases/refseq.abv.release221.fa"
  ..
ann2.databases:
  blastn.ab.nt: "$data/databases/blastn.nt.ab.2024_01_01"
  blastn.vi.nt: "$data/databases/blastn.nt.vi.2024_01_01"
  ..
If you wish to annotate bacteriophages, specify minimap.ph, blastn.ph and blastp.ph in your config.yaml. Use the blastn/blastp virus databases or your own custom bacteriophage databases:
ann1.databases:
  minimap.ph: $data/databases/minimap.GPD.ph.fasta
  blastn.ph: $data/databases/blastn.nt.vi.2024_01_01
  blastp.ph: $data/databases/blastp.uniref100.vi.2024_01_24
ann2.databases:
  blastn.ph: $data/databases/blastn.nt.vi.2024_01_24
  blastp.ph: $data/databases/blastp.uniref100.vi.2024_01_24
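A mismatch between config.yaml and the installed files is easy to miss, so a quick check can help. The sketch below is not part of Lazypipe: the helper names are hypothetical, and it assumes that each database entry sits on one line in the form key: $data/... (optionally quoted), as in the snippets above.

```shell
# Hypothetical helpers (not part of Lazypipe). They assume database entries are
# written one per line as `key: $data/...` or `key: "$data/..."`.

# Print "key<TAB>path" for each such entry, with $data replaced by the value
# of the shell variable $data.
list_db_paths() {
    awk -v root="$data" '
        /^[ \t]*[A-Za-z0-9_.]+:[ \t]*"?\$data/ {
            key = $1; sub(/:$/, "", key)
            val = $2; gsub(/"/, "", val)
            sub(/^\$data/, root, val)
            print key "\t" val
        }' "$1"
}

# Print the keys whose path matches no file. BLAST databases are given as a
# prefix shared by several files, so any file starting with the path counts.
missing_db_paths() {
    list_db_paths "$1" | while IFS="$(printf '\t')" read -r key path; do
        ls -d "$path"* >/dev/null 2>&1 || echo "$key"
    done
}

# Example: missing_db_paths config.yaml
```

This is a line-based match rather than a real YAML parser, so it only covers the simple layout shown here.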
Naming convention for reference databases
Reference sequence databases are defined in config.yaml as key-value pairs under ann1.databases and ann2.databases. Each key is a string naming the SearchTool and TargetTaxa, and each value is the path to the database. For the 1st round, annotation keys are named SearchTool[.dbid], and for the 2nd round SearchTool.TargetTaxa[.dbid]. In both rounds SearchTool can be sans, minimap, blastn or blastp. TargetTaxa can be abv (i.e. Archaea, Bacteria and Viruses), ab (i.e. Archaea and Bacteria), vi (Viruses), ph (Bacteriophages) or un (unmapped). You can use an optional dbid string to differentiate between similar databases. For any annotation step you can use any database, default or custom. For BLASTN/BLASTP use BLAST indices. For minimap2 you can use .fasta* or .mmi files; note that these must have an accompanying .acc2taxid tsv-map (see the default minimap2 databases for an example).
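To make the 2nd round naming convention concrete, the sketch below splits a key of the form SearchTool.TargetTaxa[.dbid] into its parts. The function parse_ann2_key is hypothetical and only illustrates the convention; it accepts the targets ab, vi, ph and un and is not how Lazypipe itself parses keys.

```shell
# Hypothetical helper: split a 2nd round key such as blastn.vi.refseq into
# search tool, target taxa and optional dbid. Returns non-zero for keys that
# do not follow the SearchTool.TargetTaxa[.dbid] convention.
parse_ann2_key() {
    key="$1"
    search="${key%%.*}"          # text before the first dot
    rest="${key#*.}"             # text after the first dot
    if [ "$rest" = "$key" ]; then
        return 1                 # no dot at all, so no target
    fi
    target="${rest%%.*}"
    dbid=""
    if [ "$rest" != "$target" ]; then
        dbid="${rest#*.}"        # everything after the target
    fi
    case "$search" in
        sans|minimap|blastn|blastp) ;;
        *) return 1 ;;
    esac
    case "$target" in
        ab|vi|ph|un) ;;
        *) return 1 ;;
    esac
    echo "search=$search target=$target dbid=$dbid"
}

parse_ann2_key blastn.vi.refseq   # prints: search=blastn target=vi dbid=refseq
```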
Running Lazypipe
Example 1
In this example we will use a sample PE library that is included with the repository (data/samples/M15small_R*.fastq).
Preprocess reads with fastp:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe pre -t 8 -v
Download Neovison vison genome and use it to filter host reads. Note that running host filtering with a newly downloaded genome will take some time to index the genome:
mkdir -p $data/hostgen
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -P $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v
Run assembly with MEGAHIT and realign reads to the assembly:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ass,rea --ass megahit -t 8 -v
Run 1st round annotation with Minimap2 against your local minimap.refseq database:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v
Run 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append the results to the Minimap2 annotations from the previous step:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v
Now run a more complex 1st round annotation. Start by mapping contigs with Minimap2, then map the unmapped contigs with SANSparallel, then map the remaining unmapped contigs with BLASTN against the blastn.vi database. Note that without the --append flag this will overwrite existing 1st round annotations:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq,sans,blastn.vi -t 8 -v
Run 2nd round annotation. In the second round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on the labeling from the 1st round. Local databases for the 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN and BLASTP against local viral databases type:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.vi.refseq,blastp.vi -t 8 -v
Run 2nd round annotation for bacteria with BLASTN. Append results to BLASTN and BLASTP annotations from the previous step:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq --append -t 8 -v
You can also combine these runs in any order. For example:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq,blastn.vi.refseq,blastp.vi -t 8 -v
The most common combinations of 1st and 2nd round annotations can be saved to config.yaml in the ann.strategies section. Each annotation strategy is saved as a key-value pair. Several annotation strategies are predefined:
- abv.fast -- run only the 1st round with Minimap2 against RefSeq.abv
- abv.nt -- 1st round: Minimap2 against NT.abv, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab
- abv.refseq -- 1st round: Minimap2 against RefSeq.abv, 2nd round: BLASTN viral reads against RefSeq.vi and archaeal+bacterial reads against RefSeq.ab
- abv.extend -- 1st round: Minimap2 against NT.abv + SANSparallel unmapped reads against TrEMBL, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab, additionally BLASTP viral reads against UniRef100.vi and archaeal+bacterial reads against UniRef100.ab
- vi.nt -- 1st round: Minimap2 against NT.vi, 2nd round: BLASTN viral reads against NT.vi
- vi.refseq -- 1st round: Minimap2 against RefSeq.vi, 2nd round: BLASTN viral reads against RefSeq.vi
Generate reports based on created annotations:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe rep -t 8 -v
Generate assembly stats, pack for sharing and remove temporary files:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p stats,pack,clean -t 8 -v
For convenience, the routine analysis steps (pre,flt,ass,rea,ann1,ann2,rep,sta,pack,clean) can be called with the main tag. To run the main analysis with the abv.refseq annotation strategy type:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p main --anns abv.refseq -t 8 -v
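When several libraries need the same analysis, the main command can be generated once per sample. The following is a minimal sketch using a hypothetical helper, print_main_runs, which assumes that forward-read files follow the *_R1.fastq naming used in this example and relies on --read2 being guessed from --read1. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Hypothetical helper: print one Lazypipe command per forward-read file found
# in a directory. Files are assumed to be named *_R1.fastq.
print_main_runs() {
    dir="$1"; strategy="$2"
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue   # skip when the pattern matched nothing
        echo "perl lazypipe.pl -1 $r1 -p main --anns $strategy -t 8 -v"
    done
}

# Review the commands first, then run them by piping to sh:
# print_main_runs data/samples abv.refseq | sh
```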
Example 1: generated reports
Results are output to $res/$sample. The default value for $res is set in config.yaml, and the default value for $sample is derived from the name of the input reads. Both can be changed at runtime with --res mydir --sample mysample.
In Example 1 results were output to $data/results/M15small.
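The way M15small_R1.fastq becomes the sample name M15small can be illustrated by trimming the file name at the first "_", which is the behavior described for the trimm_sample_name option in config.yaml. The function sample_from_read1 below is a hypothetical illustration, and Lazypipe's own logic may differ in details:

```shell
# Hypothetical illustration: derive a sample name from a read file by taking
# the file name up to the first "_".
sample_from_read1() {
    base=$(basename "$1")
    echo "${base%%_*}"
}

sample_from_read1 data/samples/M15small_R1.fastq   # prints M15small
```

A file name without any "_" is returned unchanged, including its extension, so in that case passing --sample explicitly is the safer choice.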
Assembled contigs and predicted ORFs
File or Directory | Description |
---|---|
contigs | contigs sorted by taxa |
contigs.fa | contigs in a single fasta file |
contigs.ann1.ab.fa | archaeal+bacterial contigs (based on 1st round annotation) |
contigs.ann1.ph.fa | bacteriophage contigs (1st round) |
contigs.ann1.vi.fa | viral contigs (1st round) |
contigs.ann1.un.fa | unmapped contigs (1st round) |
contigs.ann2.ab.fa | archaeal+bacterial contigs (2nd round) |
contigs.ann2.ph.fa | bacteriophage contigs (2nd round) |
contigs.ann2.vi.fa | viral contigs (2nd round) |
contigs.ann2.un.fa | unmapped contigs (2nd round) |
contigs.orfs.aa.fa | predicted ORFs as aa sequences |
contigs.orfs.nt.fa | predicted ORFs as nt sequences |
scaffolds.fa | scaffolds, if available |
Table 5: Lazypipe results: contigs and ORFs.
Abundance tables
Figure 2. abund_table.xlsx
Spreadsheets with taxon abundances are printed to abund_table.xlsx.
Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.
For each domain, abundances are displayed at three taxonomic levels: species, genus and family.
For raw abundance data see abund_table.tsv.
column | description |
---|---|
readn | read pairs assigned to this taxon |
readn_pc | percentage of read pairs assigned to this taxon |
csum | cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa) |
csumq | confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable) |
contign | contigs assigned to this taxon |
species | species name (NCBI taxonomy) |
species_id | species taxid (NCBI taxonomy) |
genus | genus name |
genus_id | genus taxid |
family | family name |
family_id | family taxid |
Table 6: Columns in abund_table.xlsx
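The relationship between readn, readn_pc and csum can be shown with a small worked example. The sketch below is an illustration of the column definitions only, not Lazypipe's implementation: the helper abund_summary is hypothetical, the taxa and counts are made up, and percentages are taken relative to the reads listed in the input.

```shell
# Hypothetical illustration: read "taxon<TAB>readn" lines from stdin, sort by
# readn (most abundant first) and print taxon, readn, readn_pc and csum.
abund_summary() {
    sort -t "$(printf '\t')" -k2,2nr | awk -F '\t' '
        { name[NR] = $1; n[NR] = $2; total += $2 }
        END {
            if (total == 0) exit
            for (i = 1; i <= NR; i++) {
                cum += n[i]
                printf "%s\t%d\t%.1f\t%.1f\n", name[i], n[i], 100 * n[i] / total, 100 * cum / total
            }
        }'
}

# Made-up counts for three taxa
printf 'taxB\t30\ntaxA\t50\ntaxC\t20\n' | abund_summary
```

For these counts readn_pc is 50.0, 30.0 and 20.0, and csum grows to 50.0, 80.0 and 100.0, so low-abundance taxa end up with the highest csum values.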
Annotation tables
Figure 3. annot_table.xlsx
Spreadsheets with contig annotations are printed to contigs_annot.xlsx.
Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.
For raw annotation data see contigs_annot.tsv.
column | description |
---|---|
search | applied database search (e.g. blastn) |
db | applied database (e.g. UniRef100.vi) |
dbtype | nucl for nucleotide and prot for protein databases |
contig | contig id |
orf | orf description in start-end:strand format |
clen | contig length |
sseqid | subject sequence id |
bitscore | alignment score |
alen | alignment length |
pident | percent identity |
qlen | query sequence length |
qcov | query coverage |
slen | subject sequence length |
scov | subject coverage |
staxid | subject sequence taxid |
sname | subject sequence name |
bphage | yes for bacteriophage staxids |
species | assigned species |
genus | assigned genus |
family | assigned family |
order | assigned order |
class | assigned class |
Table 7: Columns in contigs_annot.xlsx
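The raw annotation table can be filtered on any numeric column from Table 7. The sketch below assumes that the TSV file starts with a header row containing those column names; the helper filter_by_column_min is hypothetical.

```shell
# Hypothetical helper: keep the header plus all rows of a tab-separated file
# whose value in the named column is at least MIN.
filter_by_column_min() {
    file="$1"; col="$2"; min="$3"
    awk -F '\t' -v col="$col" -v min="$min" '
        NR == 1 {
            # Locate the requested column by its header name
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        $idx + 0 >= min + 0 { print }' "$file"
}

# Example: keep hits with at least 90 percent identity
# filter_by_column_min results/M15small/contigs_annot.tsv pident 90
```

Looking the column up by name rather than by position keeps the filter working if the column order changes between versions.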
Quality control plots
Figure 5. Quality control plots
Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.
file | description |
---|---|
qc.read1.jpeg | length hist for forward reads |
qc.read2.jpeg | length hist for reverse reads |
qc.contigs.jpeg | length hist for contigs |
qc.readsurv.jpeg | read survival plots |
Table 8: Quality Control plots
Retrieving reads for a contig or taxid
Start by unzipping your trimmed read files:
gunzip -k results/M15small/read*.trim.fq.gz
To retrieve all reads mapped to contig k99.17 type:
bin/retrieve_reads -r results/M15small -v -c k99.17
To retrieve all reads mapped to Circovirus mink use the following command. Note that the exact species name may change with taxonomy updates.
bin/retrieve_reads -r results/M15small -v -s "Circovirus mink"
To retrieve all reads mapped to staxid 1239574 (Mamastrovirus) type:
bin/retrieve_reads -r results/M15small -v -t 1239574
Command line options
Short | Long | Value | Default | Description |
---|---|---|---|---|
INPUT: | ||||
-1 | --read1 | file | | PE reads, fastq with forward reads (can be gzipped) |
-2 | --read2 | file | guess from --read1 | PE reads, fastq with reverse reads (can be gzipped) |
| | --se | | false | Input reads are SE-reads. Any --read2 file will be ignored |
| | --hostgen | file | | *.fna file containing host genome. To filter host reads use --hostgen file -p flt |
| | --hgtaxid | taxid | | Map host reads to this taxid |
| | --config | file | config.yaml | Configuration file with default options |
OUTPUT: | ||||
| | --logs | dir | logs | Logs will be printed to $logs/$sample/ |
-r | --res | dir | results | Results will be printed to $res/$sample/ |
-s | --sample | str | --read1 prefix | Results will be printed to $res/$sample/ |
PARAMETERS: | ||||
-p | --pipe | str | main | Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack |
| | | pre/preprocess | | Preprocess reads, i.e. filter low quality reads |
| | | flt/filter | | Filter reads mapping to host genome using --hostgen file |
| | | ass/assemble | | Assemble reads to contigs |
| | | rea/realign | | Realign reads to contigs |
| | | ann1/annot1 | | Run 1st round annotation |
| | | ann2/annot2 | | Run 2nd round annotation |
| | | rep/report | | Create reports |
| | | sta/stats | | Create assembly stats + QC plots |
| | | pack | | Pack results into a *tar.gz in the root result directory |
| | | clean | | Remove all intermediate/temporary files |
| | | main | | Run main steps: pre,flt,ass,rea,ann1,ann2,rep,sta,pack,clean |
| | --ann1 | key | minimap,sans | List of keys defining 1st round annotation |
| | | | | MUST be in format: $search[.$dbid], where: |
| | | | | $search is a valid database search (blastn,blastp,minimap or sans) |
| | | | | $dbid is a reference database id (optional) |
| | | | | For each key there MUST be a database defined in config.yaml |
| | --ann2 | key | blastn.vi,blastp.vi | List of keys defining 2nd round annotations |
| | | | | MUST be in format: $search.$target[.$dbid], where: |
| | | | | $search is a valid database search (blastn,blastp,minimap or sans) |
| | | | | $target is a valid target (ab = Archaea+Bacteria, ph = Bacteriophages, vi = Viruses, un = Unmapped) |
| | | | | $dbid is a reference database id (optional) |
| | | | | For each key there MUST be a database defined in config.yaml |
| | --anns | key | | Apply annotation-strategy defined in config.yaml under the supplied key. Overrides any --ann1/ann2 options |
| | --ass | str | megahit | Assembler: megahit/spades |
| | --gen | str | mga | Gene prediction: mga/prod |
| | --pre | str | pre | Use fastp/trimm/none to preprocess reads |
| | --clean | | false | Delete intermediate files after each step |
-t | --numth | int | 8 | Number of threads |
-w | --wmodel | str | bitscore | Weighting model for abundance estimation: taxacount/bitscore/bitscore2 |
-v | | | false | Verbose mode |
Table 9: Lazypipe command line options.
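Because --pipe takes a free-form comma-separated list, a typo in a step name is easy to make. The sketch below checks a list against the step names in Table 9 before a long run is started. The helper invalid_pipe_steps is hypothetical and is not part of Lazypipe, which may accept names beyond those listed in the table.

```shell
# Hypothetical helper: print every step in a comma-separated --pipe value that
# is not one of the step names listed in Table 9.
invalid_pipe_steps() {
    printf '%s\n' "$1" | tr ',' '\n' | while read -r step; do
        case "$step" in
            pre|preprocess|flt|filter|ass|assemble|rea|realign) ;;
            ann1|annot1|ann2|annot2|rep|report|sta|stats) ;;
            pack|clean|main) ;;
            *) echo "$step" ;;
        esac
    done
}

# Example: invalid_pipe_steps pre,flt,ass,rea,ann1,ann2,rep
```

An empty output means that every step in the list matches a name from the table.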
Default options and additional settings are defined in the config.yaml file.
Note that command line options take precedence over options in config.yaml.
Additional options in config.yaml:
Option | Value | Description |
---|---|---|
GENERAL PARAMETERS | ||
R_call | str | Rscript or similar for calling R |
min_read2hostgen_score | num | Minimum alignment score for read mapping to hostgen |
min_orf_length | num | Minimum ORF sequence length for reporting/mapping |
min_sans_bits | num | Minimum alignment score for mapping with SANSparallel |
min_blastp_bits | num | Minimum alignment score for mapping with BLASTP |
min_blastn_bits | num | Minimum alignment score for mapping with BLASTN |
min_minimap_DPpeak_score | num | Minimum alignment score for contig mapping with minimap2 |
min_read2contig_score | num | Minimum alignment score for read mapping to contigs |
fastp_par | str | Fastp parameters |
trimm_par | str | Trimmomatic parameters. NOTE: please ensure that the $TM environment variable is pointing to the Trimmomatic installation root |
tail | percent | Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions |
tail_contig | percent | Remove taxa from contig that correspond to this percentile. Reduces noise in abundance estimation. |
trimm_sample_name | 0/1 | When setting sample-name from read1-name, trim read1-name to the first occurrence of "_" |
DEFAULT COMMAND LINE OPTIONS | ||
See Command Line Options | ||
DATABASES | ||
ann1.databases: | | Reference databases for the 1st round annotations |
minimap | path | Local Minimap2 database. Specify path to *.fasta or *.fasta.mmi file. This MUST be accompanied by a *.acc2taxid tsv-file (see default Minimap2 databases for example) |
blastn[.dbid] | path | Local blastn database. To specify several blastn databases use optional dbid (e.g. blastn.abv) |
blastp[.dbid] | path | Local blastp database. To specify several blastp databases use optional dbid (e.g. blastp.viruses) |
ann2.databases: | | Reference databases for the 2nd round annotations |
$search.$target[.dbid] | path | Generally use any valid dbsearch (minimap/blastn/blastp) and any valid target (ab/ph/vi/un) to specify databases for the 2nd round annotations |
blastn.vi[.dbid] | path | Local blastn database targeting viral sequences. To specify several databases for the same target use dbid |
blastp.vi[.dbid] | path | Local blastp database targeting viral sequences. To specify several databases for a target use dbid |
taxonomy | dir | Path to local NCBI taxonomy database. Database will be installed on demand |
taxonomy_update | 0/1 | Set to 1 to update NCBI taxonomy db |
taxonomy_update_time | num | NCBI taxonomy update frequency in days |
urls: | | URLs for retrieving databases |
taxonomy | str | URL to NCBI taxonomy (taxdump.tar.gz). This MUST be defined |
Table 10: Default options in config.yaml
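The taxonomy_update_time option expresses an age limit in days. To see what such a limit means for a file on disk, the sketch below tests whether a path was last modified more than a given number of full days ago. The helper is_older_than_days is a hypothetical illustration, and Lazypipe's own update check may work differently.

```shell
# Hypothetical illustration: succeed if the given path was last modified more
# than DAYS full days ago (find rounds the age down to whole days).
is_older_than_days() {
    [ -n "$(find "$1" -maxdepth 0 -mtime +"$2" 2>/dev/null)" ]
}

# Example (the file name is an assumption about the unpacked taxonomy dump):
# is_older_than_days "$data/taxonomy/names.dmp" 30 && echo "taxonomy is stale"
```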
- Plyusnin Ilya, Olli Vapalahti, Tarja Sironen, Ravi Kant, and Teemu Smura. “Enhanced Viral Metagenomics with Lazypipe 2.” Viruses 15, no. 2 (February 4, 2023): 431. https://doi.org/10.3390/v15020431
- Ilya Plyusnin, Ravi Kant, Anne J. Jaaskelainen, Tarja Sironen, Liisa Holm, Olli Vapalahti, Teemu Smura. (2020) Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types. Virus Evolution, veaa091, https://doi.org/10.1093/ve/veaa091
Project website: https://www.helsinki.fi/en/projects/lazypipe
Contact email: grp-lazypipe@helsinki.fi