Running Lazypipe on Puhti

Welcome to Running Lazypipe on Puhti. This module is intended for practicing basic NGS (next-generation sequencing) analysis with Lazypipe 3.0 on the CSC Puhti supercluster. In this module you will learn to:

  • set up a working environment on CSC Puhti
  • run Lazypipe analysis with lazypipe.pl
  • run Lazypipe analysis with sbatch-lazypipe
  • save/share your results with Fairdata IDA

Prerequisites:

  • an account on CSC Puhti
  • the Lazypipe 3.0 CSC module
  • no prior experience with the Unix command line or NGS analysis is required

For more information, please refer to the Lazypipe User Guides.

Exercise 1: setting up working environment

In this exercise you will set up a working environment for running Lazypipe on CSC Puhti.

Connecting to CSC Puhti server

Users new to Unix/CSC working environment

Both MacOS and Windows users can access Puhti via the Puhti web-interface. We recommend this option for all users who are new to the Unix/CSC working environment:

  • Log in to the Puhti web-interface by following the link: Puhti web-interface
  • From the main Dashboard click on "Login node shell" to open the terminal

Users familiar with Unix/CSC working on MacOS

MacOS users can connect to Puhti with the ssh client from the Terminal.

  • Start by opening the Terminal utility.
  • From the Terminal menu select Shell, New Window, and then Basic (black on white layout) or Homebrew (white on black layout).
  • In the terminal type (change username to your username):
ssh -X username@puhti.csc.fi

Users familiar with Unix/CSC working on Windows

Download and install the PuTTY SSH client for Windows from https://www.putty.org

Start PuTTY. You will see a window with connection settings. In the “Host Name (or IP address)” field, type:

puhti.csc.fi
Make sure that the “Connection type” is SSH. Click “Open”. A small window will appear where you are asked to enter your username and password.
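
Recent Windows 10/11 versions also include a built-in OpenSSH client, so as an alternative to PuTTY you can connect directly from PowerShell or Command Prompt (change username to your CSC username):

ssh username@puhti.csc.fi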

Setting up working environment

After you have logged in to Puhti, continue working in your terminal. Work through the exercises by copy-pasting or typing the commands into your terminal and pressing Enter.

Start by checking which projects you have access to:

csc-workspaces

As an example we will use project project_2002989. However, you can use any project you have access to.

CSC supercomputers have three main disk areas: home, projappl and scratch. For a short intro see CSC Disk Areas. We will create directories for data in the scratch disk area and one directory for the Lazypipe application in the projappl disk area. In the following examples we use the variable $USER, which the shell automatically substitutes with your username. Thus, you can copy-paste the example commands into your terminal without editing them.
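
You can check what $USER expands to before creating any directories:

echo $USER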

Create a data directory named $USER in the project's scratch disk area. Create subdirectories data, results and hostgen:

mkdir /scratch/project_2002989/$USER/
mkdir /scratch/project_2002989/$USER/data
mkdir /scratch/project_2002989/$USER/results
mkdir /scratch/project_2002989/$USER/hostgen

Create an application directory named $USER in the project's projappl disk area. Create a subdirectory named "lazypipe":

mkdir /projappl/project_2002989/$USER/
mkdir /projappl/project_2002989/$USER/lazypipe

It is convenient to define environment variables referring to your directories. To do this you will need to edit the .bashrc file in your home directory. In the Puhti web-interface navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list. Locate the .bashrc file and start editing it by clicking on the menu next to the file name and selecting Edit.

In the .bashrc file add the following two lines and save the file by clicking the Save button at the top left.

export data=/scratch/project_2002989/$USER
export lazypipe=/projappl/project_2002989/$USER/lazypipe
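
If you prefer working in the terminal, you can append the same two lines from the command line instead (a minimal sketch; the single quotes keep $USER unexpanded in the file, so it is evaluated at login):

echo 'export data=/scratch/project_2002989/$USER' >> ~/.bashrc
echo 'export lazypipe=/projappl/project_2002989/$USER/lazypipe' >> ~/.bashrc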

Now open the same file in the terminal with the Unix less utility. To navigate in less use the up/down arrow keys; to exit type q. You should see the added lines in the .bashrc file.

less ~/.bashrc

Load your variables (they will be loaded automatically on your next login):

source ~/.bashrc

You should now have variables $data and $lazypipe available on the command line. Check that these variables exist and point to the right directories by using echo:

echo $data
echo $lazypipe

These should print the full paths to your data and application directories (with $USER replaced by your username):

/scratch/project_2002989/$USER
/projappl/project_2002989/$USER/lazypipe

Now check that the directories exist by listing the directory contents with ls (note that $lazypipe is still empty at this point):

ls $data
ls $lazypipe

Loading modules and creating config.yaml

Go to your Lazypipe application directory and load the required modules:

cd $lazypipe
module load r-env-singularity
module load biokit
module load lazypipe

Copy the default config.yaml file to your application directory. Then set the Lazypipe tmpdir variable to point to your application directory and the taxonomy variable to point to its taxonomy subdirectory:

cd $lazypipe
cp /appl/soft/bio/lazypipe/3.0/lazypipe/config.yaml config.yaml
echo tmpdir:  "$lazypipe" >> config.yaml
echo taxonomy:  "$lazypipe/taxonomy" >> config.yaml
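
Check that the two lines were appended to the end of config.yaml:

tail -n 2 config.yaml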

Test-run lazypipe.pl; the command should print the command-line user manual:

lazypipe.pl -h

Exercise 2: Running Lazypipe with lazypipe.pl

In this exercise you will get familiar with basic Lazypipe commands.

According to CSC user policy: “The login nodes can be used for light pre- and postprocessing, compiling applications and moving data. All other tasks are to be done on the compute nodes using the batch job system.”

We will run this example on the login node because it is small-scale.

Start by copying the sample paired-end (PE) data to your $data/data directory:

cp /appl/soft/bio/lazypipe/3.0/lazypipe/data/samples/M15small_R* $data/data/
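
Check that both read files were copied:

ls -l $data/data/M15small_R*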

Run read preprocessing:

cd $lazypipe
lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe pre -t 8 -v

Run host filtering. Start by downloading the Neovison vison (American mink) genome. Note that when filtering host reads against a newly downloaded genome, Lazypipe first needs to index the genome, which takes some time. If you are short on time you can skip this step.

mkdir -p $data/hostgen
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -P $data/hostgen/
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v

Run assembling:

lazypipe.pl -1 $data/data/M15small_R1.fastq -p ass -t 8 -v

Run read realignment to the created assembly:

lazypipe.pl -1 $data/data/M15small_R1.fastq -p rea -t 8 -v

Run the 1st round annotation with Minimap2 against the minimap.refseq database defined in config.yaml (by default this points to RefSeq archaea+bacteria+viruses):

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v

Run the 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append the results to the Minimap2 annotations from the previous step:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v

Run the 2nd round annotation. In the second round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on the labeling from the 1st round. Local databases for the 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN against blastn.vi.refseq (RefSeq viruses) type:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann2 --ann2 blastn.vi.refseq -t 8 -v
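
If you want to see which local databases this installation defines for the 2nd round, you can inspect your config.yaml directly (a simple grep sketch; the exact layout of the ann2 section may vary):

grep -n -A 10 "ann2" $lazypipe/config.yaml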

Use the annotation strategies defined in config.yaml to run common combinations of 1st and 2nd round annotations. For example, the --anns vi.refseq annotation strategy is equivalent to --ann1 minimap.vi.refseq --ann2 blastn.vi.refseq. To run it type:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq --anns abv.refseq -t 8 -v

Generate reports based on created annotations:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p rep -t 8 -v

Generate assembly stats, pack for sharing and remove temporary files:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p stats,pack,clean -t 8 -v

Use the main tag to refer to the main pipeline steps (pre,ass,rea,ann,rep,stats,pack,clean). For example, to run the whole pipeline annotating only viral sequences against RefSeq type:

perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p main --anns vi.refseq -t 8 -v --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz

Your results are output to $res/$sample, where $res is the root result directory and $sample is the input sample name. By default results are output to results/read1-filename. Check the contents of your result directory:

ls -l results
ls -l results/M15small*

Exercise 3: Running Lazypipe with sbatch-lazypipe

sbatch-lazypipe is a helper tool that automatically generates a configuration file and a batch job file for a Lazypipe run and submits the job to the Puhti batch job system. The command uses the same command-line options as lazypipe.pl. In addition, sbatch-lazypipe asks the user to define the batch job resources (account, run time, memory, number of cores). The required memory and time depend on the size of your input library. As a rule of thumb we recommend using 5 GB of memory per core (e.g. 80 GB for 16 cores).

Run the default analysis for the M15small_R1.fastq sample and output the results to $data/results/M15_ex3. Note that in the following call the main pipeline steps (pre,ass,rea,ann,rep,stats,pack,clean) are referred to with the main tag. When prompted, set the run time to 1 hour (1:00:00), memory to the default (~32 GB) and the number of cores to 8.

sbatch-lazypipe -1 $data/data/M15small_R1.fastq -p main --anns abv.refseq -r $data/results -s M15_ex3 -v

Check that the job is queued or running:

sacct
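
You can also follow the job with squeue and, once it has finished, check the used resources with seff (replace <jobid> with the Slurm job id shown in the sacct output):

squeue -u $USER
seff <jobid>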

While this analysis is running you can move on to the next exercise.

Exercise 4: saving/sharing results on ida.fairdata.fi

Set up ida to connect to your designated project by editing the .ida-config file in your home directory.

Log in to the Puhti web-interface by following the link: Puhti web-interface. Navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list. Locate the .ida-config file and start editing it by clicking on the menu next to the file name and selecting Edit. If you don't have a .ida-config file, create it by clicking "New File" in the top panel.

In the .ida-config file add the following two lines. In this exercise we use project 2002989, but you can use any CSC project you have access to. Save the file by clicking the Save button at the top left.

IDA_PROJECT="2002989"
IDA_HOST="https://ida.fairdata.fi"
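
Alternatively, you can create the file directly from the terminal (a minimal sketch; substitute your own project number, and note that the first command overwrites any existing .ida-config):

echo 'IDA_PROJECT="2002989"' > ~/.ida-config
echo 'IDA_HOST="https://ida.fairdata.fi"' >> ~/.ida-config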

Now open the same file in the terminal with less. You should see the added lines in the .ida-config file. Exit less by typing q:

less ~/.ida-config

Now you can upload your results to ida.fairdata.fi with the ida module. Move to your result directory and check that you have the M15_ex3.tar.gz (or similar) file ready for saving/sharing.

cd $data/results
ls -l

Load the ida module and start the upload (change my_dir to the name of the subdirectory you wish to create and upload your data to on Fairdata IDA):

module load ida
ida upload my_dir/M15_ex3.tar.gz M15_ex3.tar.gz

When the upload completes you should see the uploaded file appear under 2002989+/my_dir/.

You can also save results to IDA by first downloading them to your computer:

  • Log in to the Puhti web-interface by following the link: Puhti web-interface
  • Navigate to your result directory /scratch/project_2002989/username/results. Locate your result file (e.g. M15small.tar.gz), click on the menu and select Download.
  • Open Fairdata IDA in your web browser and log in.
  • Navigate to the Staging area of project 2002989 (or your designated project) and to your subdirectory. Upload the results by clicking the "+" sign in the top panel and selecting the downloaded M15small.tar.gz file.

End notes

This completes the Running Lazypipe on Puhti module.

For more information see Lazypipe User Guides
