Running Lazypipe on Puhti

Welcome to Running Lazypipe on Puhti. This module is intended for practicing basic NGS analysis with Lazypipe 2.1 on the CSC Puhti supercomputer. In this module you will learn to:

set up working environment on CSC Puhti
run Lazypipe analysis with lazypipe.pl
run Lazypipe analysis with sbatch-lazypipe
save/share your results with Fairdata IDA

Prerequisites:

account on CSC Puhti
Lazypipe 2.1 CSC module
no experience with Unix command line or NGS analysis is required

For more information please refer to these guides:

Exercise 1: setting up working environment

In this exercise you will set up a working environment for running Lazypipe on CSC Puhti.

Connecting to CSC Puhti server

Users new to Unix/CSC working environment:

Both macOS and Windows users can access Puhti via the Puhti web interface. We recommend this option for all users who are new to the Unix/CSC working environment:

Log in to the Puhti web interface by following the link: Puhti web-interface
From the main Dashboard click on "Login node shell" to open the terminal

Experienced Unix/CSC users working on MacOS:

MacOS users can connect to Puhti with ssh client from Terminal.

Start by opening the Terminal utility: from the Finder menu select Go and Applications, then from Utilities select Terminal.
From the Terminal menu select Shell, New Window and Basic (black on white layout) or Homebrew (white on black layout).
In the terminal type (change username to your username):

ssh -X username@puhti.csc.fi

Experienced Unix/CSC users working on Windows:

Download and install the PuTTY SSH client for Windows from https://www.putty.org

Start PuTTY. You will see a window with connection settings. In the “Host Name (or IP address)” field, type:

puhti.csc.fi

Make sure that the “Connection type” is SSH. Hit “Open”. A small window will appear where you are asked to enter your username and password.

Setting up working environment

After you have logged in to Puhti continue working in your terminal. Work through the exercises by copy-pasting or typing commands to your terminal and hitting enter.

Start by checking which projects you have access to:

csc-workspaces

As an example we will use project project_2002989. However, you can use any project you have access to.

CSC supercomputers have three main disk areas: home, projappl and scratch. For a short intro see CSC Disk Areas. We will create directories for data in the scratch disk area and one directory for the Lazypipe application in the projappl disk area. The following examples use the variable $USER, which is automatically substituted with your username. Thus, you can copy-paste the example commands into your terminal without editing them, as long as you use the example project.

Create a data directory named $USER in the project's scratch disk area. Create subdirectories data and results:

mkdir /scratch/project_2002989/$USER/
mkdir /scratch/project_2002989/$USER/data
mkdir /scratch/project_2002989/$USER/results

Create an application directory named $USER in the project's projappl disk area. Create a subdirectory named "lazypipe":

mkdir /projappl/project_2002989/$USER/
mkdir /projappl/project_2002989/$USER/lazypipe
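
The directory layout above can also be created in one repeatable step. The following is a minimal sketch, not part of Lazypipe or the CSC tools: make_layout is a hypothetical helper name, and the demonstration runs in a temporary directory with a made-up user name (demo_user) so that it can be tried anywhere.

```shell
# Hypothetical helper (not part of Lazypipe or the CSC tools).
# $1 = project scratch root, $2 = project projappl root, $3 = user name
make_layout() {
    # -p creates missing parent directories and succeeds if they already exist
    mkdir -p "$1/$3/data" "$1/$3/results" "$2/$3/lazypipe"
}

# Try it in a throwaway directory instead of the real /scratch and /projappl.
demo_root="$(mktemp -d)"
make_layout "$demo_root/scratch/project_2002989" "$demo_root/projappl/project_2002989" "demo_user"
# Calling it a second time is safe because of mkdir -p.
make_layout "$demo_root/scratch/project_2002989" "$demo_root/projappl/project_2002989" "demo_user"
find "$demo_root" -type d | sort
```

On Puhti the equivalent call would be make_layout /scratch/project_2002989 /projappl/project_2002989 "$USER".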

It is convenient to define environment variables referring to your directories. To do this you will need to edit the .bashrc file in your home directory. In the Puhti web interface navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list. Locate the .bashrc file and start editing by clicking on the menu next to the file name and selecting Edit.

In the .bashrc file add the following two lines and save the file by clicking the Save button at the top left.

export data=/scratch/project_2002989/$USER
export lazypipe=/projappl/project_2002989/$USER/lazypipe
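
For users who prefer the terminal, the same two lines could be appended from the command line. This is only a sketch: append_once is a hypothetical helper, and it is demonstrated on a temporary file rather than on your real ~/.bashrc. It adds a line only when that exact line is not already present, so repeating the command does not create duplicates.

```shell
# Hypothetical helper: append a line to a file only if that exact line is missing.
append_once() {
    # $1 = file, $2 = line
    # -x matches whole lines only, -F treats the line as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demonstrated on a temporary file; on Puhti the target would be ~/.bashrc.
demo_rc="$(mktemp)"
# Single quotes keep $USER literal so that it is expanded at login, not now.
append_once "$demo_rc" 'export data=/scratch/project_2002989/$USER'
append_once "$demo_rc" 'export lazypipe=/projappl/project_2002989/$USER/lazypipe'
append_once "$demo_rc" 'export data=/scratch/project_2002989/$USER'   # no duplicate is added
cat "$demo_rc"
```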

Now open the same file in the terminal with the Unix less command. To navigate in less use the up/down arrows; to exit type q. You should see the added lines in the .bashrc file.

less ~/.bashrc

Load your variables (will autoload on the next login):

source ~/.bashrc

You should now have variables $data and $lazypipe available on the command line. Check that these variables exist and point to the right directories by using echo:

echo $data
echo $lazypipe

These should print the full paths to your data and application directories:

/scratch/project_2002989/username
/projappl/project_2002989/username/lazypipe

Now check that the directories exist by listing their content with ls (note that $lazypipe remains empty at this point):

ls $data
ls $lazypipe
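
The same check can be scripted so that an unset variable or a missing directory produces a clear message. The sketch below uses a hypothetical helper, check_dir, and a temporary directory standing in for $data.

```shell
# Hypothetical helper: report whether a variable holds an existing directory.
check_dir() {
    # $1 = variable name (used in the message only), $2 = its value
    if [ -z "$2" ]; then
        echo "$1 is not set"
        return 1
    elif [ ! -d "$2" ]; then
        echo "$1 points to a missing directory: $2"
        return 1
    fi
    echo "$1 OK: $2"
}

# Demonstration with a temporary directory standing in for $data.
demo_data="$(mktemp -d)"
check_dir data "$demo_data"
# An empty value simulates a variable that was never exported.
check_dir lazypipe "" || echo "check ~/.bashrc and run: source ~/.bashrc"
```

On Puhti you would call check_dir data "$data" and check_dir lazypipe "$lazypipe".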

Loading modules and creating config.yaml

Go to your Lazypipe application directory and load the required modules:

cd $lazypipe
module load r-env-singularity
module load biokit
module load lazypipe

Copy the default config.yaml file to your application directory. Then set the Lazypipe tmpdir variable to point to your application directory and the taxonomy variable to point to its taxonomy subdirectory:

cd $lazypipe
cp /appl/soft/bio/lazypipe/2.1/lazypipe/config.yaml config.yaml
echo tmpdir:  "$lazypipe" >> config.yaml
echo taxonomy:  "$lazypipe/taxonomy" >> config.yaml
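
Note that the echo commands above append a new line each time they are run. How Lazypipe's configuration parser treats a repeated key is not stated here, so if you need to rerun this step it may be safer to replace an earlier setting instead of appending another one. The sketch below shows one way to do that. The helper set_yaml_key and the key example_key are hypothetical, the helper handles only simple top-level "key: value" lines, and the demonstration uses a small stand-in file rather than the real config.yaml.

```shell
# Hypothetical helper: set a top-level "key: value" line exactly once.
set_yaml_key() {
    # $1 = file, $2 = key, $3 = value
    tmp_file="$(mktemp)"
    # Keep every line except earlier settings of this key.
    # grep exits non-zero when it prints nothing, hence "|| true".
    grep -v "^$2:" "$1" > "$tmp_file" || true
    printf '%s: %s\n' "$2" "$3" >> "$tmp_file"
    # Write back through redirection so the original file keeps its permissions.
    cat "$tmp_file" > "$1"
    rm -f "$tmp_file"
}

# Demonstration on a small stand-in file, not the real Lazypipe config.yaml.
demo_cfg="$(mktemp)"
printf 'tmpdir: /old/tmp\nexample_key: 1\n' > "$demo_cfg"
set_yaml_key "$demo_cfg" tmpdir /projappl/project_2002989/demo_user/lazypipe
set_yaml_key "$demo_cfg" tmpdir /projappl/project_2002989/demo_user/lazypipe   # still one tmpdir line
set_yaml_key "$demo_cfg" taxonomy /projappl/project_2002989/demo_user/lazypipe/taxonomy
cat "$demo_cfg"
```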

Test-run lazypipe.pl. The command should print the command-line user manual:

lazypipe.pl -h

Exercise 2: Running Lazypipe with lazypipe.pl

In this exercise you will get familiar with basic Lazypipe commands.

According to CSC user policy: “The login nodes can be used for light pre- and postprocessing, compiling applications and moving data. All other tasks are to be done on the compute nodes using the batch job system.”

We will run this example on the login node because it is small scale.

Start by copying the sample paired-end (PE) data to your $data/data directory:

cp /appl/soft/bio/lazypipe/2.1/lazypipe/data/samples/M15small_R* $data/data/

Run read preprocessing:

cd $lazypipe
lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe pre -t 4 -v

Run assembly:

lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe ass -t 4 -v

Run read realignment to the created assembly:

lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe rea -t 4 -v

Run 1st round annotation with SANSparallel against UniProt TrEMBL:

lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe ann --ann sans -t 4 -v

Generate reports (this must be done before the 2nd round annotation):

lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe rep -t 4 -v

Run 2nd round annotation with blastn against GenBank virus genomes:

lazypipe.pl -1 $data/data/M15small_R1.fastq -p blastv -t 4 -v

Generate assembly stats, pack for sharing and clean up temporary files:

lazypipe.pl -1 $data/data/M15small_R1.fastq -p stats,pack,clean -t 4 -v
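
The sequence of commands in this exercise can be collected into a small loop that stops at the first failing step. This is a sketch only: run and run_steps are hypothetical helper names, the options are copied verbatim from the commands above, and by default the loop only prints the commands (a dry run) so that it can be inspected without Lazypipe installed.

```shell
# Hypothetical wrapper: print each command (DRY_RUN=1, the default here) or execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the tutorial's steps in order for one read file ($1), stopping at the first failure.
run_steps() {
    for step in "--pipe pre" "--pipe ass" "--pipe rea" "--pipe ann --ann sans" \
                "--pipe rep" "-p blastv" "-p stats,pack,clean"; do
        # $step is intentionally unquoted so that it splits into separate options
        run lazypipe.pl -1 "$1" $step -t 4 -v || { echo "failed at: $step" >&2; return 1; }
    done
}

# Dry run: prints the seven commands without executing them.
run_steps "${data:-/scratch/project_2002989/demo_user}/data/M15small_R1.fastq"
```

To actually execute the steps on Puhti, set DRY_RUN=0 before calling run_steps "$data/data/M15small_R1.fastq" from your $lazypipe directory.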

Your results are output to $res/$sample, where $res is the root result directory and $sample is the input sample name. By default results are output to results/read1-filename. Check the content of your result directory:

ls -l results
ls -l results/M15small*

Exercise 3: Running Lazypipe with sbatch-lazypipe

sbatch-lazypipe is a helper tool that automatically generates a configuration file and a batch job file for a Lazypipe run and submits the job to the batch job system of Puhti. The command uses the same command-line options as the lazypipe.pl command. In addition, sbatch-lazypipe asks the user to define batch job resources (account, run time, memory, number of cores). The required memory and time will depend on the size of your input library. As a rule of thumb we recommend using 5 GB of memory per core (e.g. 80 GB for 16 cores).
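
The rule of thumb above is simple arithmetic and can be computed in the shell before answering the sbatch-lazypipe prompts. The function name mem_for_cores is a hypothetical helper for illustration.

```shell
# Rule of thumb from the text above: 5 GB of memory per core.
mem_for_cores() {
    # $1 = number of cores; prints the suggested memory in GB
    echo $(( $1 * 5 ))
}

mem_for_cores 16   # prints 80, matching the example in the text
mem_for_cores 8
```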

Run the default analysis for the M15small_R1.fastq sample and output results to $data/results/M15_ex3. Note that in the following call the main pipeline steps (pre,ass,rea,ann,rep,stats,pack,clean) are referred to with the main tag. When prompted, set run-time to 5 min (0:5:0), memory to default (~32 GB) and cores to 8.

sbatch-lazypipe -1 $data/data/M15small_R1.fastq --pipe main -r $data/results -s M15_ex3 -v

Check that the job is queued or running:

sacct

After your job completes, run the 2nd round annotation with blastn, then redo reporting and repack the results. Make sure you specify the same --res directory and --sample name as before. When prompted, set run-time to 5 min (0:5:0), memory to default (~32 GB) and cores to 8.

sbatch-lazypipe -1 $data/data/M15small_R1.fastq --pipe blastv,rep,pack -r $data/results -s M15_ex3 -v

While your 2nd round annotation is running, start another job that will output results to a different location. Use Minimap2 for the 1st round annotation and blastn against GenBank viruses for the 2nd round annotation. For this job we recommend setting run-time to 1 h (1:00:0), memory to 120 GB and number of cores to 32.

sbatch-lazypipe -1 $data/data/M15small_R1.fastq --pipe main,blastv --ann minimap -r $data/results -s M15_ex3.2 -v

While this analysis is running you can move on to the next exercise.

Exercise 4: saving/sharing results on ida.fairdata.fi

Set up ida to connect to your designated project by editing the .ida-config file in your home directory.

Log in to the Puhti web interface by following the link: Puhti web-interface. Navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list. Locate the .ida-config file and start editing by clicking on the menu next to the file name and selecting Edit. If you don't have an .ida-config file, create it by clicking "New File" in the top panel.

In the .ida-config file add the following two lines. In this exercise we use project 2002989 but you can use any CSC project you have access to. Save the file by clicking the Save button at the top left.

IDA_PROJECT="2002989"
IDA_HOST="https://ida.fairdata.fi"
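
The same file could also be created from the terminal. This is a minimal sketch under stated assumptions: write_ida_config is a hypothetical helper, it writes only the two settings shown above, and it refuses to overwrite an existing file. The demonstration uses a temporary path instead of your real ~/.ida-config.

```shell
# Hypothetical helper: write the two settings to a config file if it does not exist yet.
write_ida_config() {
    # $1 = path to the config file, $2 = project number
    if [ -e "$1" ]; then
        echo "$1 already exists, leaving it unchanged"
        return 0
    fi
    cat > "$1" <<EOF
IDA_PROJECT="$2"
IDA_HOST="https://ida.fairdata.fi"
EOF
}

# Demonstrated on a temporary path; on Puhti the target would be ~/.ida-config.
demo_ida="$(mktemp -d)/.ida-config"
write_ida_config "$demo_ida" 2002989
cat "$demo_ida"
```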

Now open the same file in the terminal with the Unix less command. You should see the added lines in the .ida-config file. Exit less by typing q:

less ~/.ida-config

Now you can upload your results to ida.fairdata.fi with the ida module. Move to your result directory and check that you have the M15_ex3.tar.gz (or similar) file ready for saving/sharing.

cd $data/results
ls -l
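
Before uploading, it can be worth confirming that the packed file is a readable archive. The sketch below uses the standard tar listing option for this; check_archive is a hypothetical helper, and the demonstration builds a tiny archive (M15_demo.tar.gz, a made-up name) in a temporary directory.

```shell
# Hypothetical helper: succeed only if the file is a readable gzip-compressed tar archive.
check_archive() {
    # tar -tzf lists the contents without extracting anything
    if tar -tzf "$1" > /dev/null 2>&1; then
        echo "$1 looks OK ($(tar -tzf "$1" | wc -l | tr -d ' ') entries)"
    else
        echo "$1 is missing or not a valid .tar.gz" >&2
        return 1
    fi
}

# Demonstration with a tiny archive built in a temporary directory.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/M15_demo"
echo "demo" > "$demo_dir/M15_demo/report.txt"
tar -czf "$demo_dir/M15_demo.tar.gz" -C "$demo_dir" M15_demo
check_archive "$demo_dir/M15_demo.tar.gz"
```

On Puhti you would call check_archive M15_ex3.tar.gz from $data/results.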

Load the ida module and start uploading (change my_dir to the name of the subdirectory you wish to create on Fairdata IDA and upload your data to):

module load ida
ida upload my_dir/M15_ex3.tar.gz M15_ex3.tar.gz

When the upload completes you should see the uploaded file appear under 2002989+/my_dir/.

You can also save results to IDA by first downloading them to your computer:

Log in to the Puhti web interface by following the link: Puhti web-interface
Navigate to your result directory /scratch/project_2002989/username/results. Locate your result file (e.g. M15_ex3.tar.gz), click on the menu and select Download
Open Fairdata IDA in your web browser and log in.
Navigate to the Staging 2002989 project (or your designated project) and your subdirectory. Upload the results by clicking the "+" sign in the top panel and selecting the downloaded M15_ex3.tar.gz file.

End notes

This completes the Running Lazypipe on Puhti module.

For more information see Lazypipe User Guides
