Running Lazypipe on Puhti
Welcome to the Running Lazypipe on Puhti module. This module is intended for practicing basic NGS analysis with Lazypipe 3.0 on the CSC Puhti supercomputer. In this module you will learn to:
- set up working environment on CSC Puhti
- run Lazypipe analysis with lazypipe.pl
- run Lazypipe analysis with sbatch-lazypipe
- save/share your results with Fairdata IDA
Prerequisites:
- account on CSC Puhti
- Lazypipe 3.0 CSC module
- no experience with Unix command line or NGS analysis is required
For more information please refer to the Lazypipe User Guides.
Exercise 1: setting up working environment
In this exercise you will set up a working environment for running Lazypipe on CSC Puhti.
Connecting to CSC Puhti server
Users new to Unix/CSC working environment
Both MacOS and Windows users can access Puhti via the Puhti web-interface. We recommend this option for all users who are new to the Unix/CSC working environment:
- Login to Puhti web-interface by following the link: Puhti web-interface
- From the main Dashboard click on "Login node shell" to open the terminal
Users familiar with Unix/CSC working on MacOS
MacOS users can connect to Puhti with ssh client from Terminal.
- start by opening Terminal utility
- From the Terminal menu select Shell, then New Window, and choose Basic (black on white layout) or Homebrew (white on black layout).
- In the terminal type (replace username with your CSC username):
ssh -X username@puhti.csc.fi
Users familiar with Unix/CSC working on Windows
Download and install the PuTTY SSH client for Windows from https://www.putty.org
Start PuTTY. You will see a window with connection settings. In the “Host Name (or IP address)” field, type:
puhti.csc.fi
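Windows 10 and later also ship a built-in OpenSSH client, so PuTTY is optional. As a sketch, you can add an entry like the following to your ~/.ssh/config (the Host alias "puhti" is our example name; replace username with your CSC username) and then connect with ssh puhti from PowerShell:

```
# Example ~/.ssh/config entry (the alias "puhti" is our choice)
Host puhti
    HostName puhti.csc.fi
    User username
    ForwardX11 yes
```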
Setting up working environment
After you have logged in to Puhti continue working in your terminal. Work through the exercises by copy-pasting or typing commands to your terminal and hitting enter.
Start by checking which projects you have access to:
csc-workspaces
As an example we will use project project_2002989. However, you can use any project you have access to.
CSC supercomputers have three main disk areas: home, projappl and scratch.
For a short intro see CSC Disk Areas.
We will create directories for data in the scratch disk area and one directory for the Lazypipe application in the projappl disk area. In the following examples we will use the variable $USER, which is automatically substituted with your username. Thus, you can copy-paste the example commands to your terminal without editing.
Create a data directory named $USER in the project's scratch disk area. Create subdirectories data, results and hostgen:
mkdir /scratch/project_2002989/$USER/
mkdir /scratch/project_2002989/$USER/data
mkdir /scratch/project_2002989/$USER/results
mkdir /scratch/project_2002989/$USER/hostgen
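As an aside, the four mkdir calls can be collapsed into a single mkdir -p, which creates missing parent directories and is safe to re-run. A minimal sketch, with a local BASE directory standing in for /scratch/project_2002989/$USER:

```shell
# mkdir -p creates parents as needed and never fails on existing dirs;
# BASE is a local stand-in for /scratch/project_2002989/$USER
BASE=./scratch-demo
mkdir -p "$BASE/data" "$BASE/results" "$BASE/hostgen"
ls "$BASE"
```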
Create an application directory named $USER in the project's projappl disk area. Create a subdirectory named "lazypipe":
mkdir /projappl/project_2002989/$USER/
mkdir /projappl/project_2002989/$USER/lazypipe
It is convenient to define environment variables referring to your directories. To do this you will need to edit the .bashrc file in your home directory.
In the Puhti web-interface navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list. Locate the .bashrc file and start editing by clicking the menu next to the file name and selecting Edit. Add the following two lines to the .bashrc file and save it by clicking the Save button at the top left.
export data=/scratch/project_2002989/$USER
export lazypipe=/projappl/project_2002989/$USER/lazypipe
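If you prefer the terminal to the web editor, the same two lines can be appended from the command line. A minimal sketch that writes to a local example file first, so you can inspect the result before pointing BASHRC at the real ~/.bashrc (paths assume the example project project_2002989):

```shell
# Write to a local example file first; change BASHRC to "$HOME/.bashrc"
# once you are happy with the result.
BASHRC=./bashrc.example
touch "$BASHRC"
for line in \
  'export data=/scratch/project_2002989/$USER' \
  'export lazypipe=/projappl/project_2002989/$USER/lazypipe'
do
  # grep -qxF: append only if the exact line is not already present
  grep -qxF "$line" "$BASHRC" || printf '%s\n' "$line" >> "$BASHRC"
done
cat "$BASHRC"
```

Because the loop checks for each line before appending, the snippet is safe to run more than once.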
Now open the same file in the terminal with unix less. To navigate in less use the up/down arrows; to exit type q. You should see the added lines in the .bashrc file.
less ~/.bashrc
Load your variables (they will be loaded automatically on your next login):
source ~/.bashrc
You should now have the variables $data and $lazypipe available on the command line.
Check that these variables exist and point to the right directories by using echo:
echo $data echo $lazypipe
These should print the full paths to your data and application directories (with $USER replaced by your username):
/scratch/project_2002989/$USER
/projappl/project_2002989/$USER/lazypipe
Now check that the directories exist by listing their content with ls (note that $lazypipe is empty at this point):
ls $data
ls $lazypipe
Loading modules and creating config.yaml
Go to your Lazypipe application directory and load the required modules:
cd $lazypipe
module load r-env-singularity
module load biokit
module load lazypipe
Copy the default config.yaml file to your application directory. Then set the tmpdir Lazypipe variable to point to your application directory and the taxonomy variable to point to the taxonomy subdirectory:
cd $lazypipe
cp /appl/soft/bio/lazypipe/3.0/lazypipe/config.yaml config.yaml
echo "tmpdir: $lazypipe" >> config.yaml
echo "taxonomy: $lazypipe/taxonomy" >> config.yaml
Test-run lazypipe.pl: the command should print the command-line manual:
lazypipe.pl -h
Exercise 2: Running Lazypipe with lazypipe.pl
In this exercise you will get familiar with basic Lazypipe commands.
According to CSC user policy: “The login nodes can be used for light pre- and postprocessing, compiling applications and moving data. All other tasks are to be done on the compute nodes using the batch job system.”
We will run this example on the login node because it is small scale.
Start by copying sample PE data to your $data/data directory:
cp /appl/soft/bio/lazypipe/3.0/lazypipe/data/samples/M15small_R* $data/data/
Run read preprocessing:
cd $lazypipe
lazypipe.pl -1 $data/data/M15small_R1.fastq --pipe pre -t 8 -v
Run host filtering. Start by downloading the Neovison vison (American mink) genome. Note that filtering host reads against a newly downloaded genome takes extra time, because the genome must first be indexed. If you are short on time you can skip this step.
mkdir -p $data/hostgen
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -P $data/hostgen/
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v
Run assembling:
lazypipe.pl -1 $data/data/M15small_R1.fastq -p ass -t 8 -v
Run read realignment to the created assembly:
lazypipe.pl -1 $data/data/M15small_R1.fastq -p rea -t 8 -v
Run 1st round annotation with Minimap2 against the minimap.refseq database defined in config.yaml (by default this points to RefSeq archaea+bacteria+viruses):
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v
Run 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append the results to the Minimap2 annotations from the previous step:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v
Run 2nd round annotation. In the second round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on the labeling from the 1st round. Local databases for the 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN against blastn.vi.refseq (RefSeq viruses) type:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p ann2 --ann2 blastn.vi.refseq -t 8 -v
Use the annotation strategies defined in config.yaml to run common combinations of 1st and 2nd round annotations. For example, the --anns vi.refseq annotation strategy is equivalent to --ann1 minimap.vi.refseq --ann2 blastn.vi.refseq. To run type:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq --anns abv.refseq -t 8 -v
Generate reports based on created annotations:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p rep -t 8 -v
Generate assembly stats, pack for sharing and remove temporary files:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p stats,pack,clean -t 8 -v
Use the main tag to refer to the main pipeline steps (pre,ass,rea,ann,rep,stats,pack,clean). For example, to run the whole pipeline annotating only viral sequences against RefSeq type:
perl lazypipe.pl -1 $data/data/M15small_R1.fastq -p main --anns vi.refseq -t 8 -v --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
Your results are output to $res/$sample, where $res is the root result directory and $sample is the input sample name. By default results are output to results/read1-filename.
Check the content of your result directory:
ls -l results
ls -l results/M15small*
Exercise 3: Running Lazypipe with sbatch-lazypipe
sbatch-lazypipe is a helper tool that automatically generates a configuration file and a batch job file for a Lazypipe run and submits the job to the batch job system of Puhti. The command uses the same command-line options as lazypipe.pl. In addition, sbatch-lazypipe asks the user to define batch job resources (account, run time, memory, number of cores). The required memory and time depend on the size of your input library. As a rule of thumb we recommend 5 GB of memory per core (e.g. 80 GB for 16 cores).
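The 5 GB-per-core rule of thumb is easy to compute before answering the prompts. A small sketch (mem_for_cores is our helper name, not part of sbatch-lazypipe):

```shell
# Rule of thumb from the text above: request ~5 GB of memory per core.
mem_for_cores() {
  echo "$(( $1 * 5 ))G"
}
echo "8 cores  -> $(mem_for_cores 8)"
echo "16 cores -> $(mem_for_cores 16)"
```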
Run the default analysis for the M15small_R1.fastq sample and output the results to $data/results/M15_ex3. Note that in the following call the main pipeline steps (pre,ass,rea,ann,rep,stats,pack,clean) are referred to using the main tag. When prompted, set the run time to 1 hour (1:00:0), memory to the default (~32 GB) and cores to 8.
sbatch-lazypipe -1 $data/data/M15small_R1.fastq -p main --anns abv.refseq -r $data/results -s M15_ex3 -v
Check that the job is queued or running:
sacct
While this analysis is running you can move on to the next exercise.
Exercise 4: saving/sharing results on ida.fairdata.fi
Set up ida to connect to your designated project by editing the .ida-config file in your home directory.
Login to Puhti web-interface by following the link: Puhti web-interface.
Navigate to your "Home Directory". Click the "Show Dotfiles" checkbox at the top of your file list.
Locate the .ida-config file and start editing by clicking the menu next to the file name and selecting Edit. If you don't have a .ida-config file, create it by clicking "New File" at the top panel.
Add the following two lines to the .ida-config file. In this exercise we use project 2002989, but you can use any CSC project you have access to. Save the file by clicking the Save button at the top left.
IDA_PROJECT="2002989"
IDA_HOST="https://ida.fairdata.fi"
Now open the same file in the terminal with unix less. You should see the added lines in the .ida-config file. Exit less by typing q:
Now you can upload your results to ida.fairdata.fi with the ida module. Move to your result directory and check that you have M15_ex3.tar.gz (or similar) ready for saving/sharing.
cd $data/results
ls -l
Load ida module and start uploading (change my_dir to the name of subdirectory you wish to create and load your data to on Fairdata IDA):
module load ida
ida upload my_dir/M15small.tar.gz M15small.tar.gz
When the upload completes you should see the uploaded file appear under 2002989+/my_dir/.
You can also save results to IDA by first downloading them to your computer:
- Login to Puhti web-interface by following the link: Puhti web-interface
- Navigate to your result directory /scratch/project_2002989/username/results. Locate your result file (e.g. M15small.tar.gz), click on the menu and select Download
- Open Fairdata IDA in your web-browser and login.
- Navigate to the Staging 2002989 project (or your designated project) and your subdirectory. Upload the results by clicking the "+" sign at the top panel and selecting the downloaded M15small.tar.gz file.
End notes
This completes the Running Lazypipe on Puhti module.
For more information see the Lazypipe User Guides.