Wiki

Clone wiki

sequencing / Applications_QC

Here we will describe 3 ways to obtain a full report on the initial quality of raw reads (fastq files) using QCFasta function included with Anduril. QCFasta not only creates a web report but also performs quality trimming of reads and discarding of low quality samples.

For all three cases we will assume that you have a CSV file with a column named Key (which has the ids of your samples) and a column named File (with the paths to the fastq file). If you have paired end reads you will need one csv file for the reads and one for the mates. Example fastq files can be found in the testcases of QCFasta (https://bitbucket.org/anduril-dev/sequencing/src/anduril2/functions/QCFasta/testcases/). The reads should be names "sample1_1.fq" etc. and the mates "sample1_2.fq" etc. or the suffices changed below.

QCFasta can use both compressed or uncompressed fastq files and you can choose if you want the output fastq files to be compressed as well.

So the most simple way to trim your files and create the report is the following:

#!/usr/bin/env anduril
//$OPT --wrapper slurm-prefix
//$OPT --pipe "tee $ANDURIL_EXECUTION_DIR/_log"
//$OPT --pipe "anduril-pager --ls -t --title $ANDURIL_EXECUTION_DIR"
//$OPT --threads 4
//$POST anduril-result-browser --unused "$ANDURIL_EXECUTION_DIR"

import anduril.builtin._                                                                                                                                                
import anduril.microarray._                                                                                                                                             
import anduril.sequencing._                                                                                                                                             
import anduril.tools._                                                                                                                                                  
import org.anduril.runtime._                                                                                                                                            

object qcfasta {                                                                                                                                                        
  val reads = INPUT(path="files/reads.csv")                                                                                                                 
  val mates = INPUT(path="files/mates.csv")                                                                                                                 

  val allReads = CSV2Array(in = reads.out, keys="column")                                                                                                               
  val allMates = CSV2Array(in = mates.out, keys="column")                                                                                                               

  val qc = QCFasta( reads = allReads.out,                                                                                                                               
                    mates= allMates.out,                                                                                                                                
                    tool            = "trimmomatic",                                                                                                                    
                    headcrop        = 10,                                                                                                                               
                    slidingWindow   = "5:20",                                                                                                                           
                    adapterSeq      = "TruSeq2-PE.fa",                                                                                                                  
                    gzip            = true,                                                                                                                             
                    minQuality      = 20,                                                                                                                               
                    trailing        = 20,                                                                                                                               
                    percent         = 70,                                                                                                                               
                    threads         = 6,
                    mateKey         = "_2",
                    readKey         = "_1")
}
In this example we are using Trimmomatic for trimming, but in the QCFasta documentation you can see that you could use FastX or trimGalore as well. We have chosen also to trim the first 10 bases, use a sliding window to check the quality of the bases we want to trim, trim bases at the end of the read with less thant 20 phred score, we define that if at least 70% of the reads in the sample are not of good quality then we will discard the whole file from further steps (the file is not erased or lost, it is just not included in the output array). We are also allowing to have 6 inner threads, this threads are for trimmomatic and fastqc to use (on top of the threads that you have define at the beginning of your script which are Anduril specific). Finally we tell QCFasta how to distinguish between read and mate files (this depends on the naming of your fastq files).

Once you have run this, you will get a report like (some file). If you have no idea how to set the parameters above and you will like to take a look at the quality before any trimming is done you can do the following:

object qcfasta {
  val reads = INPUT(path="files/reads.csv")      
  val mates = INPUT(path="files/mates.csv")      

  val allReads = CSV2Array(in = reads.out, keys="column")
  val allMates = CSV2Array(in = mates.out, keys="column")

  val folders = NamedMap[BinaryFile]("folders")

    for ((key,value) <- iterArray(allRead.out)) {
    withName(key){

        val pre = FastQC(read    = allReads.out(key),
                     mate    = allMates.out(key),
                     clean   = false,
                     suffix  = "preStats")

        folders[key] = pre.folder
        }
    }
  //Code will be inserted here later    
}
Now you can run the previous step to see the FastQC output of each of the samples. You can take a look at each of the figures produced by FastQC and decide which parameters you will work best for your samples for the trimming steps. Once you have decided then insert the qcFasta code like the following with an extra input which are the fastqc folders, so QCFasta will skip the initial fastqc run, but still will include the plots in the web report:

  val qc = QCFasta( reads     = allReads.out,
                    mates     = allMates.out,
                    fastQCfolders = folders,
                    tool          = "trimmomatic",
                    headcrop      = 10,
                    slidingWindow = "5:20",
                    adapterSeq    = "TruSeq2-PE.fa",
                    gzip          = true,
                    minQuality    = 20,
                    trailing      = 20,
                    percent       = 70,
                    threads       = 6,
                    mateKey       = "_2",
                    readKey       = "_1")
}

Updated