
Examples

All the examples shown here can be run in the Apache Spark REPL.

Single-sample analyses

1. Calculate how many nucleotides in your BAM file have coverage greater than or equal to 10:

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
val seqAnalysis = new SparkSeqAnalysis(sc,"pathToYourBAM",1,1,1)
seqAnalysis.getCoverageBase().filter(p=>(p._2>=10)).count()
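The filter-and-count pattern above can be previewed on a plain Scala collection before running it against a full BAM file (a local sketch; the (key, coverage) pairs stand in for getCoverageBase()'s output, and the values are made up):

```scala
// Local stand-in for getCoverageBase(): (positionKey, coverage) pairs.
// The values are illustrative only.
val cov = Seq((1L, 12), (2L, 3), (3L, 10), (4L, 9))

// Same predicate as the RDD filter above: keep positions with coverage >= 10.
val atLeast10 = cov.count { case (_, c) => c >= 10 }
```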

2. Compute coverage for all genomic regions from a given BED file and sort them by coverage in descending order:

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
import pl.elka.pw.sparkseq.conversions.SparkSeqConversions
val genExonsMapB = sc.broadcast(SparkSeqConversions.BEDFileToHashMap(sc,"pathToYourBED"))
val seqAnalysis = new SparkSeqAnalysis(sc,"pathToYourBAM",1,1,1)
val covRegion = seqAnalysis.getCoverageRegion(genExonsMapB).map(t=>(t._2,t._1)).sortByKey(false)

3. Compute coverage for all bases and find the 10 positions with the highest coverage:

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
val seqAnalysis = new SparkSeqAnalysis(sc,"pathToYourBAM",1,1,1)
val topCov = seqAnalysis.getCoverageBase().map(t=>(t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1)).take(10)
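The swap-sort-swap idiom above orders (key, value) pairs by value. The same idea can be sketched on a local collection with sortBy and a negated key (made-up coverage values standing in for the RDD):

```scala
// (positionKey, coverage) pairs standing in for getCoverageBase()'s output
val cov = Seq((101L, 5), (102L, 42), (103L, 17), (104L, 8), (105L, 21))

// Sort descending by coverage and keep the top 3,
// without swapping the pair elements back and forth.
val top3 = cov.sortBy { case (_, c) => -c }.take(3)
```

On an RDD, Spark's built-in top(n)(Ordering.by(_._2)) gives the same result without a full sort.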

You can, of course, also use the Picard API directly to manipulate your samples:

4. Calculate the number of short reads in a BAM file that are mapped, are not duplicates, and have a mapping quality greater than 19:

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
val seqAnalysis = new SparkSeqAnalysis(sc,"pathToYourBAM",1,1,1)
seqAnalysis.bamFile.map(r=>r._2)
     .filter(r=> !r.get.getReadUnmappedFlag && !r.get.getDuplicateReadFlag && r.get.getMappingQuality>19 )
     .count()

Multiple-sample analyses

1. Create a SparkSeqAnalysis object for 3 samples:

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
// per-sample normalisation factors (library-size ratios)
val normArray = Array(1.0, 8622606.0/19357579.0, 8622606.0/14087644.0)
val caseId = Array(2, 3)
val seqAnalysisCase = new SparkSeqAnalysis(sc, pathFam1+"/Case/Sample_1_sort.bam", 1, normArray(0), 8)
var id = 1
for(i <- caseId){
   seqAnalysisCase.addBAM(sc, pathFam1+"/Case/Sample_"+i.toString+"*_sort.bam", i, normArray(id))
   id += 1
}
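The mutable id counter can be avoided by zipping each case sample ID with its normalisation factor; a local sketch of that pairing (same arrays as above):

```scala
val normArray = Array(1.0, 8622606.0/19357579.0, 8622606.0/14087644.0)
val caseId = Array(2, 3)

// Pair sample ID 2 with normArray(1), sample ID 3 with normArray(2):
val pairs = caseId.zip(normArray.drop(1))
```

Each (id, norm) pair can then drive one addBAM call, replacing the counter.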

2. Once you have created a SparkSeqAnalysis object for multiple samples, you can, for instance, compute base-level coverage for all of them and obtain count vectors for each base across samples:

import pl.elka.pw.sparkseq.conversions._
val covCase = seqAnalysisCase.getCoverageBase().map(r=>(SparkSeqConversions.stripSampleID(r._1),r._2)).groupByKey()

The stripSampleID operation is needed because SparkSeq encodes the position and sample identifier as a single Long value for performance reasons. See Apache Spark's Tuning guide (Data Structures) and SparkSeq's algorithms & data structures pages.
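Such an encoding could be sketched as bit-packing (a hypothetical scheme for illustration only; SparkSeq's actual layout is described in its algorithms & data structures page and may differ):

```scala
// Hypothetical packing: high bits carry the sample ID, low 40 bits the position.
// Illustration only -- not SparkSeq's actual encoding.
val PosBits = 40
val PosMask = (1L << PosBits) - 1

def packKey(sampleId: Int, position: Long): Long =
  (sampleId.toLong << PosBits) | (position & PosMask)

// Analogue of stripSampleID: drop the sample part, keep the position.
def stripSample(key: Long): Long = key & PosMask

def sampleOf(key: Long): Int = (key >>> PosBits).toInt
```

Keeping both fields in one primitive Long avoids allocating a tuple object per base, which matters at genome scale.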

Caching

If you would like to keep the results of your analyses in memory to speed up subsequent queries, you can use the cache() method (see the Apache Spark Programming Guide for details):

import pl.elka.pw.sparkseq.seqAnalysis.SparkSeqAnalysis
val seqAnalysis = new SparkSeqAnalysis(sc,"pathToYourBAM",1,1,1)
val covCache = seqAnalysis.getCoverageBase().cache()
covCache.count() //first action populates the in-memory cache
covCache.count() //run it again to see the performance boost
