Algorithms & Data structures
1. Sample, exon and position encoding
For performance reasons (see Apache Spark Tuning Data Structures), all base and exon IDs are encoded as values of type Long. A couple of routines for encoding and decoding are provided. Currently only the Ensembl gene annotation format is supported.
1.1 Position encoding
The position ID combines the sample ID, the chromosome ID and the position within the chromosome using the following formula:
val positionID = SparkSeqConversions.chrToLong(chrName) + sampleId * 1000000000000L + position
For instance, position (chr14, 20145) of sample 89 can be encoded as follows (using Scala's REPL):
scala> val positionID = 14000000000L + 89 * 1000000000000L + 20145
positionID: Long = 89014000020145
or using built-in routines:
scala> import pl.elka.pw.sparkseq.conversions._
scala> val positionID = SparkSeqConversions.sampleToLong(89) + SparkSeqConversions.coordinatesToId(("chr14", 20145))
positionID: Long = 89014000020145
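The arithmetic behind the formula can also be sketched without the library. This is a minimal illustration, not SparkSeq's implementation: the object and method names are hypothetical, the multipliers are taken from the formula and the worked example above, and only numeric chromosome names such as "chr14" are handled (how SparkSeqConversions.chrToLong treats other names is not covered here).

```scala
// Hypothetical helper, not part of SparkSeq: mirrors the arithmetic shown above.
object PositionEncodingSketch {
  // Multipliers inferred from the formula and the worked example (assumptions).
  val SampleFactor: Long = 1000000000000L // 10^12, one "slot" per sample
  val ChrFactor: Long = 1000000000L       // 10^9, one "slot" per chromosome

  // Handles numeric chromosome names such as "chr14" only.
  def encodePosition(sampleId: Int, chrName: String, position: Int): Long = {
    val chrNumber = chrName.stripPrefix("chr").toLong
    sampleId * SampleFactor + chrNumber * ChrFactor + position
  }
}
```

Calling PositionEncodingSketch.encodePosition(89, "chr14", 20145) gives 89014000020145, matching the REPL results above. Because each field occupies its own range of decimal digits, the fields can later be recovered with integer division and modulo.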
Decoding the position from the positionID can be done in the following way:
scala> import pl.elka.pw.sparkseq.conversions._
scala> SparkSeqConversions.idToCoordinates(SparkSeqConversions.stripSampleID(89014000020145L))
res0: (String, Int) = (chr14,20145)
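Decoding can be illustrated the same way with plain integer arithmetic. The sketch below is an assumption-based illustration rather than the library's code: the names are hypothetical, the multipliers come from the formula in this section, and only numeric chromosome IDs are reconstructed.

```scala
// Hypothetical helper, not part of SparkSeq: reverses the position formula.
object PositionDecodingSketch {
  // Multipliers inferred from the formula in this section (assumptions).
  val SampleFactor: Long = 1000000000000L // 10^12
  val ChrFactor: Long = 1000000000L       // 10^9

  // Returns (sampleId, chromosome name, position within the chromosome).
  def decodePosition(positionId: Long): (Int, String, Int) = {
    val sampleId = (positionId / SampleFactor).toInt
    val withoutSample = positionId % SampleFactor // same idea as stripping the sample ID
    val chrNumber = withoutSample / ChrFactor
    val position = (withoutSample % ChrFactor).toInt
    (sampleId, s"chr$chrNumber", position)
  }
}
```

For the ID used above, PositionDecodingSketch.decodePosition(89014000020145L) yields (89, "chr14", 20145).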
1.2 Exon encoding
Ensembl's exon name format for human (currently fully supported) is 15 characters long: it starts with 'ENSE', followed by a variable number of '0' characters and the numerical exon ID, e.g. ENSE00001281628. For instance, exon ENSE00001281628 can be encoded as follows:
scala> import pl.elka.pw.sparkseq.conversions._
scala> val geneExonID = SparkSeqConversions.sampleToLong(89) + SparkSeqConversions.ensemblExonToLong("ENSE00001281628")
geneExonID: Long = 89128162800000
and the reverse procedure:
scala> import pl.elka.pw.sparkseq.conversions._
scala> SparkSeqConversions.ensemblRegionIdToExonId(SparkSeqConversions.stripSampleID(89128162800000L))
res1: String = ENSE00001281628
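The exon round trip can be sketched with plain arithmetic as well. Everything here is an assumption drawn from the single example above: the numeric part of the exon name appears to be multiplied by 100000 before the sample offset is added, and the names ExonEncodingSketch, encodeExon and decodeExon are hypothetical. It is not the library's implementation.

```scala
// Hypothetical helper, not part of SparkSeq: reproduces the example's arithmetic.
object ExonEncodingSketch {
  val SampleFactor: Long = 1000000000000L // 10^12, as in position encoding
  val ExonFactor: Long = 100000L          // 10^5, inferred from one example (assumption)

  def encodeExon(sampleId: Int, exonName: String): Long = {
    require(exonName.length == 15 && exonName.startsWith("ENSE"),
      s"unexpected exon name: $exonName")
    // "00001281628".toLong drops the leading zeros.
    val exonNumber = exonName.stripPrefix("ENSE").toLong
    sampleId * SampleFactor + exonNumber * ExonFactor
  }

  // Returns (sampleId, exon name padded back to 15 characters).
  def decodeExon(exonId: Long): (Int, String) = {
    val sampleId = (exonId / SampleFactor).toInt
    val exonNumber = (exonId % SampleFactor) / ExonFactor
    (sampleId, f"ENSE$exonNumber%011d")
  }
}
```

With these assumed multipliers, encodeExon(89, "ENSE00001281628") gives 89128162800000 and decodeExon(89128162800000L) gives (89, "ENSE00001281628"). Note that in this sketch an exon number of 10000000 or more would spill into the digits reserved for the sample ID.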