Algorithms & Data structures

1. Sample, exon and position encoding

For performance reasons (see Apache Spark: Tuning Data Structures), all IDs of bases and exons are encoded using the Long type. A couple of routines are provided for encoding and decoding. Currently only the Ensembl gene annotation format is supported.

1.1 Position encoding

A position ID combines the sample ID, the chromosome ID and the position within the chromosome using the following formula:

val positionID = SparkSeqConversions.chrToLong(chrName) + sampleId * 1000000000000L + position

For instance, position (chr14,20145) of sample 89 can be encoded as follows (using the Scala REPL):

scala> val positionID = 14000000000L + 89 * 1000000000000L + 20145
positionID: Long = 89014000020145

or using built-in routines:

scala> import
scala> val positionID = SparkSeqConversions.sampleToLong(89) + SparkSeqConversions.coordinatesToId(("chr14",20145) )
positionID: Long = 89014000020145
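The arithmetic behind the formula can also be written as a stand-alone function. The following is only a minimal sketch, not SparkSeq code: the object name `PositionEncodingSketch` and its members are hypothetical, and the chromosome multiplier of 10^9 is read off the worked example above (chr14 maps to 14000000000L). It handles only numeric chromosome names such as "chr14"; how SparkSeq encodes names like chrX is not covered here.

```scala
// Hypothetical stand-alone sketch; not part of the SparkSeq API.
object PositionEncodingSketch {
  // Multipliers taken from the formula and the worked example (assumptions).
  val SampleFactor: Long = 1000000000000L // 10^12, one slot per sample
  val ChrFactor: Long = 1000000000L       // 10^9, one slot per chromosome

  // Packs sample, chromosome and position into a single Long.
  // Only numeric chromosome names ("chr1", "chr14", ...) are handled.
  def encode(sampleId: Int, chrName: String, position: Int): Long = {
    val chrNumber = chrName.stripPrefix("chr").toLong
    sampleId * SampleFactor + chrNumber * ChrFactor + position
  }
}
```

Calling `PositionEncodingSketch.encode(89, "chr14", 20145)` gives 89014000020145, the same value as in the REPL session above.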

The position can be decoded from the positionID in the following way:

scala> import
scala> SparkSeqConversions.idToCoordinates(SparkSeqConversions.stripSampleID(89014000020145L) )
res0: (String, Int) = (chr14,20145)
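Decoding amounts to integer division and remainder with the same multipliers. The sketch below illustrates this under the same assumptions as the formula in this section (10^12 per sample, and 10^9 per chromosome as implied by the chr14 example). The object `PositionDecodingSketch` is hypothetical and is not how `stripSampleID` or `idToCoordinates` are necessarily implemented.

```scala
// Hypothetical stand-alone sketch; not part of the SparkSeq API.
object PositionDecodingSketch {
  val SampleFactor: Long = 1000000000000L // 10^12 (from the formula)
  val ChrFactor: Long = 1000000000L       // 10^9 (inferred from the chr14 example)

  // Splits a position ID into (sampleId, chromosome name, position).
  // Assumes a numeric chromosome and a position below 10^9.
  def decode(positionId: Long): (Int, String, Int) = {
    val sampleId = (positionId / SampleFactor).toInt
    val withoutSample = positionId % SampleFactor // drops the sample part
    val chrNumber = withoutSample / ChrFactor
    val position = (withoutSample % ChrFactor).toInt
    (sampleId, s"chr$chrNumber", position)
  }
}
```

For the ID used above, `PositionDecodingSketch.decode(89014000020145L)` returns `(89, "chr14", 20145)`.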

1.2 Exon encoding

Ensembl's exon name format for human (currently fully supported) is 15 characters long: it starts with 'ENSE', followed by a variable number of '0' characters and the numerical exon ID, e.g. ENSE00001281628. For instance, the exon ENSE00001281628 can be encoded as follows:

scala> import
scala> val geneExonID = SparkSeqConversions.sampleToLong(89) + SparkSeqConversions.ensemblExonToLong("ENSE00001281628")
geneExonID: Long = 89128162800000

and the reverse procedure:

scala> import
scala> SparkSeqConversions.ensemblRegionIdToExonId(SparkSeqConversions.stripSampleID(89128162800000L) )
res1: String = ENSE00001281628
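The single example above suggests that the numeric part of the exon name is multiplied by 100000 before the sample part is added (1281628 * 100000 = 128162800000). The sketch below reproduces the round trip on that assumption only. The factor is inferred from one example, the object `ExonEncodingSketch` is hypothetical, and the real routines may treat the low-order digits differently.

```scala
// Hypothetical stand-alone sketch; not part of the SparkSeq API.
object ExonEncodingSketch {
  val SampleFactor: Long = 1000000000000L // 10^12, as in position encoding
  val ExonFactor: Long = 100000L          // inferred from the single example above

  // Encodes a 15-character human Ensembl exon name together with a sample ID.
  // The result is unambiguous only while the numeric exon ID stays below 10^7.
  def encode(sampleId: Int, exonName: String): Long = {
    require(exonName.startsWith("ENSE") && exonName.length == 15,
      "expected a 15-character name starting with ENSE")
    val exonNumber = exonName.stripPrefix("ENSE").toLong // leading zeros are ignored
    sampleId * SampleFactor + exonNumber * ExonFactor
  }

  // Recovers (sampleId, exon name), restoring the zero padding to 11 digits.
  def decode(exonId: Long): (Int, String) = {
    val sampleId = (exonId / SampleFactor).toInt
    val exonNumber = (exonId % SampleFactor) / ExonFactor
    (sampleId, f"ENSE$exonNumber%011d")
  }
}
```

With these assumptions, `ExonEncodingSketch.encode(89, "ENSE00001281628")` yields 89128162800000, and decoding that value returns `(89, "ENSE00001281628")`.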