- marked as enhancement
Option to parse UMI sequence and add to read header
Issue #24
new
Some protocols, especially single-cell protocols, use Unique Molecular Identifiers (UMIs) to detect duplicate molecules. It is common practice to add these to the read id (as is done by Illumina's bcl2fastq software).
Illumina's format for read id (as of bcl2fastq v 2.19) is:
@Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber
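To make the layout concrete, here is a minimal Python sketch that splits a read id of this shape into named fields. The function name `parse_read_id`, the dictionary keys, and the example header are made up for illustration; none of them come from bcl2fastq or from this tool. It assumes the UMI, when present, is the eighth colon-separated field before the space.

```python
def parse_read_id(header):
    """Split a bcl2fastq-style read id into named fields.

    Hypothetical helper for illustration only; the key names are assumptions.
    """
    # Drop the leading '@', then separate the name from the comment part.
    name, _, comment = header.lstrip("@").partition(" ")
    fields = name.split(":")
    keys = ["instrument", "run_id", "flowcell_id", "lane", "tile", "x", "y"]
    record = dict(zip(keys, fields))
    # The UMI is only present when there is an eighth colon-separated field.
    record["umi"] = fields[7] if len(fields) > 7 else None
    # The comment part is ReadNum:FilterFlag:0:IndexSequence (or SampleNumber).
    if comment:
        parts = comment.split(":", 3)
        for key, value in zip(["read_num", "filter_flag", "control", "index"], parts):
            record[key] = value
    return record


# Invented example header, not taken from real data.
example = "@M01234:55:000000000-ABCDE:1:1101:15589:1332:ACGTAC 1:N:0:TTAGGC"
fields = parse_read_id(example)
print(fields["umi"], fields["read_num"], fields["index"])
```

A header without the UMI field yields `umi` set to `None`, so the same helper can be used to check whether a UMI has already been added.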
I'm not sure of the best way to specify options for this, but perhaps provide an additional option to specify a mask for each read, similar to the --use-bases-mask parameter for bcl2fastq, that uses codes to separate index (I), umi (U), and read (N) bases. For example:
--read1_mask N* --read2_mask I6U6
This specifies that read1 is the main read for its entire length (N*) and read2 consists of six bases of index followed by six bases of UMI.
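One possible way to interpret such masks, sketched in Python. This is only an illustration of the proposal, not an existing implementation: `parse_mask`, `apply_mask`, and `add_umi_to_read_id` are hypothetical names, and the sketch assumes that `*` means "all remaining bases" and may appear only in the last segment of a mask.

```python
import re

# One mask token: a code letter (I, U, or N) followed by a length or '*'.
MASK_TOKEN = re.compile(r"([INU])(\d+|\*)")


def parse_mask(mask):
    """Turn 'I6U6' into [('I', 6), ('U', 6)]; a '*' length becomes None."""
    segments = []
    pos = 0
    for match in MASK_TOKEN.finditer(mask):
        if match.start() != pos:  # unparseable characters between tokens
            raise ValueError(f"invalid mask: {mask!r}")
        code, length = match.groups()
        segments.append((code, None if length == "*" else int(length)))
        pos = match.end()
    if not segments or pos != len(mask):
        raise ValueError(f"invalid mask: {mask!r}")
    # Assumption: '*' (rest of the read) is only valid in the final segment.
    if any(length is None for _, length in segments[:-1]):
        raise ValueError(f"'*' must be last in mask: {mask!r}")
    return segments


def apply_mask(sequence, segments):
    """Split a read sequence into index (I), UMI (U), and read (N) bases."""
    parts = {"I": "", "U": "", "N": ""}
    offset = 0
    for code, length in segments:
        end = len(sequence) if length is None else offset + length
        parts[code] += sequence[offset:end]
        offset = end
    return parts


def add_umi_to_read_id(read_id, umi):
    """Append the UMI as the last colon-separated field before the space."""
    name, sep, comment = read_id.partition(" ")
    return f"{name}:{umi}{sep}{comment}"


# Invented example: read2 is six bases of index followed by six bases of UMI.
parts = apply_mask("ACGTACTTGGCA", parse_mask("I6U6"))
print(add_umi_to_read_id("@M1:1:FC:1:1101:100:200 2:N:0:ACGTAC", parts["U"]))
```

Keeping mask parsing separate from mask application means an invalid mask can be rejected once at startup rather than on every read.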
Comments (1)