Option to parse UMI sequence and add to read header

Some protocols, especially single-cell protocols, use Unique Molecular Identifiers (UMIs) to detect duplicate molecules. It is common practice to add these to the read ID (as is done by Illumina's bcl2fastq software).

Illumina's format for the read ID (as of bcl2fastq v2.19) is:

@Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber
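To make the layout concrete, here is a minimal Python sketch that pulls the UMI back out of a read ID in this format. The function name `parse_read_id`, the dictionary keys, and the example header are made up for illustration; the sketch assumes the UMI, when present, is the eighth colon-separated field before the first space.

```python
def parse_read_id(header):
    """Split a FASTQ read ID of the form shown above into named fields.

    Hypothetical helper: the key names are illustrative, not part of any tool.
    """
    if header.startswith("@"):
        header = header[1:]
    # Everything before the first space is the read name; the rest is the comment.
    name, _, comment = header.partition(" ")
    fields = name.split(":")
    keys = ["instrument", "run_id", "flowcell_id", "lane", "tile", "x", "y"]
    record = dict(zip(keys, fields))
    # The UMI is optional: it is only present as an eighth field.
    record["umi"] = fields[7] if len(fields) > 7 else None
    record["comment"] = comment
    return record


example = "@M1:1:FC:1:1101:1000:2000:TTGCAA 1:N:0:ACGTAC"  # invented header
print(parse_read_id(example)["umi"])
```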

I'm not sure of the best way to specify options for this, but perhaps an additional option could specify a mask for each read, similar to the --use-bases-mask parameter of bcl2fastq, using codes to mark index (I), UMI (U), and read (N) bases. For example:

--read1_mask N* --read2_mask I6U6

This specifies that read1 is the main read for its entire length (N*) and read2 consists of six bases of index followed by six bases of UMI.
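A rough Python sketch of how such a mask could be interpreted, in case it helps the discussion. Everything here is an assumption about the proposed option rather than existing behaviour: the function names `parse_mask`, `split_read`, and `add_umi_to_header` are hypothetical, `*` is taken to mean "all remaining bases", and bases not covered by the mask are simply dropped.

```python
import re

MASK_RE = re.compile(r"(?:[INU](?:\d+|\*))+")


def parse_mask(mask, read_length):
    """Expand a mask such as 'I6U6' or 'N*' into one code per base."""
    if not MASK_RE.fullmatch(mask):
        raise ValueError(f"invalid mask: {mask!r}")
    codes = []
    for code, count in re.findall(r"([INU])(\d+|\*)", mask):
        # '*' consumes whatever is left of the read (assumed semantics).
        n = read_length - len(codes) if count == "*" else int(count)
        codes.extend(code * n)
    return codes[:read_length]


def split_read(seq, mask):
    """Return the index (I), UMI (U), and read (N) bases of seq."""
    parts = {"I": "", "U": "", "N": ""}
    # zip stops at the shorter input, so uncovered bases are ignored.
    for base, code in zip(seq, parse_mask(mask, len(seq))):
        parts[code] += base
    return parts


def add_umi_to_header(header, umi):
    """Append the UMI to the read name, before the first space."""
    name, sep, comment = header.partition(" ")
    return f"{name}:{umi}{sep}{comment}"


read2 = split_read("ACGTACTTGCAA", "I6U6")  # invented sequence
header = "@M1:1:FC:1:1101:1000:2000 1:N:0:ACGTAC"  # invented header
print(add_umi_to_header(header, read2["U"]))
```

Expanding the mask to one code per base keeps the splitting step trivial and would also allow interleaved masks such as I3U3I3, at the cost of a per-base loop; slicing by segment would be faster if that matters.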
