getSegment doesn't work for C-regions

Issue #89 resolved
Jason Vander Heiden created an issue

The allele, gene and family regexs don’t behave as expected for C region genes (IGHD specifically):

> countGenes(ExampleDb, gene="c_call", mode="allele")
# A tibble: 4 x 3
  gene  seq_count seq_freq
  <chr>     <int>    <dbl>
1 IGHM        719    0.360
2 IGHG        650    0.325
3 IGHA        372    0.186
4 IGH         259    0.130

> countGenes(ExampleDb, gene="c_call", mode="asis")
# A tibble: 4 x 3
  gene  seq_count seq_freq
  <chr>     <int>    <dbl>
1 IGHM        719    0.360
2 IGHG        650    0.325
3 IGHA        372    0.186
4 IGHD        259    0.130

This is probably fixable by changing (IG[HLK]|TR[ABGD])[VDJ] to (IG[HKL][VDJADGEMC]|TR[ABGD])[VDJC]. This should also work at the gene/allele level; eg, “IGHA1*01”, “IGHG2”, “IGKC”, "IGLC1", "TRBC2", etc.

See also #86.

CC: @Scott Christley

Comments (8)

  1. Edel Aron

    So it seems like in addition to needing to updating the regex, the strip_d function in getSegment (which countGenes indirectly calls) being set to TRUE meant that the IGHD's weren't getting counted properly since they were automatically getting converted to IGH's.

  2. Edel Aron

    We decided to update strip_d instead of altering countGenes' strip_d call so that it won't strip D's when they refer to isotypes instead of duplicates (see this commit).

  3. Log in to comment