- edited description
getSegment doesn't work for C-regions
The allele, gene and family regexs don’t behave as expected for C region genes (IGHD specifically):
> countGenes(ExampleDb, gene="c_call", mode="allele")
# A tibble: 4 x 3
gene seq_count seq_freq
<chr> <int> <dbl>
1 IGHM 719 0.360
2 IGHG 650 0.325
3 IGHA 372 0.186
4 IGH 259 0.130
> countGenes(ExampleDb, gene="c_call", mode="asis")
# A tibble: 4 x 3
gene seq_count seq_freq
<chr> <int> <dbl>
1 IGHM 719 0.360
2 IGHG 650 0.325
3 IGHA 372 0.186
4 IGHD 259 0.130
This is probably fixable by changing (IG[HLK]|TR[ABGD])[VDJ]
to (IG[HKL][VDJADGEMC]|TR[ABGD])[VDJC]
. This should also work at the gene/allele level; eg, “IGHA1*01”, “IGHG2”, “IGKC”, "IGLC1", "TRBC2", etc.
See also #86.
CC: @Scott Christley
Comments (8)
-
reporter -
reporter -
assigned issue to
-
assigned issue to
-
-
assigned issue to
-
assigned issue to
-
- changed status to open
-
- changed milestone to next release
-
So it seems like in addition to needing to updating the regex, the
strip_d
function ingetSegment
(whichcountGenes
indirectly calls) being set to TRUE meant that the IGHD's weren't getting counted properly since they were automatically getting converted to IGH's. -
We decided to update
strip_d
instead of alteringcountGenes
'strip_d
call so that it won't strip D's when they refer to isotypes instead of duplicates (see this commit). -
- changed status to resolved
The updated regex and strip_d function seems to have done the trick
- Log in to comment