Missing gene usage values for genes with "D" in the name

Issue #107 closed
Oscar Rodriguez created an issue

Hello,

I am using the function countGenes to calculate IGHV gene usage. I've noticed that I don't get any gene usage values for genes that have the character "D" in the name. For example, I don't get any gene usage values for IGHV3-64D. Is this the intention, and is there a way to turn this off?

I believe this is coming from the function getSegment in Genes.R. Lines 272 to 275 seem to remove the "D" if strip_d is set to TRUE, and strip_d is set to TRUE in the function getGene (lines 302 to 308).

Thanks!

Oscar

Comments (4)

  1. Oscar Rodriguez reporter

    If it's useful, this is the code that I am using:

    gene <- countGenes(changoTableMaster, gene="v_call", groups=c("sample","cprimer"), clone="clone_id", mode="gene")
    

    I also checked the Change-O table (changoTableMaster) to make sure I have assignments to genes with "D", e.g.:

    $ cat changeoTableMasterClonePatient_expandedStatus.tsv | grep IGHV3-64D | cut -f5 | sort | uniq -c | sort -k1,1nr
       7452 IGHV3-64D*06
       7372 IGHV3-64D*09
       2206 IGHV3-64D*08
       1021 IGHV3-64D*06,IGHV3-64D*08
        165 IGHV3-64D*06,IGHV3-64D*09
        101 IGHV3-64*03,IGHV3-64D*09
         83 IGHV3-64D*08,IGHV3-64D*09
    

  2. ssnn

    Hi Oscar,

    Yes, strip_d removes the D that signals a duplicate gene. In the current implementation of countGenes, it is not possible to set strip_d to FALSE. If you want to count the duplicate genes separately, you could first create the gene names with getGene and strip_d=FALSE, then run countGenes on that column with mode="asis".

    > db <- data.frame(
    +     list(v_call=c("IGHV3-64D*06","IGHV3-64*06"))
    + )
    > countGenes(db, gene="v_call")
    # A tibble: 1 × 3
      gene     seq_count seq_freq
      <chr>        <int>    <dbl>
    1 IGHV3-64         2        1
    > db[["v_gene"]] <- getGene(db[["v_call"]], strip_d = FALSE)
    > countGenes(db, gene="v_gene", mode="asis")
    # A tibble: 2 × 3
      gene      seq_count seq_freq
      <chr>         <int>    <dbl>
    1 IGHV3-64          1      0.5
    2 IGHV3-64D         1      0.5
    
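Applied to the call from the original report, the same two-step workaround might look like the sketch below. It assumes the table has sample, cprimer and clone_id columns, as in the reporter's countGenes call. The small data frame is a made-up stand-in for changoTableMaster, and v_gene is a hypothetical column name chosen for illustration.

```r
library(alakazam)

# Toy stand-in for changoTableMaster; all values below are made up for illustration.
db <- data.frame(
    v_call=c("IGHV3-64D*06", "IGHV3-64D*06", "IGHV3-64*03"),
    sample=c("S1", "S1", "S1"),
    cprimer=c("IGHM", "IGHM", "IGHM"),
    clone_id=c("1", "1", "2"),
    stringsAsFactors=FALSE
)

# Step 1: derive gene-level names while keeping the "D"
# (v_gene is a hypothetical column name).
db[["v_gene"]] <- getGene(db[["v_call"]], strip_d=FALSE)

# Step 2: count the precomputed names as-is, reusing the grouping
# and clone arguments from the original call.
gene <- countGenes(db, gene="v_gene", groups=c("sample", "cprimer"),
                   clone="clone_id", mode="asis")
```

With mode="asis", countGenes does not re-parse the gene names, so IGHV3-64 and IGHV3-64D should be tallied as separate entries within each sample and cprimer group.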
