calcObservedMutations isn't handling germline gaps correctly

Issue #138 resolved
Jason Vander Heiden created an issue

These should all work, but calcObservedMutations fails when there are gap characters in the germline.

> calcObservedMutations("ATGCTT", "ATGCCC")
seq_r seq_s 
    1     1 
> calcObservedMutations("ATG-TT", "ATGCCC")
seq_r seq_s 
    1     1 
> calcObservedMutations("ATGGTT", "ATG-CC")
 Error in calcObservedMutations("ATGGTT", "ATG-CC") : 
  ncol(mutations_array_raw) == length(mutations_pos) is not TRUE 
> calcObservedMutations("ATG-TT", "ATG-CC")
 Error in calcObservedMutations("ATG-TT", "ATG-CC") : 
  ncol(mutations_array_raw) == length(mutations_pos) is not TRUE 

Comments (4)

  1. Jason Vander Heiden reporter

    I suspect this is the same bug, so I’m going to put this into this ticket:

    These clonal consensus sequences, output by collapseClones with breakTiesStochastic=TRUE, produce the same error:

      SEQ> GATATCGAGATGACCCAGTCTCCACTTTCCCTGCCCGCCTCTGCTGGGGAGAGGGTCACCATCTCCTGCCGGGCGAGCCAGAGTATCCAG...GGA....A.G..AATAACTATTTAACTTGGTACGTACAGATGCCAGGCCAAGCTCCACGACTCTTGATCCATGTGGGC.....................TCTAGTCGGGCCGCAGGGGTCCCG...TCCAGGTTTAGTGGCGGTGGA......TCTGGCACAGTTTTCACCCTGGTCATCGGTGGACTGGAGACTGTTGATGCTGGCATTTATTTCTGT 
     GERM> GACATCCTGATGACCCAGTCTCCACTCTCCCTGTCTGTCTCCCTTGGAGACACAGTCTCCATCACCTGCCGGGCAAGTCAGAGCATTCT.CATA.....A..GG.AGCAGCTATTTGGATTGGTACCTGCAGAAACCAGGGAAGTCTCCTCAGCTCCTGATCTATTCTGCT.....................TCCAGTTGGGCCTGCGGGGTCCCA...GCAAGGTTCAGTGGCAGTGGA......TCTGGCACAGATTTCACTCTCACAATCAGCAGAGTGGAAGCTGAAGATTTTGGAATTTACTACTGT 
    

    This weird consensus bit starting at position 90 looks awfully strange, but it alone doesn’t produce the error:

     SEQ> G...GGA....A.G..
    GERM> .CATA.....A..GG.
    

    Test code:

    x <- "GATATCGAGATGACCCAGTCTCCACTTTCCCTGCCCGCCTCTGCTGGGGAGAGGGTCACCATCTCCTGCCGGGCGAGCCAGAGTATCCAG...GGA....A.G..AATAACTATTTAACTTGGTACGTACAGATGCCAGGCCAAGCTCCACGACTCTTGATCCATGTGGGC.....................TCTAGTCGGGCCGCAGGGGTCCCG...TCCAGGTTTAGTGGCGGTGGA......TCTGGCACAGTTTTCACCCTGGTCATCGGTGGACTGGAGACTGTTGATGCTGGCATTTATTTCTGT"
    y <- "GACATCCTGATGACCCAGTCTCCACTCTCCCTGTCTGTCTCCCTTGGAGACACAGTCTCCATCACCTGCCGGGCAAGTCAGAGCATTCT.CATA.....A..GG.AGCAGCTATTTGGATTGGTACCTGCAGAAACCAGGGAAGTCTCCTCAGCTCCTGATCTATTCTGCT.....................TCCAGTTGGGCCTGCGGGGTCCCA...GCAAGGTTCAGTGGCAGTGGA......TCTGGCACAGATTTCACTCTCACAATCAGCAGAGTGGAAGCTGAAGATTTTGGAATTTACTACTGT"
    calcObservedMutations(x, y, regionDefinition=IMGT_V)
    
    Error in calcObservedMutations(x, y, regionDefinition = IMGT_V):
      ncol(mutations_array_raw) == length(mutations_pos) is not TRUE
    

  2. Julian Zhou

    Around Line 1840 in mutationProfiling.R, a character vector in which each entry is a single string, such as:

    c_germlineSeq_codons
    [1] "-" "C" "C" "-" "C" "C"
    

    needs to be converted into a character vector in which each entry is a triplet string, such as:

    > c_germlineSeq_codons
    [1] "-CC" "-CC"
    

    Previously, the code was

    c_germlineSeq_codons <- strsplit(gsub("([[:alnum:]]{3})", "\\1 ", c2s(c_germlineSeq_codons)), " ")[[1]]
    

    However, [[:alnum:]]{3} failed to capture non-ATGC characters such as . and -, such as the gap in -CC.

    New regex in [commit 2c7e782](https://bitbucket.org/kleinstein/shazam/commits/2c7e782ddf32d5e9bf5f49120740aae1d20cbedd) fixed this bug.

    Good catch!

  3. Log in to comment