getClosestBy1Mer is broken when model=aa

Issue #57 resolved
Jason Vander Heiden created an issue

The final step of getClosestBy1Mer() is broken when model="aa", due to trying to match translated sequences against nucleotide sequences:

This doesn't work:

arrJunctionsDist <- arrUniqueJunctionsDist[match(names(arrJunctionsDist), 
                                                 names(arrUniqueJunctionsDist))]

Because:

> names(arrJunctionsDist)
[1] "ACGTACGTACGT" "ACGAACGTACGT" "ACGAACGTATGT" "ACGAACGTATGC" "ACGAACGTATCC" "AAAAAAAAAAAA"
> names(arrUniqueJunctionsDist)
[1] "TYVR" "TNVR" "TNVC" "TNVC" "TNVS" "KKKK"

Thus, all distances returned are NA.

Comments (10)

  1. Namita Gupta

    Seems like this has an easy fix of making sure the names of the vectors remain nucleotide, no?

  2. Jason Vander Heiden reporter

    Yeah, should be. I'm leaving now and needed the reminder. In case I feel like fixing it Sunday.

  3. Jason Vander Heiden reporter

    I think this is fixed now. I made a lot of changes to getClosestBy1Mer(), including a lot of readability refactoring.

    @namita1025 and @ssnn - It'd be great if you could both look over it to make sure it's behaving how it should. I'm not really familiar with the code, so I don't know if I missed something here.

  4. Jason Vander Heiden reporter

    The following unit test failures occurred:

    1. Failure: Test cross distToNearest with model hs1f (@test.distToNearest.R#18) 
    dist_hs1f$DIST_NEAREST[test_idx] not equal to c(...).
    6/17 mismatches (average diff: 0.346)
    [12] 0.607 - 0.070 ==  0.5371
    [13] 0.615 - 0.196 ==  0.4190
    [14] 0.196 - 0.607 == -0.4116
    [15] 0.597 - 0.615 == -0.0171
    [16]    NA - 0.196 ==      NA
    [17]    NA - 0.597 ==      NA
    
    
    2. Failure: Test cross distToNearest with model hs1f (@test.distToNearest.R#47) 
    cross_dist_hs1f$CROSS_DIST_NEAREST[1:nrow(dist_hs1f)] not equal to dist_hs1f$DIST_NEAREST.
    172/315 mismatches (average diff: 0.158)
    [34] 0.528 - 0.437 ==  0.0905
    [35] 0.528 - 0.450 ==  0.0780
    [37] 0.437 - 0.520 == -0.0833
    [38] 0.450 -    NA ==      NA
    [39] 0.528 -    NA ==      NA
    [40] 0.528 -    NA ==      NA
    [41] 0.520 -    NA ==      NA
    [46] 0.679 - 0.476 ==  0.2031
    [47] 0.476 - 0.500 == -0.0239
    

    I'm not sure if these are correct or not, and I won't have the time to check today. It could be that the test are wrong, because I switched from nchar() to stringi::stri_length because nchar(NA) returns 2, or it could be due to gap/N handling in alakazam. Not sure. A quick check with 4 sequences looks correct to me.

    (BTW - Thanks a bunch @ssnn for writing the tests.)

  5. Namita Gupta

    Fixed it, there was a bug in the rewrite. The original vector of junctions was being given the names of the vector of UNIQUE junctions, which means some names were NA, which led to a bug in the mapping unique distances back to input vector. Test runs smoothly now, though I guess there is still no test for the amino acid model...

  6. Jason Vander Heiden reporter

    That's great news! I was worried the bug was in the original code and someone would have to remake all the figures for their paper. :)

  7. Log in to comment