Improve performance of observedMutations

Issue #111 on hold
Jason Vander Heiden created an issue

Replace rbind with dplyr::bind_rows in observedMutations. Also, probably replace s2c and c2s with the equivalent stringi functions, and maybe replace cbind with dplyr::bind_cols.

Profile.

Comments (9)

  1. Roy Jiang

    Will hand to Jason but can take over if changes are not affecting speed. Curious about the results...

  2. Jason Vander Heiden reporter

    I swapped out the rbind and it didn't help. Turns out it was a matrix rbind and not a data.frame rbind, so there's that.

    Looks like we need to search elsewhere for the problem. Probably in calcObservedMutations and associated helpers.
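
    For reference, a minimal sketch of the usual remedy when a matrix is grown with rbind inside a loop: collect the rows in a list and bind once at the end. dplyr::bind_rows works on data frames, so it does not apply to this case. The helper count_one and the loop below are hypothetical stand-ins, not the actual shazam code.

    ```r
    # Hypothetical stand-in for a per-sequence mutation count (not shazam code).
    count_one <- function(i) c(R = i %% 3L, S = i %% 2L)

    n <- 1000L

    # Growing a matrix with rbind copies the accumulated matrix on every iteration.
    grown <- NULL
    for (i in seq_len(n)) grown <- rbind(grown, count_one(i))

    # Alternative: collect the rows in a list, then bind them in a single call.
    rows <- lapply(seq_len(n), count_one)
    bound <- do.call(rbind, rows)

    # Both approaches produce the same values.
    stopifnot(all(grown == bound))
    ```

    Whether this matters for observedMutations depends on how many rows are bound per call, which is something profiling would have to confirm.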

  3. Jason Vander Heiden reporter

    Some ideas:

    1) translateCodonToAminoAcid doesn't need to be a function. It's just extra overhead:

    > system.time(replicate(100000, shazam:::translateCodonToAminoAcid("TGA")))
       user  system elapsed 
      1.087   0.001   1.093 
    > system.time(replicate(100000, AMINO_ACIDS["TGA"]))
       user  system elapsed 
      0.167   0.000   0.168 
    

    2) mutationType is a private function, so we don't need to use match.arg to validate its arguments.
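
    A small sketch of idea 1, assuming the translation really is a plain named-vector lookup as the timing above suggests. CODON_TABLE is a hypothetical four-codon subset of the standard genetic code used only for illustration; it is not shazam's AMINO_ACIDS object.

    ```r
    # Tiny illustrative subset of the standard genetic code.
    # CODON_TABLE is a hypothetical name, not shazam's AMINO_ACIDS object.
    CODON_TABLE <- c(ATG = "M", TTT = "F", GGG = "G", TGA = "*")

    # Wrapper function: one function call per codon.
    translate_one <- function(codon) CODON_TABLE[[codon]]

    codons <- c("ATG", "TGA", "GGG", "TTT")

    # Per-codon calls through the wrapper.
    via_function <- vapply(codons, translate_one, character(1), USE.NAMES = FALSE)

    # Direct lookup: a single vectorized subsetting operation for all codons.
    via_lookup <- unname(CODON_TABLE[codons])

    stopifnot(identical(via_function, via_lookup))
    ```

    The direct lookup also vectorizes over many codons at once, which removes the per-call overhead entirely rather than just reducing it.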

  4. Roy Jiang

    Ok, are you assigning this back to me? Any changes I would make fall outside the scope of this issue as it was defined, so maybe we should close it.

    I would:

    1. Staying within the same function, separate the code into (A) a preprocessing part that formats the input and (B) a calculation part. I think this will make maintenance easier. This is simply reformatting the code.
    2. Optimize the calculation by using lists (like above).
    3. Profile to assess significance.

    And, depending on the extent of rewriting permitted:

    4. Use a pre-calculated R/S codon matrix, i.e. given AGT and CNC, how many R and S SHM may be involved, stored as a 125 (5^3) x 125 table in memory. This would avoid the need to translate on the fly.
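
    A rough sketch of what the pre-calculated R/S table could look like. The counting rule here is an assumption, not necessarily how shazam classifies mutations: each differing position is applied to the germline codon on its own and compared with the germline amino acid, and any pair containing N is left as NA. The names count_rs, r_table and s_table are hypothetical.

    ```r
    # Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
    bases <- c("T", "C", "A", "G")
    grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
    aa_string <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    codon_aa <- strsplit(aa_string, "")[[1]]
    names(codon_aa) <- paste0(grid$b1, grid$b2, grid$b3)

    # All 5^3 = 125 codons over the alphabet A, C, G, T, N.
    alphabet <- c("A", "C", "G", "T", "N")
    grid5 <- expand.grid(b3 = alphabet, b2 = alphabet, b1 = alphabet,
                         stringsAsFactors = FALSE)
    all_codons <- paste0(grid5$b1, grid5$b2, grid5$b3)

    # Count replacement (R) and silent (S) changes between two codons.
    # Assumption: codons containing N cannot be translated, so the result is NA.
    count_rs <- function(germ, obs) {
      if (grepl("N", germ, fixed = TRUE) || grepl("N", obs, fixed = TRUE)) {
        return(c(R = NA_integer_, S = NA_integer_))
      }
      g <- strsplit(germ, "")[[1]]
      o <- strsplit(obs, "")[[1]]
      counts <- c(R = 0L, S = 0L)
      for (i in which(g != o)) {
        # Apply only this one point mutation to the germline codon.
        single <- g
        single[i] <- o[i]
        mutated <- paste(single, collapse = "")
        type <- if (codon_aa[[mutated]] == codon_aa[[germ]]) "S" else "R"
        counts[type] <- counts[type] + 1L
      }
      counts
    }

    # Fill the two 125 x 125 tables once; rows are germline, columns are observed.
    r_table <- matrix(NA_integer_, 125, 125, dimnames = list(all_codons, all_codons))
    s_table <- r_table
    for (germ in all_codons) {
      for (obs in all_codons) {
        rs <- count_rs(germ, obs)
        r_table[germ, obs] <- rs[["R"]]
        s_table[germ, obs] <- rs[["S"]]
      }
    }

    # Lookup at run time: no translation needed.
    r_table["AGT", "CGC"]
    s_table["AGT", "CGC"]
    ```

    Once built, the tables could be stored with the package so that each codon pair costs a single matrix lookup. How N should really be handled would need to follow whatever calcObservedMutations does today.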

  5. Jason Vander Heiden reporter

    We don't need to open a new issue. Same problem.

    Best thing at this point is probably to determine how much we care by benchmarking the time it takes on a typical data set. If it's a big deal, we'll see who wants the task. If it's not urgent, then we'll just leave the issue unassigned and active for a later date.
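
    A sketch of what such a benchmark could look like with system.time. Both typical_data and slow_step are hypothetical stand-ins: in practice the data would be a real repertoire and the timed call would be observedMutations itself.

    ```r
    # Hypothetical stand-ins: `typical_data` mimics a set of aligned sequence
    # pairs and `slow_step` stands in for the function being timed.
    set.seed(1)
    random_seq <- function(len) {
      paste(sample(c("A", "C", "G", "T"), len, replace = TRUE), collapse = "")
    }
    typical_data <- data.frame(
      observed = replicate(200, random_seq(300)),
      germline = replicate(200, random_seq(300)),
      stringsAsFactors = FALSE
    )

    # Count mismatching positions between an observed and a germline sequence.
    slow_step <- function(obs, germ) {
      sum(strsplit(obs, "")[[1]] != strsplit(germ, "")[[1]])
    }

    # Time one full pass over the data set.
    timing <- system.time(
      mismatches <- mapply(slow_step, typical_data$observed, typical_data$germline,
                           USE.NAMES = FALSE)
    )
    print(timing[["elapsed"]])
    ```

    Running the same timing before and after a candidate change on the same data gives a direct measure of whether the change is worth pursuing.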
