Check performance of DefineClones with `--act set` in current version

Issue #125 on hold
Jason Vander Heiden created an issue

The performance difference between --act first and --act set can be huge in v0.3.12. Example with 580K sequences:

--act set

PROGRESS> Grouping sequences
PROGRESS> 14:21:55 (583524) 187.6 min

PROGRESS> Assigning clones
PROGRESS> 14:27:55 |####################| 100% (583,524) 193.2 min

--act first

PROGRESS> Grouping sequences
PROGRESS> 14:57:34 (583524) 4.4 min

PROGRESS> Assigning clones
PROGRESS> 15:01:39 |####################| 100% (583,524) 8.2 min

Comments (14)

  1. Jason Vander Heiden reporter

    There's a similar, but maybe not the same, set algorithm in shazam::distToNearest. We should verify the results are the same, and if the performance is better in the R version, mirror the algorithm.

    If the R version is faster, with the same results, then it's likely due to vectorized operations in R.

  2. Jason Vander Heiden reporter

    @skleinstein and I were thinking it might be best to redo the algorithm itself a little. Instead of doing the set unions by J gene, then by V gene, try doing them for the V and J at the same time (but still after junction length and annotation grouping).

    Also, first build an equivalency table for the V/J sets by going through all the V/J annotations once. Then, after you have that table go through the sequences again and index them according to whether they have a V/J pair in the equivalency table.

    Hopefully, that should be more like O(2n), i.e. two linear passes over the data. We hope. Needs testing to make sure it works how we think it'll work.

    So, if you (@nimanouri) implement and test the new approach in SCOPe, then I'll port it to DefineClones.
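
The two-pass idea described in this comment could be sketched roughly as follows. This is only an illustration of one reading of the proposal, not the code in DefineClones, SCOPe, or alakazam: the record layout, the function names `find` and `group_by_vj_sets`, and the use of a union-find table as the "equivalency table" are all assumptions. It treats every V/J combination a sequence is annotated with as equivalent, within a junction length, and assumes each record has at least one V and one J call.

```python
from collections import defaultdict
from itertools import product


def find(parent, key):
    """Return the representative of key's set, compressing the path."""
    root = key
    while parent[root] != root:
        root = parent[root]
    while parent[key] != root:
        parent[key], key = root, parent[key]
    return root


def group_by_vj_sets(records):
    """Group records whose ambiguous V/J calls overlap (hypothetical helper).

    records: iterable of (seq_id, v_calls, j_calls, junction_length),
    where v_calls and j_calls are non-empty lists of gene names.
    """
    records = list(records)
    parent = {}

    # Pass 1: build the equivalence table. All (V, J) pairs seen on the
    # same sequence are merged into one set, per junction length.
    for _, v_calls, j_calls, junc_len in records:
        keys = [(v, j, junc_len) for v, j in product(v_calls, j_calls)]
        for k in keys:
            parent.setdefault(k, k)
        first = find(parent, keys[0])
        for k in keys[1:]:
            parent[find(parent, k)] = first

    # Pass 2: index each sequence by the representative of any of its pairs
    # (they all share one representative after pass 1).
    groups = defaultdict(list)
    for seq_id, v_calls, j_calls, junc_len in records:
        root = find(parent, (v_calls[0], j_calls[0], junc_len))
        groups[root].append(seq_id)
    return list(groups.values())


example = [
    ("s1", ["V1"], ["J1"], 30),
    ("s2", ["V1", "V2"], ["J1"], 30),  # ambiguous V call links V1 and V2
    ("s3", ["V2"], ["J1"], 30),
    ("s4", ["V2"], ["J2"], 30),        # different J, stays separate
    ("s5", ["V1"], ["J1"], 33),        # different junction length
]
result = sorted(sorted(g) for g in group_by_vj_sets(example))
```

Each pass touches every sequence once, so the cost is close to linear in the number of sequences, apart from the number of V/J combinations per sequence. Whether joint V/J unions give the same groups as the current sequential J-then-V unions would still need to be checked.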

  3. Jason Vander Heiden reporter

    I'm having trouble replicating the --act set performance problem on a similar data set of 500K sequences. Grouping takes about 8 minutes, which is still a bit slow, but not unreasonably so.

    I'm starting to suspect this is actually a memory problem. DefineClones gobbles up a lot of memory. A chunk of that we can probably fix by using strings instead of Bio.Seq.Seq objects for sequence fields in Receptor.

  4. Jason Vander Heiden reporter

    We could potentially add an argument to control the maximum number of allowed ambiguous gene calls. E.g., --max-genes 3 would discard sequences with more than 3 V or J gene assignments.

    Not sure if it's necessary/ideal though.
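
For illustration, a filter along those lines might look like the sketch below. Note that `--max-genes` is only a proposed argument in this thread and does not exist in DefineClones; the function names, the record layout, and the choice to count distinct genes after stripping allele suffixes are assumptions. It also assumes comma-delimited gene calls.

```python
def count_genes(call):
    """Count distinct genes in a comma-delimited call, ignoring alleles."""
    return len({c.strip().split("*")[0] for c in call.split(",") if c.strip()})


def filter_ambiguous(records, max_genes=3):
    """Keep records with at most max_genes V genes and max_genes J genes.

    records: iterable of (seq_id, v_call, j_call) tuples (hypothetical layout).
    """
    return [
        (seq_id, v_call, j_call)
        for seq_id, v_call, j_call in records
        if count_genes(v_call) <= max_genes and count_genes(j_call) <= max_genes
    ]


example = [
    # Two alleles of one gene: counts as a single V gene.
    ("a", "IGHV1-2*01,IGHV1-2*02", "IGHJ4*02"),
    # Four distinct V genes: discarded when max_genes=3.
    ("b", "IGHV1-2*01,IGHV1-3*01,IGHV1-8*01,IGHV1-18*01", "IGHJ4*02"),
]
kept = [r[0] for r in filter_ambiguous(example, max_genes=3)]
```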

  5. Jason Vander Heiden reporter

    Yes, still open. We never swapped out the VJ grouping algorithm in DefineClones.

  6. nima nouri

    I fixed the grouping in alakazam, but Julian improved it further. I am not familiar with the algorithm he used. Jason, maybe you could take care of this one, as I am not familiar with the DefineClones code.

  7. Jason Vander Heiden reporter

    I planned to take care of this before I left, but it didn’t work out. It’s kind of a big task and I don’t realistically have time to tackle it right now. So let’s just continue to back burner it unless someone else can take care of it.

  8. Julian Zhou

    If we’re planning to move all the clonal inference stuff into scoper, then this becomes a non-issue, right?

  9. Jason Vander Heiden reporter

    Yeah, which is why I think it’s fine to back burner it. Though, it would be nice for the DefineClones algorithm to be correct, if it sticks around.
