Check performance of DefineClones with `--act set` in current version
The performance difference between `--act first` and `--act set` can be huge in v0.3.12. Example with 580K sequences:
--act set
PROGRESS> Grouping sequences
PROGRESS> 14:21:55 (583524) 187.6 min
PROGRESS> Assigning clones
PROGRESS> 14:27:55 |####################| 100% (583,524) 193.2 min
--act first
PROGRESS> Grouping sequences
PROGRESS> 14:57:34 (583524) 4.4 min
PROGRESS> Assigning clones
PROGRESS> 15:01:39 |####################| 100% (583,524) 8.2 min
Comments (14)
reporter - marked as enhancement
-
reporter @skleinstein and I were thinking it might be best to redo the algorithm itself a little. Instead of doing the set unions by J gene, then by V gene, try doing them for the V and J at the same time (but still after junction length and annotation grouping).
Also, first build an equivalency table for the V/J sets by going through all the V/J annotations once. Then, after you have that table go through the sequences again and index them according to whether they have a V/J pair in the equivalency table.
Hopefully, that should be more like O(2n). We hope. Needs testing to make sure it works how we think it'll work.
So, if you (@nimanouri) implement and test the new approach in SCOPe, then I'll port it to DefineClones.
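A minimal sketch of the two-pass idea described above, under one possible reading of it: records are linked when they share a (V, J) pair within the same junction length, and the links are transitive. The record layout, the function name `group_by_vj`, and the use of a union-find table are assumptions for illustration only; this is not code from DefineClones or SCOPe.

```python
from collections import defaultdict
from itertools import product


def group_by_vj(records):
    """Group records that share a (V, J) pair, transitively, per junction length.

    records: iterable of (seq_id, v_calls, j_calls, junction_length) tuples,
    where v_calls and j_calls are non-empty sets of gene names.
    This layout is hypothetical, not the DefineClones Receptor structure.
    """
    records = list(records)
    parent = {}  # equivalency table: node -> parent node

    def find(node):
        # Return the representative of the node's equivalence class.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Pass 1: build the equivalency table from the annotations alone.
    # All (V, J) pairs listed on one record are merged into one class.
    for _, v_calls, j_calls, length in records:
        nodes = [(length, v, j)
                 for v, j in product(sorted(v_calls), sorted(j_calls))]
        root = find(nodes[0])
        for node in nodes[1:]:
            parent[find(node)] = root

    # Pass 2: index each record by the class of any one of its pairs.
    groups = defaultdict(list)
    for seq_id, v_calls, j_calls, length in records:
        key = find((length, min(v_calls), min(j_calls)))
        groups[key].append(seq_id)
    return list(groups.values())
```

Each record is visited twice and each lookup is close to constant time, so the cost should grow roughly linearly with the number of (V, J) pairs. Whether the resulting groups match the current sequential J-then-V unions would still need testing.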
-
reporter I'm having trouble replicating the `--act set` performance problem on a similar data set of 500K sequences. Grouping takes about 8 minutes, which is still a bit slow, but not unreasonably so.
I'm starting to suspect this is actually a memory problem. DefineClones gobbles up a lot of memory. A chunk of that we can probably fix by using strings instead of Bio.Seq.Seq objects for sequence fields in Receptor.
-
Added a new function in the sandbox. It still needs to be checked, though.
-
reporter We could potentially add an argument to control the maximum number of allowed ambiguous gene calls. E.g., `--max-genes 3` would discard sequences with more than 3 V or J gene assignments.
Not sure if it's necessary/ideal though.
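A rough sketch of what such a filter might look like. The `--max-genes` option is only a proposal in this thread, and the helper name `within_gene_limit`, the comma-delimited call format, and the IMGT-style `gene*allele` naming are all assumptions made for illustration.

```python
def within_gene_limit(v_call, j_call, max_genes=3):
    """Return True if neither call lists more than max_genes distinct genes.

    Assumes comma-delimited calls such as "IGHV1-2*01,IGHV1-3*01" and that
    the allele suffix follows "*"; alleles of one gene are counted once.
    """
    def count_genes(call):
        names = (name.strip() for name in call.split(","))
        return len({name.split("*")[0] for name in names if name})

    return count_genes(v_call) <= max_genes and count_genes(j_call) <= max_genes


# Hypothetical usage: keep only records under the limit before grouping.
calls = [
    ("IGHV1-2*01,IGHV1-2*02", "IGHJ4*02"),
    ("IGHV1-2*01,IGHV1-3*01,IGHV1-8*01,IGHV1-18*01", "IGHJ4*02"),
]
kept = [c for c in calls if within_gene_limit(*c, max_genes=3)]
```

Counting genes rather than alleles is a design choice here; counting raw assignments instead would discard more sequences.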
-
@Jason Vander Heiden do you know if this issue is still open?
-
reporter Yes, still open. We never swapped out the VJ grouping algorithm in DefineClones.
-
I fixed the grouping in alakazam, but Julian improved it further. I am not familiar with the algorithm he used. Jason, maybe you could take care of this one, as I am not familiar with the DefineClones code.
-
reporter I planned to take care of this before I left, but it didn’t work out. It’s kind of a big task and I don’t realistically have time to tackle it right now. So let’s just continue to back burner it unless someone else can take care of it.
-
If we’re planning to move all the clonal inference stuff into scoper, then this becomes a non-issue, right?
-
reporter Yeah, which is why I think it’s fine to back burner it. Though, it would be nice for the DefineClones algorithm to be correct, if it sticks around.
-
- changed status to on hold
Let's put this on hold, then.
There's a similar, but maybe not the same, set algorithm in shazam::distToNearest. We should verify the results are the same, and if the performance is better in the R version, mirror the algorithm.
If the R version is faster, with the same results, then it's likely due to vectorized operations in R.