Check performance of DefineClones with `--act set` in current version

Issue #125 on hold

Jason Vander Heiden created an issue 2018-04-29

The performance difference between --act first and --act set can be huge in v0.3.12. Example with 580K sequences:

--act set

PROGRESS> Grouping sequences
PROGRESS> 14:21:55 (583524) 187.6 min

PROGRESS> Assigning clones
PROGRESS> 14:27:55 |####################| 100% (583,524) 193.2 min

--act first

PROGRESS> Grouping sequences
PROGRESS> 14:57:34 (583524) 4.4 min

PROGRESS> Assigning clones
PROGRESS> 15:01:39 |####################| 100% (583,524) 8.2 min

Comments (14)

Jason Vander Heiden reporter
There's a similar, but maybe not identical, set algorithm in shazam::distToNearest. We should verify that the results are the same and, if the performance is better in the R version, mirror the algorithm.

If the R version is faster, with the same results, then it's likely due to vectorized operations in R.
- 2018-05-04T16:38:02+00:00
Jason Vander Heiden reporter
- assigned issue to nima nouri
- 2018-05-04T18:26:04+00:00
Jason Vander Heiden reporter
- marked as enhancement
- 2018-05-04T18:26:15+00:00
Jason Vander Heiden reporter
@skleinstein and I were thinking it might be best to redo the algorithm itself a little. Instead of doing the set unions by J gene, then by V gene, try doing them for the V and J at the same time (but still after junction length and annotation grouping).

Also, first build an equivalency table for the V/J sets by going through all the V/J annotations once. Then, once you have that table, go through the sequences again and index them according to which entry in the equivalency table their V/J pair falls under.

Hopefully, that should be more like O(2n). We hope. Needs testing to make sure it works how we think it'll work.

So, if you (@nimanouri) implement and test the new approach in SCOPe, then I'll port it to DefineClones.
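
A minimal sketch of the two-pass idea described above, assuming each sequence carries a set of V calls, a set of J calls, and a junction length. It is not the SCOPe or DefineClones implementation: the `group_by_vj` and `find` names, the record layout, and the use of a union-find table are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def find(parent, key):
    """Return the representative of key's set, compressing the path on the way."""
    root = key
    while parent[root] != root:
        root = parent[root]
    while parent[key] != root:
        parent[key], key = root, parent[key]
    return root


def group_by_vj(records):
    """Group records that are linked through any shared (V, J) pair.

    records: iterable of (seq_id, v_calls, j_calls, junction_length), where
    v_calls and j_calls are non-empty sets of gene names (hypothetical layout).
    """
    records = list(records)
    parent = {}

    # Pass 1: build the equivalency table by merging every V/J pair
    # that co-occurs on the same sequence (within one junction length).
    for _, v_calls, j_calls, length in records:
        keys = [(length, v, j) for v, j in product(sorted(v_calls), sorted(j_calls))]
        for key in keys:
            parent.setdefault(key, key)
        first = find(parent, keys[0])
        for key in keys[1:]:
            parent[find(parent, key)] = first

    # Pass 2: index each sequence by the representative of one of its pairs.
    groups = defaultdict(list)
    for seq_id, v_calls, j_calls, length in records:
        key = (length, min(v_calls), min(j_calls))
        groups[find(parent, key)].append(seq_id)
    return list(groups.values())


records = [
    ("s1", {"IGHV1-2"}, {"IGHJ4"}, 45),
    ("s2", {"IGHV1-2", "IGHV1-3"}, {"IGHJ4"}, 45),  # ambiguous V call links s1 and s3
    ("s3", {"IGHV1-3"}, {"IGHJ4"}, 45),
    ("s4", {"IGHV3-23"}, {"IGHJ6"}, 45),
    ("s5", {"IGHV1-2"}, {"IGHJ4"}, 48),  # same genes, different junction length
]
groups = group_by_vj(records)
```

Each record is visited twice and each table lookup is close to constant time, which is in the spirit of the "two passes" estimate; whether this produces the same groups as the current --act set behavior would still need to be tested.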
- 2018-05-04T22:09:06+00:00
Jason Vander Heiden reporter
I'm having trouble replicating the --act set performance problem on a similar data set of 500K sequences. Grouping takes about 8 minutes, which is still a bit slow, but not unreasonably so.

I'm starting to suspect this is actually a memory problem. DefineClones gobbles up a lot of memory. A chunk of that we can probably fix by using strings instead of Bio.Seq.Seq objects for sequence fields in Receptor.
- 2018-06-19T16:19:36+00:00
nima nouri
Added a new function in the sandbox. It still needs to be checked, though.
- 2018-06-20T21:33:35+00:00
Jason Vander Heiden reporter
We could potentially add an argument to control the maximum number of allowed ambiguous gene calls. E.g., --max-genes 3 would discard sequences with more than 3 V or J gene assignments.

Not sure if it's necessary/ideal though.
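
A rough sketch of what such a filter might look like. Note that --max-genes is only a proposal in this thread, not an existing DefineClones option. The `count_genes` and `filter_max_genes` helpers are hypothetical, and the sketch assumes gene calls are comma-separated strings with an optional `*allele` suffix.

```python
def count_genes(call):
    """Count distinct genes in a comma-separated call string, ignoring alleles."""
    genes = {c.strip().split("*")[0] for c in call.split(",") if c.strip()}
    return len(genes)


def filter_max_genes(records, max_genes):
    """Keep the IDs of records whose V and J calls each name at most max_genes genes.

    records: iterable of (seq_id, v_call, j_call) tuples (hypothetical layout).
    """
    kept = []
    for seq_id, v_call, j_call in records:
        if count_genes(v_call) <= max_genes and count_genes(j_call) <= max_genes:
            kept.append(seq_id)
    return kept


records = [
    ("s1", "IGHV1-2*01,IGHV1-2*02", "IGHJ4*02"),  # two alleles of one V gene
    ("s2", "IGHV1-2*01,IGHV1-3*01,IGHV1-8*01,IGHV1-18*01", "IGHJ4*02"),  # four V genes
    ("s3", "IGHV3-23*01", "IGHJ4*02,IGHJ5*01"),
]
kept = filter_max_genes(records, max_genes=3)
```

Counting genes rather than alleles is a design choice in this sketch; a real option would need to decide which of the two the limit applies to.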
- 2018-07-02T15:10:16+00:00
ssnn
@Jason Vander Heiden do you know if this issue is still open?
- 2020-03-20T15:34:39+00:00
Jason Vander Heiden reporter
Yes, still open. We never swapped out the VJ grouping algorithm in DefineClones.
- 2020-03-20T16:14:00+00:00
nima nouri
I fixed the grouping in alakazam, but Julian improved it further. I am not familiar with the algorithm he used. Jason, maybe you could take care of this one, as I am not familiar with the DefineClones code.
- 2020-03-20T16:39:16+00:00
Jason Vander Heiden reporter
I planned to take care of this before I left, but it didn’t work out. It’s kind of a big task and I don’t realistically have time to tackle it right now. So let’s just continue to back burner it unless someone else can take care of it.
- 2020-03-20T16:55:37+00:00
Julian Zhou
If we’re planning to move all the clonal inference stuff into scoper, then this becomes a non-issue, right?
- 2020-03-20T16:56:54+00:00
Jason Vander Heiden reporter
Yeah, which is why I think it’s fine to back burner it. Though, it would be nice for the DefineClones algorithm to be correct, if it sticks around.
- 2020-03-20T16:59:23+00:00
ssnn
- changed status to on hold
Let's put this on hold, then.
- 2020-03-20T18:13:38+00:00

Assignee: nima nouri

Type: enhancement

Priority: major

Status: on hold

Milestone: –

Votes: 0

Watchers: 4