DefineClones indexJunctions is uber slow with action='set'

Issue #14 resolved
Jason Vander Heiden created an issue

DefineClones bygroup --act set is very slow. ~400,000 rows took a couple of minutes to index the V/J/junction lengths with action='first' and 20+ hours with action='set'. I'm not sure how long it would have taken in total, as I terminated it before it finished.

It took about 20 minutes with ~75,000 rows, so I'm guessing the set method is probably O(n^2) right now, which implies a nested for loop. There's probably a hash table (set or dict key) method we should use instead.
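As a rough illustration of the hash table idea, here is a minimal sketch. It is not the actual indexJunctions code. It assumes that action='set' should put two records in the same group whenever they share at least one V call and one J call and have the same junction length. The function name `group_by_set`, the record tuple layout, and the use of a union-find over dict keys are all hypothetical choices made for this example. Every record is assumed to have at least one V call and one J call.

```python
from collections import defaultdict
from itertools import product


def group_by_set(records):
    """Group records that share any (V, J, junction length) key.

    records: iterable of (record_id, v_calls, j_calls, junction_length),
    where v_calls and j_calls are non-empty iterables of gene names
    (more than one name means an ambiguous call).
    Returns a sorted list of sorted lists of record ids.
    """
    parent = {}  # union-find forest over (v, j, length) keys

    def find(key):
        parent.setdefault(key, key)
        root = key
        while parent[root] != root:
            root = parent[root]
        while parent[key] != root:  # path compression
            parent[key], key = root, parent[key]
        return root

    first_key = {}
    for rec_id, v_calls, j_calls, length in records:
        # One dict key per (V, J) combination of this record.
        keys = [(v, j, length)
                for v, j in product(sorted(set(v_calls)), sorted(set(j_calls)))]
        # An ambiguous record links all of its keys into one group.
        root = find(keys[0])
        for key in keys[1:]:
            parent[find(key)] = root
        first_key[rec_id] = keys[0]

    # Collect records by the final root of their first key.
    groups = defaultdict(list)
    for rec_id, key in first_key.items():
        groups[find(key)].append(rec_id)
    return sorted(sorted(ids) for ids in groups.values())


records = [
    ("a", ["V1"], ["J1"], 30),
    ("b", ["V1", "V2"], ["J1"], 30),  # ambiguous V call links V1 and V2
    ("c", ["V2"], ["J1"], 30),
    ("d", ["V1"], ["J1"], 33),        # different junction length
    ("e", ["V3"], ["J2"], 30),
]
print(group_by_set(records))
# [['a', 'b', 'c'], ['d'], ['e']]
```

Each record is looked up by dict key instead of being compared against every existing group, so the work grows roughly linearly with the number of rows (times the number of V/J combinations per row) rather than quadratically.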

Comments (2)

  1. Jason Vander Heiden reporter

    I've gotten some complaints about this speed issue from the AIRR group. I think we should either fix this for v0.3.3 or change the default to 'first', the former being preferred. It gives a bad first impression as is, if you try to run it on a large-ish data set, species with incomplete germlines, etc.
