Redesign CollapseSeq algorithm

Issue #74 new
Jason Vander Heiden created an issue

The current CollapseSeq algorithm is less than ideal from a performance perspective. It needs to be reworked.

Here's an approach with some code to integrate, courtesy of Yaacov:

Here is a working version of the collapse step. newcollapse.py and sequence.py
can be modified to change the types of files that can be read and the
format of the output.

The main algorithm is to build a tree of sequences and then replace
each sequence with the consensus of it and its neighbors, resolving
unknown bases where the neighbors agree.
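
To make the idea concrete, here is a minimal sketch, assuming fixed-length uppercase DNA strings with "N" as the only unknown base. The names (make_node, insert, neighbors, consensus) are placeholders for illustration, not the actual functions in newcollapse.py or collapsetree.py:

    from collections import defaultdict

    def make_node():
        # Each node maps a base to a child node; leaves mark full sequences.
        return defaultdict(make_node)

    def insert(root, seq):
        """Add a sequence to the prefix tree, one base per level."""
        node = root
        for base in seq:
            node = node[base]

    def neighbors(root, seq):
        """Return stored sequences that agree with seq at every position
        where both are known ("N" on either side matches anything)."""
        paths = [(root, "")]
        for base in seq:
            next_paths = []
            for node, prefix in paths:
                for edge, child in node.items():
                    if base == "N" or edge == "N" or base == edge:
                        next_paths.append((child, prefix + edge))
            paths = next_paths
        return [prefix for _, prefix in paths]

    def consensus(group):
        """Per column, keep the single known base if the group agrees,
        otherwise leave the column as 'N'."""
        out = []
        for column in zip(*group):
            known = set(column) - {"N"}
            out.append(known.pop() if len(known) == 1 else "N")
        return "".join(out)

    # Example: "ACNT" and "ACGT" are neighbors, so the N is resolved.
    root = make_node()
    for s in ["ACGT", "ACNT", "TTTT"]:
        insert(root, s)
    print(consensus(neighbors(root, "ACNT")))  # -> ACGT

Walking the tree with "N" matching either side is what lets a read with unknowns collect all of its consistent neighbors in a single pass.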

The file collapsetree.py has the main data structure we work on.
Right now it is a simple prefix tree, but there are several
improvements that could be made. (These are mostly notes on what I
want/have to do.)

1. Replacing the prefix tree with a more randomized structure should help the tree branch out faster (see the sketch after this list).

2. The structure and algorithm are mostly parallelizable.

3. Comparing only parts of sequences when feasible should help.

4. Generalize the way sequence features are combined. Maybe also generalize the way sequences are ranked when choosing which one is included in the collapsed set.
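
As a sketch of what item 1 might look like, assuming reads often share long common prefixes (e.g., primer regions); insert_shuffled is a placeholder name, not a function in collapsetree.py, and any matching lookup would have to walk positions in the same permuted order:

    import random
    from collections import defaultdict

    def make_node():
        return defaultdict(make_node)

    def insert_shuffled(root, seq, order):
        """Key the tree on one shared random permutation of positions
        rather than left-to-right order."""
        node = root
        for i in order:          # visit positions in the permuted order
            node = node[seq[i]]

    order = list(range(4))
    random.Random(0).shuffle(order)  # one permutation shared by all reads

    root = make_node()
    for s in ["ACGA", "ACGC", "ACGT"]:  # identical first three bases
        insert_shuffled(root, s, order)

With left-to-right prefixes the three reads above stay on a single path for three levels; under a random permutation the differing position is, on average, reached earlier, so the tree fans out sooner.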

Comments (5)

  1. Former user Account Deleted

    Hi! I’m running into this with some very large libraries and I was wondering what the timeline on this would be. I can help out with some testing/benchmarking if needed. Are the scripts posted here functional? Ideally a solution that can leverage multiple cores would be great. Since we have so many unique input sequences, I imagine chunking and parallel runs (while keeping track of DUPCOUNT) isn’t really going to speed it up by too much…

    Thanks!

  2. Jason Vander Heiden reporter

    The scripts should be functional. I’m not sure about the timeline. Maybe someone else can chime in on that?

    A decent stop-gap is to run it with -n 0 for speed and then do the distance-based duplicate removal with alakazam::collapseDuplicates after V(D)J alignment. The collapseDuplicates function is designed for preprocessing clonal groups, so it's not a perfect solution, but it should work. The important difference is that you have to group the sequences by clone_id or something similar and then call collapseDuplicates on each of those groups, both so the function behaves properly (sequences of identical length are required) and to improve performance.

    For example:

    library(alakazam)
    library(airr)
    library(dplyr)
    
    # Read the AIRR rearrangement table
    db <- read_rearrangement("file.tsv")
    # Collapse duplicates within groups of identical-length sequences
    # sharing the same V gene, J gene, and junction length
    db_collapse <- db %>%
        mutate(v_gene=getGene(v_call),
               j_gene=getGene(j_call),
               sequence_length=nchar(sequence_alignment)) %>%
        group_by(v_gene, j_gene, sequence_length, junction_length) %>%
        do(collapseDuplicates(., add_count=TRUE))
    

    This will first group by V gene, J gene, sequence length, and junction length, and then collapse. If you have already performed clonal clustering, then this can be simplified to:

    # Collapse duplicates within each existing clonal group
    db_collapse <- db %>%
        group_by(clone_id) %>%
        do(collapseDuplicates(., add_count=TRUE))
    

  3. Jason Vander Heiden reporter

    No, there hasn’t been an update on this yet, that I’m aware of. Sorry. I think the recommended workflow is still to follow up with `alakazam::collapseDuplicates`.

    I added a note to the command-line docs that setting -n greater than 0 is much more computationally expensive than -n 0.
