Redesign CollapseSeq algorithm

Issue #74 new
Jason Vander Heiden created an issue

The current CollapseSeq algorithm is less than ideal from a performance perspective. It needs to be reworked.

Here's an approach with some code to integrate, courtesy of Yaacov:

Here is a working version of the collapse step. newcollapse.py and sequence.py
can be modified to change the types of files that can be read and the
format of the output.

The main algorithm is to build a tree of sequences and then replace
each sequence with the consensus of it and its neighbors, resolving
unknown bases where the neighbors agree.
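
To make the idea concrete, here is a minimal sketch, assuming fixed-length uppercase DNA strings with "N" as the only unknown base. The names (make_node, insert, neighbors, consensus) are placeholders for illustration, not the actual functions in newcollapse.py or collapsetree.py:

    from collections import defaultdict

    def make_node():
        # Each node maps a base to a child node; leaves mark full sequences.
        return defaultdict(make_node)

    def insert(root, seq):
        """Add a sequence to the prefix tree, one base per level."""
        node = root
        for base in seq:
            node = node[base]

    def neighbors(root, seq):
        """Return stored sequences that agree with seq at every position
        where both are known ("N" on either side matches anything)."""
        paths = [(root, "")]
        for base in seq:
            next_paths = []
            for node, prefix in paths:
                for edge, child in node.items():
                    if base == "N" or edge == "N" or base == edge:
                        next_paths.append((child, prefix + edge))
            paths = next_paths
        return [prefix for _, prefix in paths]

    def consensus(group):
        """Per column, keep the single known base if the group agrees,
        otherwise leave the column as 'N'."""
        out = []
        for column in zip(*group):
            known = set(column) - {"N"}
            out.append(known.pop() if len(known) == 1 else "N")
        return "".join(out)

    # Example: "ACNT" and "ACGT" are neighbors, so the N is resolved.
    root = make_node()
    for s in ["ACGT", "ACNT", "TTTT"]:
        insert(root, s)
    print(consensus(neighbors(root, "ACNT")))  # -> ACGT

Walking the tree with "N" matching either side is what lets a read with unknowns collect all of its consistent neighbors in a single pass.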

The file collapsetree.py has the main data structure we work on.
Right now it is a simple prefix tree, but there are several
improvements that could be made. (These are mostly notes on what I
want/have to do.)

1. Replacing the prefix tree with a more randomized structure should help the tree branch out faster (see the sketch after this list).

2. The structure and algorithm are mostly parallelizable.

3. Comparing only parts of sequences when feasible should help.

4. Generalize the way sequence features are combined. Maybe also generalize the way sequences are ranked when choosing which one is included in the collapsed set.
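
As a sketch of what item 1 might look like, assuming reads often share long common prefixes (e.g., primer regions); insert_shuffled is a placeholder name, not a function in collapsetree.py, and any matching lookup would have to walk positions in the same permuted order:

    import random
    from collections import defaultdict

    def make_node():
        return defaultdict(make_node)

    def insert_shuffled(root, seq, order):
        """Key the tree on one shared random permutation of positions
        rather than left-to-right order."""
        node = root
        for i in order:          # visit positions in the permuted order
            node = node[seq[i]]

    order = list(range(4))
    random.Random(0).shuffle(order)  # one permutation shared by all reads

    root = make_node()
    for s in ["ACGA", "ACGC", "ACGT"]:  # identical first three bases
        insert_shuffled(root, s, order)

With left-to-right prefixes the three reads above stay on a single path for three levels; under a random permutation the differing position is, on average, reached earlier, so the tree fans out sooner.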

Comments (5)

  1. Former user Account Deleted

    Hi! I’m running into this with some very large libraries and I was wondering what the timeline on this would be. I can help out with some testing/benchmarking if needed. Are the scripts posted here functional? Ideally a solution that can leverage multiple cores would be great. Since we have so many unique input sequences, I imagine chunking and parallel runs (while keeping track of DUPCOUNT) isn’t really going to speed it up by too much…

    Thanks!

  2. Jason Vander Heiden reporter

    The scripts should be functional. I’m not sure about the timeline. Maybe someone else can chime in on that?

    A decent stop-gap is to run it with -n 0 for speed and then do the distance-based duplicate removal with alakazam::collapseDuplicates after V(D)J alignment. The collapseDuplicates function is designed for preprocessing clonal groups, so it's not a perfect solution, but it should work. The important difference is that you have to group the sequences by clone_id or something similar and then call collapseDuplicates on each of those groups, both so the function behaves properly (sequences of identical length are required) and to improve performance.

    For example:

    library(alakazam)
    library(airr)
    library(dplyr)
    
    # Read the AIRR rearrangement table
    db <- read_rearrangement("file.tsv")
    # Collapse duplicates within groups of identical-length sequences
    # sharing the same V gene, J gene, and junction length
    db_collapse <- db %>%
        mutate(v_gene=getGene(v_call),
               j_gene=getGene(j_call),
               sequence_length=nchar(sequence_alignment)) %>%
        group_by(v_gene, j_gene, sequence_length, junction_length) %>%
        do(collapseDuplicates(., add_count=TRUE))
    

    This will first group by V gene, J gene, sequence length, and junction length, and then collapse. If you have already performed clonal clustering, then this can be simplified to:

    # Collapse duplicates within each existing clonal group
    db_collapse <- db %>%
        group_by(clone_id) %>%
        do(collapseDuplicates(., add_count=TRUE))
    

  3. Jason Vander Heiden reporter

    No, there hasn’t been an update on this yet, that I’m aware of. Sorry. I think the recommended workflow is still to follow up with `alakazam::collapseDuplicates`.

    I added a note to the command-line docs that setting -n greater than 0 is much more computationally expensive than -n 0.
