PairSeq is slow on large files

Issue #1 resolved
Jason Vander Heiden created an issue

The index pairing algorithm in IgCore::indexSeqPairs does not scale well. Needs profiling and improvement. Possibly implement as a hash table strategy instead of set intersection.
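A minimal sketch of the hash table idea follows. It is not the actual IgCore::indexSeqPairs code. It assumes each read header can be reduced to a pairing key (here, hypothetically, the text before the first whitespace), builds a dict over the headers of one file, and looks up each header of the other file in a single pass. The names `pair_key` and `match_pairs` are illustrative only.

```python
def pair_key(header):
    """Reduce a header to its pairing key.

    Assumption for illustration: the key is the text before the first whitespace.
    """
    parts = header.split(None, 1)
    return parts[0] if parts else ""


def match_pairs(headers_1, headers_2, key_func=pair_key):
    """Return (header_1, header_2) tuples for keys found in both inputs."""
    # Hash table: pairing key -> full header from file 1
    index_1 = {key_func(h): h for h in headers_1}

    # Single pass over file 2; each lookup is O(1) on average
    pairs = []
    for h2 in headers_2:
        h1 = index_1.get(key_func(h2))
        if h1 is not None:
            pairs.append((h1, h2))
    return pairs
```

The result follows the order of the second input, so no separate step is needed to synchronize ordering between the two files.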

Comments (2)

  1. Jason Vander Heiden reporter

    This is probably due to file I/O, in particular how Biopython SeqIO.index() accesses specific positions in the file. May need to implement an alternative to SeqIO.index() using the linecache library. Synchronizing the ordering in both files without loading all the sequences into memory is the primary obstacle.

  2. Jason Vander Heiden reporter

    Modified behavior to index one file and iterate over the other. Also removed the indexSeqPairs() step in favor of passing a key_function to SeqIO.index(). Unpaired files are no longer output, but it is much faster.

    Also removed indexSeqPairs() step from AssemblePairs and SplitSeq-samplepairs.
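The approach described in comment 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was committed: `coord_key` and `iter_pairs` are hypothetical names, and the key function assumes mates differ only by a trailing `/1` or `/2` in the record ID. `SeqIO.index()` and `SeqIO.parse()` are the Biopython calls named in the thread.

```python
def coord_key(record_id):
    """Hypothetical key function: strip a trailing /1 or /2 mate suffix."""
    if record_id.endswith(("/1", "/2")):
        return record_id[:-2]
    return record_id


def iter_pairs(file_1, file_2, fmt="fastq", key_func=coord_key):
    """Yield (record_1, record_2) for reads present in both files."""
    # Imported here so coord_key can be used without Biopython installed
    from Bio import SeqIO

    # Index file 1 on disk; key_func maps each record ID to its dictionary key
    index_1 = SeqIO.index(file_1, fmt, key_function=key_func)
    try:
        # Stream file 2 and look up each record's mate in the index
        for rec_2 in SeqIO.parse(file_2, fmt):
            key = key_func(rec_2.id)
            if key in index_1:
                yield index_1[key], rec_2
    finally:
        index_1.close()
```

Reads in the second file without a mate in the index are skipped, and reads that appear only in the first file are never visited, which is consistent with unpaired output being dropped.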
