PairSeq is slow on large files
Issue #1
resolved
The index pairing algorithm in IgCore::indexSeqPairs does not scale well and needs profiling and improvement. One option is to implement it as a hash table strategy instead of a set intersection.
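To make the two strategies concrete, here is a minimal sketch. It is not the IgCore code: `pair_by_intersection` and `pair_by_hash` are hypothetical names, and the inputs are assumed to be plain sequences of read identifiers.

```python
def pair_by_intersection(ids_1, ids_2):
    """Materialize the keys of both inputs as sets, then intersect them."""
    return set(ids_1) & set(ids_2)


def pair_by_hash(ids_1, ids_2):
    """Hash only the first input, then stream the second.

    Yields (identifier, position in ids_1, position in ids_2) in the
    order of the second input, so ids_2 may be any iterable.
    """
    lookup = {seq_id: pos for pos, seq_id in enumerate(ids_1)}
    for pos_2, seq_id in enumerate(ids_2):
        pos_1 = lookup.get(seq_id)
        if pos_1 is not None:
            yield seq_id, pos_1, pos_2
```

The hash version holds the keys of only one input in memory and needs a single pass over the other, whereas the intersection has to collect both key sets before any pair can be reported.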
Comments (2)
reporter -
reporter - changed status to resolved
Modified behavior to index one file and iterate over the other. Also removed the indexSeqPairs() step in favor of passing a key_function to SeqIO.index(). PairSeq no longer outputs unpaired files, but it is much faster.
Also removed the indexSeqPairs() step from AssemblePairs and SplitSeq-samplepairs.
This is probably due to file I/O, in particular how Biopython's SeqIO.index() accesses specific positions in the file. It may be necessary to implement an alternative to SeqIO.index() using the linecache library. The primary obstacle is synchronizing the ordering of both files without loading all the sequences into memory.
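As a rough illustration of indexing one file and iterating the other without holding the sequences in memory, the sketch below uses only the standard library. It is not the PairSeq implementation and not how SeqIO.index() works internally. All function names are hypothetical, the `/1` and `/2` mate suffixes are an assumed identifier convention, and the input is assumed to be strict four-line FASTQ.

```python
def strip_pair_suffix(seq_id):
    """Hypothetical key function: drop a trailing /1 or /2 mate label."""
    if seq_id.endswith(("/1", "/2")):
        return seq_id[:-2]
    return seq_id


def read_id(header):
    """Return the identifier from a FASTQ header line given as bytes."""
    return header[1:].split()[0].decode("ascii")


def index_fastq(path, key_function=strip_pair_suffix):
    """Map each record key to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header.strip():
                break  # end of file, or a blank line treated as the end
            for _ in range(3):
                handle.readline()  # skip sequence, separator, quality
            offsets[key_function(read_id(header))] = offset
    return offsets


def iter_pairs(path_1, path_2, key_function=strip_pair_suffix):
    """Index file 1, stream file 2, and yield matching raw records.

    Each record is a list of four bytes lines. Reads in file 2 with no
    mate in file 1 are skipped, so pairs come out in file 2 order.
    """
    offsets = index_fastq(path_1, key_function)
    with open(path_1, "rb") as indexed, open(path_2, "rb") as streamed:
        while True:
            record_2 = [streamed.readline() for _ in range(4)]
            if not record_2[0].strip():
                break
            offset = offsets.get(key_function(read_id(record_2[0])))
            if offset is None:
                continue
            indexed.seek(offset)
            record_1 = [indexed.readline() for _ in range(4)]
            yield record_1, record_2
```

Only the keys and byte offsets of the first file are kept in memory. Byte offsets with seek() are used here rather than linecache, because linecache caches every line of a file in memory once it is read.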