kleinstein / presto, issue #1: PairSeq is slow on large files

Issue #1 resolved

Jason Vander Heiden created an issue 2014-08-24

The index pairing algorithm in IgCore::indexSeqPairs does not scale well to large files. It needs profiling and improvement, possibly by implementing a hash table strategy instead of set intersection.
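For illustration, the hash table idea mentioned above might look roughly like the sketch below. It is not the code in IgCore; `pair_key` and `pair_by_hash` are hypothetical names, and the assumption that the pairing key is the header text before the first whitespace is mine.

```python
def pair_key(header):
    """Hypothetical key: the read identifier before the first whitespace."""
    return header.split(None, 1)[0]


def pair_by_hash(headers_1, headers_2, key_func=pair_key):
    """Pair headers from two files with a dict lookup instead of set intersection."""
    # One pass over the first file: map each key to its full header.
    lookup = {key_func(h): h for h in headers_1}

    # One pass over the second file: each lookup is O(1) on average.
    pairs = []
    for h in headers_2:
        mate = lookup.get(key_func(h))
        if mate is not None:
            pairs.append((mate, h))
    return pairs
```

The result follows the order of the second input, and headers without a mate are skipped.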

Comments (2)

Jason Vander Heiden reporter
This is probably due to file I/O, in particular how Biopython SeqIO.index() accesses specific positions in the file. May need to implement an alternative to SeqIO.index() using the linecache library. Synchronizing the ordering in both files without loading all the sequences into memory is the primary obstacle.
- 2014-12-05T19:15:43+00:00
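As a rough sketch of the linecache idea in the comment above: the standard library call `linecache.getline(filename, lineno)` returns a single line by its 1-based line number. The helper name `fastq_record` is hypothetical, and the code assumes FASTQ records of exactly four lines each (no wrapped sequences). Note that linecache reads and caches the whole file on first access, so on its own it would not address the memory concern.

```python
import linecache


def fastq_record(path, n):
    """Return the four lines of the n-th (0-based) record of a 4-line FASTQ file."""
    first = 4 * n + 1  # linecache line numbers start at 1
    return [linecache.getline(path, first + i).rstrip("\n") for i in range(4)]
```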
Jason Vander Heiden reporter
- changed status to resolved
Modified behavior to index one file and iterate the other. Also removed the indexSeqPairs() step in favor of passing a key_function to SeqIO.index(). Unpaired files are no longer output, but it is much faster.

Also removed indexSeqPairs() step from AssemblePairs and SplitSeq-samplepairs.
- 2015-03-12T17:24:15+00:00
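The "index one file and iterate the other" approach can be sketched with the standard library alone. This is not the presto implementation: the byte-offset index below is a simplified stand-in for SeqIO.index(), it handles FASTA only, and `read_id`, `index_fasta`, `fetch` and `iter_pairs` are hypothetical names.

```python
def read_id(header):
    """Hypothetical key function: identifier after '>' and before the first whitespace."""
    return header[1:].split(None, 1)[0]


def index_fasta(path, key_func=read_id):
    """Map each record key to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets[key_func(line.decode().rstrip())] = pos
    return offsets


def fetch(handle, offset):
    """Read one record (header, sequence) starting at a header offset."""
    handle.seek(offset)
    header = handle.readline().decode().rstrip()
    seq = []
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        seq.append(line.decode().strip())
    return header, "".join(seq)


def iter_pairs(path_1, path_2, key_func=read_id):
    """Index file 1, stream file 2, and yield (record_1, header_2) for shared keys."""
    offsets = index_fasta(path_1, key_func)
    with open(path_1, "rb") as h1, open(path_2, "rb") as h2:
        for line in h2:
            if line.startswith(b">"):
                header_2 = line.decode().rstrip()
                offset = offsets.get(key_func(header_2))
                if offset is not None:  # reads without a mate are skipped
                    yield fetch(h1, offset), header_2
```

Only the key-to-offset map for the first file is held in memory, and output follows the order of the second file, so the two files never need to be sorted or fully loaded.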

Assignee: Jason Vander Heiden

Type: enhancement

Priority: major

Status: resolved

Votes: 0

Watchers: 1
