A fast, simple Illumina QSEQ to Sanger FASTQ format converter in C++ with threads, filtering, and demuxing
Why? A simple AWK script can do the same thing.
Yes, but AWK scripts quickly become clumsy when you try to make them do more complicated things (say, filtering reads by Illumina quality scores and keeping only the pairs where both ends pass the filter). This implementation makes simple stuff simple while making complicated stuff possible.
Meh, I can hack a perl script together to do the same thing.
I love perl. It's one of my favorite languages, but when you're dealing with 20-30GB of data, let's face it, the perl interpreter has a significant cost. This implementation is written with speed in mind and will go basically as fast as your I/O can handle (and roughly 20 times faster than a standard perl implementation).
Compile using scons. For example:
and then run "./qseq2fastq --help" for simple usage instructions.
Fix compilation problems?
- As of version 1.2, qseq2fastq requires a C++11-compliant compiler. I recommend clang. You can specify an alternate compiler by building with something like this:
- Boost is also required (version 1.35 or later).
As a simple example: "./qseq2fastq qseq" would convert the contents of the "qseq" directory into FASTQ files in a new directory called "fastq".
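To make the conversion itself concrete, here is a minimal sketch of turning one QSEQ record into one Sanger FASTQ record. It is not the code qseq2fastq uses. It assumes the usual 11-field tab-separated QSEQ layout, Phred+64 quality characters, and "." for unknown bases; the function names (split_tabs, qseq_to_fastq) and the exact header layout are illustrative choices, not something this tool is guaranteed to produce.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Split one QSEQ line into its tab-separated fields.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// Convert a single QSEQ record into a four-line Sanger FASTQ record.
// Assumed field order: machine, run, lane, tile, x, y, index,
// read number, sequence, quality, filter flag.
std::string qseq_to_fastq(const std::string& line) {
    const std::vector<std::string> f = split_tabs(line);
    if (f.size() != 11) throw std::runtime_error("expected 11 QSEQ fields");

    std::string seq = f[8];
    std::string qual = f[9];
    if (seq.size() != qual.size())
        throw std::runtime_error("sequence and quality lengths differ");

    // QSEQ writes unknown bases as '.', FASTQ uses 'N'.
    for (char& base : seq)
        if (base == '.') base = 'N';

    // Shift Phred+64 characters down to Phred+33 (Sanger),
    // clamping anything below the Phred+64 floor to '!'.
    for (char& q : qual)
        q = static_cast<char>(q < 64 ? 33 : q - 31);

    // Illustrative header: machine_run:lane:tile:x:y#index/read
    const std::string id = f[0] + "_" + f[1] + ":" + f[2] + ":" + f[3] + ":" +
                           f[4] + ":" + f[5] + "#" + f[6] + "/" + f[7];
    return "@" + id + "\n" + seq + "\n+\n" + qual + "\n";
}
```

Calling qseq_to_fastq on each line of a qseq file and writing the results out gives the basic one-to-one conversion; threading, filtering, and demuxing are layered on top of this step.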
For multiplexed reads, typically one "end" (end 2 by default) of qseq files contains the index sequence. qseq2fastq will handle the demuxing for you if you supply a list of valid index sequences using the --indexlist option. For example:
qseq2fastq --indexlist=myindexes.txt qseqdir
Suppose the contents of myindexes.txt looks like this:
ATCACGA
CGATGTA
TTAGGCA
TGACCAA
ACAGTGA
GCCAATA
Read tuples (e.g. front, index, and back) will be output to separate files like 3_1.idx1.fastq (for reads whose index matches CGATGTA). Note that by default the indexes read from the index file are complemented before matching, since users generally have a list of the adapters they applied to their libraries rather than the bases they expect to read out. If you are supplying a raw list of index sequences, try the --noindexcomp option. Read tuples that don't match any index are output without a ".idx_" suffix (e.g. 3_1.fastq). The number of mismatches permitted is calculated automatically so that no read can ambiguously match more than one index. You can also set a stricter threshold with --indexdist, but not a looser one.
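One common way to pick a mismatch limit that can never be ambiguous is to take the smallest pairwise Hamming distance d between the indexes and allow at most (d - 1) / 2 mismatches. The sketch below illustrates that idea. Whether qseq2fastq computes its threshold exactly this way is an assumption, the function names are hypothetical, and the complementing step is left out for brevity. All indexes are assumed to have the same length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of positions at which two equal-length sequences differ.
std::size_t hamming(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Largest mismatch count for which no read can be within range of two
// different indexes: floor((d_min - 1) / 2), where d_min is the smallest
// pairwise distance. Returns 0 (a conservative choice) for fewer than
// two indexes.
std::size_t max_safe_mismatches(const std::vector<std::string>& indexes) {
    if (indexes.size() < 2) return 0;
    std::size_t dmin = indexes[0].size();
    for (std::size_t i = 0; i < indexes.size(); ++i)
        for (std::size_t j = i + 1; j < indexes.size(); ++j)
            dmin = std::min(dmin, hamming(indexes[i], indexes[j]));
    return dmin == 0 ? 0 : (dmin - 1) / 2;
}

// Position of the index within `allowed` mismatches of `read`,
// or -1 when nothing matches (or the lengths differ).
int match_index(const std::string& read,
                const std::vector<std::string>& indexes,
                std::size_t allowed) {
    for (std::size_t i = 0; i < indexes.size(); ++i)
        if (read.size() == indexes[i].size() &&
            hamming(read, indexes[i]) <= allowed)
            return static_cast<int>(i);
    return -1;
}
```

For the six example indexes above, the closest pairs differ at four positions, so this rule permits one mismatch. A stricter setting would simply pass a smaller value for allowed.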