fastq file has unrecognized type
I have a user that uploaded a set of paired-end read files and is trying to run pRESTO on them, but it fails during the assemble stage.
AssemblePairs.py align -1 Bb_R1.t4.fastq -2 Bb_R2.t4.fastq --coord illumina --rc tail --outname Bb_R1.t4
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:112: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
p_matrix[x, i:] = 1 - stats.binom.cdf(x - 1, k[i:], 0.25) - stats.binom.pmf(x, k[i:], 0.25) / 2.0
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:131: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
z_matrix[x, j:] = (x - k[j:]/4.0)/np.sqrt(3.0/16.0*k[j:])
START> AssemblePairs
COMMAND> align
FILE1> Bb_R1.t4.fastq
FILE2> Bb_R2.t4.fastq
COORD_TYPE> illumina
ALPHA> 1e-05
MAX_ERROR> 0.3
MIN_LEN> 8
MAX_LEN> 1000
SCAN_REVERSE> False
NPROC> 48
ERROR: File Bb_R1.t4.fastq has an unrecognized type
Technically I probably shouldn't be specifying --coord illumina, but I tried the other possibilities and none of them worked. I also tried running ConvertHeaders with all the different conversion methods, but they all produce the unrecognized type error. Here are the first few sequences in each file:
R1:
@MIG UMI:TCGGCCAACAAA:8
CGCACGTACTAGCAGTGGTATCAACGCAGAGTTCGGTCCAATCAAATCTTGGGGGGAGCACAGACACAGTGCTGCCTGCCCCTTTGTGCCATGGGCTCCAGGCTGCTCTGTTGGGTGCTGCTTTG
+
.5.'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@MIG UMI:ACCTCACGGAGG:14
GGGGCGTACTAGCAGTGGTATCAACGCAGAGTACCTTCACGTGAGGTCTTGGGGGAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTTCTGCTGCTTCTGGGGCCAGGCTCCGGGCT
+
..*2IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIEIIIIIIIIII2#
@MIG UMI:ATAAGCCCGAGA:9
CGATCGTACTAGCAGTGGTATCAACGCAGAGTATAATGCCCTGAGATCTTGGGGGAGAGTCCTGCTCCCCTTTCATCAATGCACAGATACAGAAGACCCCTCCGTCATGCAGCATCTGCCATGAG
+
+1%%IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICIIII
R2:
@MIG UMI:TCGGCCAACAAA:6
GGGGGACTCGGCCCTTTATCTTTGCGCCAGCAGCTCTATAGCGGGGGGGACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGAGGACCTGAACAAGGTGTAGCTAGAATAAG
+
IIII@7I@IIII@@I@I@@I7@@I@II7III@II7I@7@.IIIIIIIIIIIIIIIIII@IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@@IIIIIIIIIII@II7IIIIIIIIIIII..%.
@MIG UMI:ACCTCACGGAGG:8
CGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGAGCGGGGGCCTATGGCTACACCTTCGGTTCGGGGACCAGGTTAACCGTTGTAGAGGACCTGAACAAGGTGTAGCTAGAAATAC
+
5IBII;BIIIII.I5I5IIIIBII;IIIIBIIIBIIBIBI;I5IIIIIIIIIIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;II.IIIIIIIIIIII''''
@MIG UMI:ATAAGCCCGAGA:4
CCAGACATCTGTGTACTTCTGTGCCAGCAAGCCCTACGTACAGGATCCTGGAAACACCATATATTTTGGAGAGGGAAGTTGGCTCACTGTTGTAGAGGACCTGAACAACGTGTAGCTAGAACCAA
+
I;.I.II;I;IIIIIII.I;I;III;III;I;;I;;;I;;;;II;;;I;IIIIIIIIIIIIIII;IIII;I;IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIIIIII.;.#
Comments (10)
-
-
reporter Yeah, I'll see about upgrading pRESTO for the next VDJServer release.
I tried changing the file names, but same error:
AssemblePairs.py align -1 Bb_R1_t4.fastq -2 Bb_R2_t4.fastq --coord illumina --rc tail --outname Bb_R1_t4
START> AssemblePairs
COMMAND> align
FILE1> Bb_R1_t4.fastq
FILE2> Bb_R2_t4.fastq
COORD_TYPE> illumina
ALPHA> 1e-05
MAX_ERROR> 0.3
MIN_LEN> 8
MAX_LEN> 1000
SCAN_REVERSE> False
NPROC> 48
ERROR: File Bb_R1_t4.fastq has an unrecognized type
-
Weird. I'll look at it. It fails with just those top 4 sequences, yes? I suspect it's because the header format isn't supported. Is that MiGEC format?
It would need to use the UMI:TCGGCCAACAAA bit to pair the reads and doesn't know how to extract it. I guess we could also add an optional flag to ignore the headers and just blindly trust that the reads are paired in file order. Makes me nervous, but it's an option.
Oh, BTW... just noticed. --nproc 48 probably won't be appreciably faster than --nproc 20. Scaling is not the best with Python's multiprocessing library...
-
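On the UMI extraction mentioned above, here is a rough sketch of what pulling the barcode out of such a header could look like. It assumes the layout is always "@MIG UMI:<barcode>:<count>" as in the pasted reads. The regex and the parse_mig_header name are hypothetical and not something pRESTO provides, and the meaning of the trailing number is only a guess from the examples.

```python
import re

# Assumed layout: "@MIG UMI:<barcode>:<count>", taken from the reads pasted in this issue.
MIG_HEADER = re.compile(r"^@?MIG\s+UMI:(?P<umi>[ACGTN]+):(?P<count>\d+)$")

def parse_mig_header(header):
    """Return (umi, count) for a header in the assumed layout, else None."""
    match = MIG_HEADER.match(header.strip())
    if match is None:
        return None
    return match.group("umi"), int(match.group("count"))

# The same UMI appears in R1 and R2 with different trailing numbers,
# so only the barcode itself is usable as a pairing key.
r1 = parse_mig_header("@MIG UMI:TCGGCCAACAAA:8")
r2 = parse_mig_header("@MIG UMI:TCGGCCAACAAA:6")
print(r1, r2, r1[0] == r2[0])
```

Pairing on the barcode alone only works if each UMI occurs once per file, which seems to hold for the reads shown but is not verified for the full files.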
reporter I didn't run with only those sequences, I just cut/paste a few sequences from the top of each file. I kinda assumed it was an issue with the read_id (got read id's on the brain today!), but I will try a run with just those few sequences. I've no idea what the format is, I could try to contact the user to ask, but we cannot rely on getting a response. FYI- this isn't a user reported error, I watch jobs and pro-actively check on errors. Users have a tendency to ignore errors and/or not bother to report them...
-
reporter - attached test_r2.fastq
- attached test_r1.fastq
Yep, you can test with these files.
-
Yeah, it's the header. The new presto has a more informative error message:
ERROR: File bad_header/test_r1.fastq is invalid with exception Duplicate key 'MIG'
Hrm. Have to think about what to do with this.
-
- marked as enhancement
-
assigned issue to
- marked as minor
-
This is the problem:
from Bio import SeqIO
x = SeqIO.index('/home/jason/Downloads/test_r1.fastq', 'fastq')
ValueError: Duplicate key 'MIG'
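The reason, as far as I can tell: the record ID is the first whitespace-delimited word of the header, which is MIG for every read in these files, so an index keyed on the ID cannot tell the records apart. A minimal stdlib-only sketch of that check follows. The fastq_ids and duplicate_ids helpers are made up for illustration and are not part of Biopython or pRESTO, and the sketch assumes plain four-line FASTQ records.

```python
from collections import Counter
from io import StringIO

def fastq_ids(handle):
    """Yield the first whitespace-delimited word of each header line.

    Assumes plain four-line FASTQ records (no wrapped sequence lines).
    """
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # Drop the leading '@' and keep only the first word.
            yield line[1:].split()[0]

def duplicate_ids(handle):
    """Return a dict of IDs that occur more than once, with their counts."""
    counts = Counter(fastq_ids(handle))
    return {k: n for k, n in counts.items() if n > 1}

# Two shortened records with the same header layout as the files in this issue.
example = StringIO(
    "@MIG UMI:TCGGCCAACAAA:8\nCGCACG\n+\n.5.'II\n"
    "@MIG UMI:ACCTCACGGAGG:14\nGGGGCG\n+\n..*2II\n"
)
print(duplicate_ids(example))
```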
-
This is indeed a MIGEC header - specifically the consensus sequence output.
Mike is making a change in MIGEC v1.2.7 that should resolve the Bio.SeqIO.index incompatibility.
I've added a migec mode to ConvertHeaders in bc50b30 that should work with the new MIGEC version.
-
- changed status to resolved
Sort of resolved. Added a converter in ConvertHeaders, but supporting malformed headers in a general sense would be too much effort.
The warning is from an incompatibility between pRESTO v0.5.2 and newer versions of NumPy/SciPy. It should be fixed in v0.5.3. Can you update pRESTO?
The unrecognized type is probably because it thinks .t4.fastq is the file extension instead of .fastq. That shouldn't be happening, because it would mean that os.path.splitext() isn't well behaved. I'll take a look tomorrow.
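For reference, a quick check of how os.path.splitext() treats a name with two dots. It splits at the last dot only, so the extra .t4 on its own should not hide the .fastq extension. This is plain standard-library behaviour, not pRESTO's own file-type detection code.

```python
import os.path

# splitext() splits at the last dot only, so ".t4" stays in the root.
root, ext = os.path.splitext("Bb_R1.t4.fastq")
print(root, ext)
```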