fastq file has unrecognized type

Issue #54 resolved
Scott Christley created an issue

I have a user that uploaded a set of paired-end read files and is trying to run pRESTO on them, but it fails during the assemble stage.

AssemblePairs.py align -1 Bb_R1.t4.fastq -2 Bb_R2.t4.fastq --coord illumina --rc tail --outname Bb_R1.t4
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:112: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  p_matrix[x, i:] = 1 - stats.binom.cdf(x - 1, k[i:], 0.25) - stats.binom.pmf(x, k[i:], 0.25) / 2.0
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:131: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  z_matrix[x, j:] = (x - k[j:]/4.0)/np.sqrt(3.0/16.0*k[j:])
       START> AssemblePairs
     COMMAND> align
       FILE1> Bb_R1.t4.fastq
       FILE2> Bb_R2.t4.fastq
  COORD_TYPE> illumina
       ALPHA> 1e-05
   MAX_ERROR> 0.3
     MIN_LEN> 8
     MAX_LEN> 1000
SCAN_REVERSE> False
       NPROC> 48

ERROR:  File Bb_R1.t4.fastq has an unrecognized type

Technically I probably shouldn't be specifying --coord illumina, but I tried the other possibilities and none of them worked. I also tried running ConvertHeaders with all the different conversion methods, but they all produce the same unrecognized type error. Here are the first few sequences in each file:

R1:

@MIG UMI:TCGGCCAACAAA:8
CGCACGTACTAGCAGTGGTATCAACGCAGAGTTCGGTCCAATCAAATCTTGGGGGGAGCACAGACACAGTGCTGCCTGCCCCTTTGTGCCATGGGCTCCAGGCTGCTCTGTTGGGTGCTGCTTTG
+
.5.'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@MIG UMI:ACCTCACGGAGG:14
GGGGCGTACTAGCAGTGGTATCAACGCAGAGTACCTTCACGTGAGGTCTTGGGGGAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTTCTGCTGCTTCTGGGGCCAGGCTCCGGGCT
+
..*2IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIEIIIIIIIIII2#
@MIG UMI:ATAAGCCCGAGA:9
CGATCGTACTAGCAGTGGTATCAACGCAGAGTATAATGCCCTGAGATCTTGGGGGAGAGTCCTGCTCCCCTTTCATCAATGCACAGATACAGAAGACCCCTCCGTCATGCAGCATCTGCCATGAG
+
+1%%IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICIIII

R2:

@MIG UMI:TCGGCCAACAAA:6
GGGGGACTCGGCCCTTTATCTTTGCGCCAGCAGCTCTATAGCGGGGGGGACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGAGGACCTGAACAAGGTGTAGCTAGAATAAG
+
IIII@7I@IIII@@I@I@@I7@@I@II7III@II7I@7@.IIIIIIIIIIIIIIIIII@IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@@IIIIIIIIIII@II7IIIIIIIIIIII..%.
@MIG UMI:ACCTCACGGAGG:8
CGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGAGCGGGGGCCTATGGCTACACCTTCGGTTCGGGGACCAGGTTAACCGTTGTAGAGGACCTGAACAAGGTGTAGCTAGAAATAC
+
5IBII;BIIIII.I5I5IIIIBII;IIIIBIIIBIIBIBI;I5IIIIIIIIIIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;II.IIIIIIIIIIII''''
@MIG UMI:ATAAGCCCGAGA:4
CCAGACATCTGTGTACTTCTGTGCCAGCAAGCCCTACGTACAGGATCCTGGAAACACCATATATTTTGGAGAGGGAAGTTGGCTCACTGTTGTAGAGGACCTGAACAACGTGTAGCTAGAACCAA
+
I;.I.II;I;IIIIIII.I;I;III;III;I;;I;;;I;;;;II;;;I;IIIIIIIIIIIIIII;IIII;I;IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIIIIII.;.#

Comments (10)

  1. Jason Vander Heiden

    The warning is from an incompatibility between pRESTO v0.5.2 and newer versions of NumPy/SciPy. It should be fixed in v0.5.3. Can you update pRESTO?

    The unrecognized type is probably because it thinks .t4.fastq is the file extension instead of .fastq. That shouldn't be happening, because it would mean that os.path.splitext() isn't well behaved. I'll take a look tomorrow.
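
For reference, a minimal check of what os.path.splitext() returns for the two file names in the report. It only exercises the standard library call; how pRESTO itself derives the file type is not shown in this thread, so this does not reproduce its logic.

```python
import os.path

# File names taken from the AssemblePairs.py command in the report.
names = ["Bb_R1.t4.fastq", "Bb_R2.t4.fastq"]

# os.path.splitext() splits at the last dot only, so the extra ".t4"
# stays in the root and the extension is plain ".fastq".
results = {name: os.path.splitext(name) for name in names}

for name, (root, ext) in results.items():
    print(f"{name}: root={root!r}, ext={ext!r}")
```

The extension comes back as .fastq in both cases, so a multi-dot file name alone would not explain the error.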

  2. Scott Christley reporter

    Yeah, I'll see about upgrading pRESTO for the next VDJServer release.

    I tried changing the file names, but same error:

    AssemblePairs.py align -1 Bb_R1_t4.fastq -2 Bb_R2_t4.fastq --coord illumina --rc tail --outname Bb_R1_t4
           START> AssemblePairs
         COMMAND> align
           FILE1> Bb_R1_t4.fastq
           FILE2> Bb_R2_t4.fastq
      COORD_TYPE> illumina
           ALPHA> 1e-05
       MAX_ERROR> 0.3
         MIN_LEN> 8
         MAX_LEN> 1000
    SCAN_REVERSE> False
           NPROC> 48
    
    ERROR:  File Bb_R1_t4.fastq has an unrecognized type
    
  3. Jason Vander Heiden

    Weird. I'll look at it. It fails with just those top 4 sequences, yes? I suspect it's because the header format isn't supported. Is that MiGEC format?

    It would need to use the UMI:TCGGCCAACAAA bit to pair the reads and doesn't know how to extract it. I guess we could also add an optional flag to ignore the headers and just blindly trust that the reads are paired in file order. Makes me nervous, but it's an option.
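
As an illustration of the pairing idea in the comment above, here is a rough sketch that pulls the UMI out of each header and matches R1 to R2 on it. The names extract_umi and pair_by_umi and the regular expression are hypothetical and written for this example; this is not how pRESTO pairs reads. The headers are copied from the report, and the trailing number after the UMI (whose meaning the thread does not state) is ignored.

```python
import re

# Assumed header layout, based only on the examples in the report:
#   @MIG UMI:<barcode>:<number>
UMI_PATTERN = re.compile(r"UMI:([ACGTN]+)")


def extract_umi(header):
    """Return the UMI barcode from a header line, or None if absent."""
    match = UMI_PATTERN.search(header)
    return match.group(1) if match else None


def pair_by_umi(r1_headers, r2_headers):
    """Match R1 and R2 headers that share a UMI (hypothetical helper)."""
    r1_by_umi = {extract_umi(h): h for h in r1_headers}
    pairs = []
    for r2 in r2_headers:
        umi = extract_umi(r2)
        if umi is not None and umi in r1_by_umi:
            pairs.append((umi, r1_by_umi[umi], r2))
    return pairs


r1_headers = [
    "@MIG UMI:TCGGCCAACAAA:8",
    "@MIG UMI:ACCTCACGGAGG:14",
    "@MIG UMI:ATAAGCCCGAGA:9",
]
r2_headers = [
    "@MIG UMI:TCGGCCAACAAA:6",
    "@MIG UMI:ACCTCACGGAGG:8",
    "@MIG UMI:ATAAGCCCGAGA:4",
]

pairs = pair_by_umi(r1_headers, r2_headers)
for umi, r1, r2 in pairs:
    print(umi, "|", r1, "|", r2)
```

Matching on the UMI rather than on file order avoids trusting that both files list reads in the same sequence, which is the concern raised above.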

    Oh, BTW... just noticed. --nproc 48 probably won't be appreciably faster than --nproc 20. Scaling is not the best with Python's multiprocessing library...

  4. Scott Christley reporter

    I didn't run with only those sequences; I just cut and pasted a few sequences from the top of each file. I kinda assumed it was an issue with the read_id (got read IDs on the brain today!), but I will try a run with just those few sequences. I've no idea what the format is. I could try to contact the user to ask, but we cannot rely on getting a response. FYI: this isn't a user-reported error; I watch jobs and proactively check on errors. Users have a tendency to ignore errors and/or not bother to report them...

  5. Jason Vander Heiden

    Yeah, it's the header. The new presto has a more informative error message:

    ERROR:  File bad_header/test_r1.fastq is invalid with exception Duplicate key 'MIG'
    

    Hrm. Have to think about what to do with this.

  6. Jason Vander Heiden

    This is the problem:

    from Bio import SeqIO
    x = SeqIO.index('/home/jason/Downloads/test_r1.fastq', 'fastq')
    
    ValueError: Duplicate key 'MIG'
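
The collision can be illustrated without Biopython. Bio.SeqIO.index keys each record by its ID, which for FASTQ is the first whitespace-delimited word of the title line. The sketch below mimics only that one step on the headers from the report; record_id is a hypothetical helper, not a Biopython function.

```python
from collections import Counter

# R1 title lines copied from the report.
headers = [
    "@MIG UMI:TCGGCCAACAAA:8",
    "@MIG UMI:ACCTCACGGAGG:14",
    "@MIG UMI:ATAAGCCCGAGA:9",
]


def record_id(title_line):
    """First word of the title line after the leading '@' (hypothetical helper)."""
    return title_line[1:].split(None, 1)[0]


# Count how often each ID occurs; any count above 1 is a key collision.
id_counts = Counter(record_id(h) for h in headers)
duplicates = [key for key, count in id_counts.items() if count > 1]

print(duplicates)  # ['MIG']
```

Every record reduces to the same ID, MIG, so an index keyed on the ID cannot tell them apart.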
    
  7. Jason Vander Heiden

    This is indeed a MIGEC header - specifically the consensus sequence output.

    Mike is making a change in MIGEC v1.2.7 that should resolve the Bio.SeqIO.index incompatibility.

    I've added a migec mode to ConvertHeaders in bc50b30 that should work with the new MIGEC version.

  8. Jason Vander Heiden

    Sort of resolved. Added a converter in ConvertHeaders, but supporting malformed headers in a general sense would be too much effort.
