Add numpress support in mzML

Issue #37 resolved
Lev Levitsky repo owner created an issue

Binary data in mzML files can be compressed using Numpress algorithms [1], e.g. with MSConvert. Such mzML files currently cannot be parsed with Pyteomics.

It looks like support for those compression types can be added using the Python implementation in [2].

Comments (19)

  1. Lev Levitsky reporter

    Latest commit adds a draft version of this. Should work on Python 2.7 and 3.x with three types of Numpress-compressed mzML files created by MSConvert.

  2. Joshua Klein

    PyMSNumpress is idiomatically C++, making it much slower for Python. I adapted it to be more Python/NumPy friendly: https://github.com/mobiusklein/pynumpress/blob/master/pynumpress/pynumpress.pyx. It should do away with the need to load the output numerical array as a list, and then convert the list into a NumPy array.

    Also, there are layered compression schemes in the controlled vocabulary. I don't know if Proteowizard supports them yet though:

    [Term]
    id: MS:1002746
    name: MS-Numpress linear prediction compression followed by zlib compression
    def: "Compression using MS-Numpress linear prediction compression and zlib." [https://github.com/ms-numpress/ms-numpress]
    is_a: MS:1000572 ! binary data compression type
    
    [Term]
    id: MS:1002747
    name: MS-Numpress positive integer compression followed by zlib compression
    def: "Compression using MS-Numpress positive integer compression and zlib." [https://github.com/ms-numpress/ms-numpress]
    is_a: MS:1000572 ! binary data compression type
    
    [Term]
    id: MS:1002748
    name: MS-Numpress short logged float compression followed by zlib compression
    def: "Compression using MS-Numpress short logged float compression and zlib." [https://github.com/ms-numpress/ms-numpress]
    is_a: MS:1000572 ! binary data compression type
    
  3. Lev Levitsky reporter

    Awesome! Thank you for bringing it up, somehow I didn't find it myself.

    I tried using your implementation in cbffcbd. It is much handier than the awkward C++-style biz, although it still requires one intermediate step of converting pure bytes to ndarray.

    It looks like specifying both zlib and numpress flags works with MSConvert, but instead of designated terms you get this:

    <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
    <cvParam cvRef="MS" accession="MS:1002314" name="MS-Numpress short logged float compression" value=""/>
    <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value="" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
    

    This is of course not handled correctly by Pyteomics.
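
    A minimal sketch of how a reader could treat both spellings the same way: the older output with two separate cvParams, and the designated layered terms quoted earlier in this thread. The function name and return shape are hypothetical, not Pyteomics API. The accessions MS:1002312 and MS:1002313 do not appear in this thread; they are assumed from the PSI-MS controlled vocabulary.

```python
# Accessions assumed from the PSI-MS controlled vocabulary. Only MS:1000574,
# MS:1002314 and MS:1002746 to MS:1002748 are quoted in this thread.
ZLIB = "MS:1000574"
NUMPRESS = {
    "MS:1002312": "linear",
    "MS:1002313": "pic",
    "MS:1002314": "slof",
}
NUMPRESS_ZLIB = {
    "MS:1002746": "linear",
    "MS:1002747": "pic",
    "MS:1002748": "slof",
}


def classify_compression(accessions):
    """Hypothetical helper: return (numpress_kind_or_None, uses_zlib)."""
    numpress_kind = None
    uses_zlib = False
    for accession in accessions:
        if accession in NUMPRESS_ZLIB:
            # Designated layered term: Numpress followed by zlib.
            numpress_kind = NUMPRESS_ZLIB[accession]
            uses_zlib = True
        elif accession in NUMPRESS:
            numpress_kind = NUMPRESS[accession]
        elif accession == ZLIB:
            uses_zlib = True
    return numpress_kind, uses_zlib


# Older MSConvert output: two separate terms plus the array type term.
old_style = classify_compression(["MS:1000574", "MS:1002314", "MS:1000515"])
# Newer output: one designated term.
new_style = classify_compression(["MS:1002748", "MS:1000515"])
```

    Both calls give the same answer, so the decoding code that follows would not need to know which MSConvert version wrote the file. The sketch does not decide the order of the layers; it assumes zlib was applied last, as the names of the layered terms state.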

  4. Lev Levitsky reporter

    Update: a more recent version (3.0.19073) actually uses the correct CV term. The previous comment pertains to 3.0.5533.

  5. Lev Levitsky reporter

    Added support for the layered compression. Also, the transform step apparently still needed to be amended for numpress output, because if a numpy array is fed to np.frombuffer with a non-matching dtype, the data get corrupted.
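
    A rough sketch of the decoding order for the layered case, assuming the layers are undone in reverse order of encoding: base64 first, then zlib, then Numpress. The function `decode_layered` and the stand-in decoder are hypothetical and are not the Pyteomics code; the stand-in only unpacks raw doubles so that the example runs without pynumpress.

```python
import base64
import struct
import zlib


def decode_layered(b64_text, numpress_decode, zlib_layer=True):
    """Undo base64, then (optionally) zlib, then hand the bytes to a decoder."""
    raw = base64.b64decode(b64_text)
    if zlib_layer:
        raw = zlib.decompress(raw)
    return numpress_decode(raw)


def fake_numpress_decode(payload):
    # Placeholder only: this is NOT a Numpress codec. It unpacks
    # little-endian doubles so the layering can be demonstrated.
    count = len(payload) // 8
    return list(struct.unpack("<%dd" % count, payload))


# Build a test string the same way: "encode", zlib-compress, base64-encode.
values = [1.0, 2.5, 4.0]
packed = struct.pack("<3d", *values)
b64_text = base64.b64encode(zlib.compress(packed)).decode("ascii")

decoded = decode_layered(b64_text, fake_numpress_decode)
```

    In real use the `numpress_decode` argument would be whichever Numpress decoder matches the CV term of the array.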

  6. Joshua Klein

    Yeah, np.frombuffer just grabs the buffer and re-interprets the bytes in that buffer as chunks of size dtype.itemsize. MSNumpress deals strictly with doubles, which for most compilers will be float64. It might make sense to test if astype can be invoked with copy=False, since that float64 array will not be used.
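
    A small illustration of the difference, using a hand-made float64 array as a stand-in for decoder output (no Numpress call is involved):

```python
import numpy as np

# Stand-in for decoded Numpress output, which is float64.
decoded = np.array([100.0, 200.5, 300.25], dtype=np.float64)

# frombuffer does not convert values. It re-slices the same bytes into
# 4-byte chunks, so three doubles turn into six unrelated float32 numbers.
reinterpreted = np.frombuffer(decoded, dtype=np.float32)

# astype converts value by value.
converted = decoded.astype(np.float32)

# With copy=False no copy is made when the dtype already matches.
same_dtype = decoded.astype(np.float64, copy=False)
```

    Here `reinterpreted` has twice as many elements as `decoded`, while `converted` keeps the three original values and `same_dtype` shares memory with `decoded`.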

    I thought Cython was handling generic conversion of Python data types to C++ std::vector<char>, but apparently it was silently permitting the code to compile with an error to be thrown at run time. It'll amount to a fused function essentially doing the conversion you are already doing.

  7. Lev Levitsky reporter

    Thank you.

    I wrote a simple test for this functionality and encountered the following error on Python 2.7:

    In [5]: pynumpress.encode_slof(data, 1.0)
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-5-76e6a6156fc5> in <module>()
    ----> 1 pynumpress.encode_slof(data, 1.0)
    
    pynumpress/pynumpress.pyx in pynumpress.pynumpress.__pyx_fused_cpdef()
    
    /usr/lib/python2.7/site-packages/numpy/__init__.py in <module>()
        140     from . import _distributor_init
        141 
    --> 142     from . import core
        143     from .core import *
        144     from . import compat
    
    /usr/lib/python2.7/site-packages/numpy/core/__init__.py in <module>()
         38 
         39 try:
    ---> 40     from . import multiarray
         41 except ImportError as exc:
         42     import sys
    
    /usr/lib/python2.7/site-packages/numpy/core/multiarray.py in <module>()
         10 import warnings
         11 
    ---> 12 from . import overrides
         13 from . import _multiarray_umath
         14 import numpy as np
    
    /usr/lib/python2.7/site-packages/numpy/core/overrides.py in <module>()
         44     ------
         45     TypeError : if no implementation is found.
    ---> 46     """)
         47 
         48 
    
    RuntimeError: implement_array_function method already has a docstring
    

    I also get stuff like this:

    Traceback (most recent call last):
      File "test_mzml.py", line 169, in test_numpress_slof
        encoded = base64.b64encode(pynumpress.encode_slof(data, pynumpress.optimal_slof_fixed_point(data)).tobytes()).decode('ascii')
      File "pynumpress/pynumpress.pyx", line 79, in pynumpress.pynumpress.__pyx_fused_cpdef
      File "/usr/lib/python2.7/site-packages/numpy/__init__.py", line 140, in <module>
        from . import _distributor_init
    ImportError: cannot import name _distributor_init
    

    This feels like an import path mix-up, but I haven't pinpointed it yet.

    On Python 3, the code works. What could be the problem? Should I report it on your repo?

  8. Joshua Klein

    This appears to be related to the shared NumPy extension state. Do you have the complete shell history leading to that command?

  9. Lev Levitsky reporter

    And again BitBucket failed to notify me of your reply.

    Here's one possible minimal example (Python 2, Linux).

    In [1]: import pynumpress
    
    In [2]: pynumpress.encode_slof([], 1.0)
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-2-facd92555439> in <module>()
    ----> 1 pynumpress.encode_slof([], 1.0)
    
    pynumpress/pynumpress.pyx in pynumpress.pynumpress.__pyx_fused_cpdef()
    
    /usr/lib/python2.7/site-packages/numpy/__init__.py in <module>()
        140     from . import _distributor_init
        141 
    --> 142     from . import core
        143     from .core import *
        144     from . import compat
    
    /usr/lib/python2.7/site-packages/numpy/core/__init__.py in <module>()
         38 
         39 try:
    ---> 40     from . import multiarray
         41 except ImportError as exc:
         42     import sys
    
    /usr/lib/python2.7/site-packages/numpy/core/multiarray.py in <module>()
         10 import warnings
         11 
    ---> 12 from . import overrides
         13 from . import _multiarray_umath
         14 import numpy as np
    
    /usr/lib/python2.7/site-packages/numpy/core/overrides.py in <module>()
         44     ------
         45     TypeError : if no implementation is found.
    ---> 46     """)
         47 
         48 
    
    RuntimeError: implement_array_function method already has a docstring
    
  10. Joshua Klein

    Odd. I can run those two lines on both Windows and Linux under Py2.7 without issue. The decoding step crashes spectacularly however. What are your NumPy and Cython versions?
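
    One way to collect that information, including where each module was loaded from, which may also help with the suspected import path mix-up. `report_versions` is a hypothetical helper, not part of pynumpress:

```python
import importlib
import sys


def report_versions(names=("numpy", "Cython")):
    """Map each module name to (version, file path), or None if not importable."""
    info = {"python": sys.version.split()[0]}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = None
        else:
            info[name] = (
                getattr(module, "__version__", "unknown"),
                getattr(module, "__file__", "unknown"),
            )
    return info


if __name__ == "__main__":
    for key, value in sorted(report_versions().items()):
        print(key, value)
```

    Running it with the same interpreter that raises the error shows whether NumPy is picked up from the expected site-packages directory.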

  11. Joshua Klein

    Hmm. I'm not able to reproduce it. Most of the reports I'm seeing about <method> method already has a docstring seem to be related to repeated initialization of the C-extension module. I'll see what else I can try out on other machines. Can you try accessing a specialization of encode_slof?

    import pynumpress
    
    pynumpress.encode_slof[list]([], 1.0)
    
  12. Lev Levitsky reporter

    This code works fine:

    In [1]: import pynumpress
       ...: 
       ...: pynumpress.encode_slof[list]([], 1.0)
       ...: 
    Out[1]: array([ 63, 240,   0,   0,   0,   0,   0,   0], dtype=uint8)
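
    A quick check of what those eight bytes are. They read as 1.0 when interpreted as a big-endian float64, which matches the fixed point argument passed in. That the output of an empty input is just the fixed point header is an inference from this output, not something stated in the thread:

```python
import struct

# The eight bytes from Out[1] above.
header = bytes(bytearray([63, 240, 0, 0, 0, 0, 0, 0]))

# Big-endian interpretation recovers the fixed point argument.
fixed_point = struct.unpack(">d", header)[0]

# Little-endian interpretation of the same bytes gives a different number.
little_endian = struct.unpack("<d", header)[0]
```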
    
  13. Joshua Klein

    Looks like the issue isn’t gathering any attention from Cython, and the NumPy maintainers weren’t sure what was happening either. I’ll fix the problem by adding some compile-time logic to generate different code for Linux on Py2.7.

  14. Joshua Klein

    https://github.com/mobiusklein/pynumpress/tree/master appears to be working on Py2.7 and Py3 on both Windows and Linux. Could you install from master and let me know if it works for you please?

    The change boils down to "if this is Python 2 on Linux, generate code such that np.ndarray is not a member of the two fused types and it will be handled by the object case at run time rather than getting its own compile-time specialization". The code will otherwise look identical when compiled under any other conditions.

    The one downside to this fix is that the compilation changes are Cython compile-time changes, not C, so that means that if I were to build an sdist of the source code and upload it to PyPI from my Windows machine, that conditional code generation change won’t be in effect even if the person who then installs it is on Linux with Python 2.7. This means that I need to add extra logic to the setup script to make Cython a hard requirement if you’re on Linux with Python 2.7, or to at least provide a wheel for that particular user. I’ll deal with that if it fixes the problem.
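
    A minimal sketch of the kind of check a setup script could make. `cython_required` and the `requirements` list are hypothetical and are not taken from the pynumpress setup script:

```python
import sys


def cython_required(platform=None, version_info=None):
    """Hypothetical helper: True on Linux with Python 2, False otherwise."""
    platform = sys.platform if platform is None else platform
    version_info = sys.version_info if version_info is None else version_info
    return platform.startswith("linux") and version_info[0] == 2


# The result could feed the requirements handed to setup().
requirements = ["numpy"]
if cython_required():
    # Regenerate the C++ source locally so the conditional code path is used.
    requirements.append("Cython")
```

    The arguments exist only to make the check testable; a setup script would call `cython_required()` with the defaults.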
