1. Lev Levitsky
  2. pyteomics
  3. Issues
Issue #16 resolved

Checking of string inclusivity in _get_info-related dictionaries of formats using base64 encoded data is dangerous

Joshua Klein
created an issue

In the mzml module, _get_info_smart uses the in operator to check to see if a particular string key is in the info variable returned by _get_info. _get_info usually returns a dict, except when it encounters a tag whose only purpose is to contain text, in which case it returns the tag's text instead of a dict.

This is dangerous if the same string may be present either as a key or as the contents of a tag's text. This should be rare since the range of values tested for is small, only "binary" and "binaryDataArray". Unfortunately, rare events do happen.

In this example, one of the base64 encoded blobs in an mzML file contains the sequence "binary" in the encoded text:


The check in the code is:

        if 'binary' in info:

To make the check handle this scenario:

        if 'binary' in info and isinstance(info, dict):

Conceivably, this could happen anywhere we might receive a string instead of a dictionary, but it's most unpredictable around the base64 blobs.

Comments (4)

  1. Joshua Klein reporter

    On the other hand, we could just try to handle the exception since it is less computationally expensive than isinstance call, but it would make the logic a little more complicated to trace. Here's that approach:

    if 'binary' in info:
                # Prepare for the unlikely scenario that this test
                # is performed on a byte string in Python2
                    types = {'32-bit float': 'f', '64-bit float': 'd'}
                    for t, code in types.items():
                        if t in info:
                            dtype = code
                            del info[t]
                    # sometimes it's under 'name'
                        if 'name' in info:
                            for t, code in types.items():
                                if t in info['name']:
                                    dtype = code
                    compressed = True
                    if 'zlib compression' in info:
                        del info['zlib compression']
                    elif 'name' in info and 'zlib compression' in info['name']:
                        info['name'].remove('zlib compression')
                        compressed = False
                        info.pop('no compression', None)
                            info['name'].remove('no compression')
                            if not info['name']: del info['name']
                        except (KeyError, TypeError):
                    b = info.pop('binary')
                    if b:
                        array = aux._decode_base64_data_array(b, dtype, compressed)
                        array = np.array([], dtype=dtype)
                    for k in info:
                        if k.endswith(' array') and not info[k]:
                            info = {k: self._convert_array(k, array)}
                        found = False
                        # workaround for https://bitbucket.org/levitsky/pyteomics/issues/11
                        if isinstance(info.get('name'), list):
                            knames = info['name']
                            for val in knames:
                                if val.endswith(' array'):
                                    info = {val: self._convert_array(val, array)}
                                    found = True
                        # last fallback
                        if not found:
                            info['binary'] = self._convert_array(None, array)
                except AttributeError:
                    if isinstance(info, dict):
  2. Lev Levitsky repo owner

    First of all, this is amazing.

    Second, I have a feeling that since and should short-circuit, isinstance will only be called when 'binary' in info is true, so the first option should be faster.

  3. Joshua Klein reporter

    Yes, the conditional will short-circuit most of the time, but the try-except would less expensive: http://stackoverflow.com/a/1835844/1137920. Then again, here's the time breakdown to loop over a file with no actual processing for each scan:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        21322   47.704    0.002   47.704    0.002 {zlib.decompress}
        21322   17.609    0.001   17.609    0.001 {binascii.a2b_base64}
        21322   14.409    0.001   14.409    0.001 {method 'encode' of 'str' objects}
        14249   13.017    0.001  123.122    0.009 xml.py:385(iterfind)
    155944/14248   11.134    0.000  108.205    0.008 xml.py:307(_get_info)
    155944/14248   10.324    0.000  108.338    0.008 mzml.py:83(_get_info_smart)

    so either works, it's just a question of which one will our future selves understand when something else inevitably goes wrong. Then we just need to put the same construct around the if "binaryDataArray" and equivalent locations in mzxml (2-3 places, as the last place is a no-op as of this morning).

    I'll do the compound boolean in the if condition send in the pull request.

  4. Log in to comment