Issue #15 resolved

cvParam Unit Awareness

Joshua Klein
created an issue

The current logic for handling cvParam tags in PSIMS standard XML documents discards unit information.

    def _handle_param(self, element, **kwargs):
        """Unpacks cvParam and userParam tags into key-value pairs"""
        types = {'int': int, 'float': float, 'string': str}
        if 'value' in element.attrib:
            try:
                if element.attrib.get('type') in types:
                    value = types[element.attrib['type']](element.attrib['value'])
                else:
                    value = float(element.attrib['value'])
            except ValueError:
                value = element.attrib['value']
            return {element.attrib['name']: value}
        else:
            return {'name': element.attrib['name']}

Here, if a unit were present, there would be attributes such as unitCVRef, unitAccession, and unitName. The first two attributes together can be used to resolve the definition of the unit from a controlled vocabulary, while the third gives a common name for the unit, which is usually enough context for a program to make the correct decision. Since this value is discarded before it reaches the caller, that is impossible. For example, some time-related fields are recorded in minutes while others are in seconds.

I propose three solutions for consideration:

  1. Create a new "Parameter" type which wraps all of the information contained in the cvParam tag, and force all callers to unwrap the object to access the primitive value therein. This breaks all client code but is more future-proof and provides more semantic information.
  2. Create a set of data type specific wrappers for float, int, and str that inherit from the matching builtin type, which has a spot for storing the unit (and other metadata?). This doesn't break client code, but it may mean that unit information is lost when passed through conversion operations.
  3. Rather than create our own wrappers, use a library like units or pint to provide this behavior. These types have been through more testing, but also support dimensionality checking so that you can operate on quantities of the same dimension or dimensionless quantities, but institutes error checking when operating on non-conforming dimensions together. This is hard to predict how it will break client code.

I'm partial to option 2 since 1 involves a lot of rewriting of code that doesn't really care about units, and option 3 is hard to analyze and adds a new dependency that can break in unpredictable ways.

I'm planning to start work on this issue in a few days, but I wanted to get the options reviewed before I started.

Comments (6)

  1. Lev Levitsky repo owner

    That would be a great addition, but sounds like a lot of work. Wouldn't it be easier to convert the values to some standard units instead (minutes to seconds, etc.)? Because in my mind the most common reason why we need this is that we often compare values between different files, and they may be in different units. Simple conversion without any subclassing would solve this. Do you have other use cases in mind? Also, what happens with the current checks for 'type'? Finally, what kind of functionality do you intend to add? Is it only unit metadata attributes or do you mean to write converters, too?

  2. Joshua Klein reporter

    The functionality I intend to add is just using a wrapper type instead of a straight primitive type for storing the value of unit-annotated cvParams. Unit conversion is something that can be left to the application logic. I'm not sure what you mean by "checks for 'type'". If you mean cases where you use instanceof(), the wrapper types are straight subclasses of the primitive type so will pass this check without issue. I might also have to apply some more aggressive patching to the mzXML parser which I wrote quick and dirty, didn't have an up-to-date xsd for reference, and doesn't have semantic data regarding units and so I probably blindly convert times to seconds or crash and burn when input doesn't conform to my expectations.

    Here are the wrapper classes:

    class unitint(int):
        def __new__(cls, value, unit_metadata):
            inst = super(unitint, cls).__new__(cls, value)
            inst.unit_info = unit_metadata
            return inst
    
        def __repr__(self):
            base = super(unitint, self).__repr__()
            if self.unit_info:
                return "%s %s" % (base, self.unit_info)
            else:
                return base
    
    
    class unitfloat(float):
        def __new__(cls, value, unit_metadata):
            inst = super(unitfloat, cls).__new__(cls, value)
            inst.unit_info = unit_metadata
            return inst
    
        def __repr__(self):
            base = super(unitfloat, self).__repr__()
            if self.unit_info:
                return "%s %s" % (base, self.unit_info)
            else:
                return base
    
    
    class unitstr(str):
        def __new__(cls, value, unit_metadata):
            inst = super(unitstr, cls).__new__(cls, value)
            inst.unit_info = unit_metadata
            return inst
    
        def __repr__(self):
            base = super(unitstr, self).__repr__()
            if self.unit_info:
                return "%s %s" % (base, self.unit_info)
            else:
                return base
    

    That's all. Then, in the _handle_param method, I'd use unitint instead of int in the types dictionary propagate the unit information as a dict or perhaps a more specialized container in the unit_info attribute of each instance. If no unit is specified, the .unit_info is set to None so that if new client code were to check for a unit on something that the input data did not include a unit for, they would check for unit_info is None instead of hasattr(value, 'unit_info'). I would do the same for float and str, and because these are straight subclasses, not magical method forwarding wrappers, all type checks and conversions work as normal. unitstr still has all of the methods of str and

    In [4]: unitint(5, 'mz')
    Out[4]: 5 mz
    
    In [5]: x = unitint(5, 'mz')
    
    In [6]: x.unit_info
    Out[6]: 'mz'
    
    In [7]: x * 2
    Out[7]: 10
    
    In [8]: x2 = x * 2
    
    In [9]: x2 
    Out[9]: 10  # Notice unit information has been lost
    
    In [10]: isinstance(x, int)
    Out[10]: True
    
    In [11]: import numpy as np
    
    In [12]: np.array([x])
    Out[12]: array([5])
    
    In [13]: np.array([x]).dtype
    Out[13]: dtype('int32')  # Notice unit information has been lost and implicitly converted into a native int
    

    The only downside is that the metadata is lost if any in-place operations are used, so += would strip the unit metadata, and it does not propagate through copy-making operators like *. These would require rewriting all the operators accordingly, and is not worth it if its not needed. While this method works for scalars, it explicitly ignores things like units on array-valued fields like those in mzML. There are cases where these would need to be handled intelligently, like chromatogram tags have <cvParam cvRef="MS" accession="MS:1000595" name="time array" value="" unitCvRef="UO" unitAccession="UO:0000031" unitName="minute"/> where this unit could just as easily be seconds, but that would mean subclassing ndarray, which I don't know well enough to say how it will behave across the whole PyData ecosystem at the Python and C level.

    One reason why this should be done as opposed to choosing a "sane default" unit for each parameter is that it may not be possible to make the "right choice" in every instance, or it may not be possible to convert units. For example in mzIdentML, mass accuracy parameters for <ParentTolerance> and <FragmentTolerance> are encoded with a cvParam whose value is the magnitude of the tolerance and the unit could be "daltons" or "parts per million", for which there is no 1:1 mapping since "dalton" mass accuracy is fixed while "parts per million" mass accuracy is a dynamic quantity.

    <!-- with Dalton -->
    <ParentTolerance>
      <cvParam accession="MS:1001412" name="search tolerance plus value" value="1.5" cvRef="PSI-MS" unitAccession="UO:0000221" unitName="dalton" unitCvRef="UO"/>
      <cvParam accession="MS:1001413" name="search tolerance minus value" value="1.5" cvRef="PSI-MS" unitAccession="UO:0000221" unitName="dalton" unitCvRef="UO"/>
    </ParentTolerance>
    <!-- with PPM -->
    <ParentTolerance>
      <cvParam accession="MS:1001412" cvRef="PSI-MS" unitCvRef="UO" unitName="parts per million" unitAccession="UO:0000169" value="10.0" name="search tolerance plus value"/>
      <cvParam accession="MS:1001413" cvRef="PSI-MS" unitCvRef="UO" unitName="parts per million" unitAccession="UO:0000169" value="10.0" name="search tolerance minus value"/>
    </ParentTolerance>
    

    Admittedly, cases like this one are uncommon, and the majority of the units could be safely ignored. There are probably more than 10 MB of dead weight in a single 1 hour gradient mzML file dedicated to noting that a quantity has units "m/z" or "number of detector counts" for intensities.

    Also admittedly, I found myself thinking about this only after I got back to writing an mzML serializer. Ideally, I'd read values in from one mzML file, apply transformations like smoothing, centroiding, deisotope/charge state deconvolute, and then write directly out to a new (smaller) mzML file. I realized that once units were stripped, I had to hard code them back in. If you'd accept the pull request when the serializer library is finished, these unit wrappers would hook into the 'Units' ontology that PSI-MS uses to retrieve more metadata. The working copy for that is hosted separately while I work on it: https://github.com/mobiusklein/psims

  3. Lev Levitsky repo owner

    By current checks for 'type' I meant the 'type' attribute checking in the current code of _handle_param. You're saying we should look at unitCVRef, unitAccession, and unitName instead? What happens with the 'type' info? (oh, wait, is "type" for userParam and "unitName" for cvParam? Seems like it in my test files)

    Also, there is much more code for conversion of non-cvparam stuff using _schema_info, which also uses the same kind of checks for "type" attribute (although on the schema tree elements). I remember having to edit _schema_defaults manually for some elements. Do we want to keep these mechanisms separate?

    In the end there is no difference between values from cvParam tags and other elements in the final dict. Do we want to keep everything as is except for _handle_param and have both regular floats and unitfloats mixed in the output? I'm just asking because these things have always seemed quite abstract for me, and I never really got into the philosophy behind the structure when I was writing the parsers (which you helped re-write at least once already), so I'm not sure what would be the "proper" way. I see that your proposal doesn't break anything, though, so I have no reasons to oppose it.

  4. Joshua Klein reporter

    Schema value types, taken from the document XSDs, are abstract datatypes in their own right. Some are directly mappable to primitive types, as we do. Others might be compound datatypes, which are made up of compositions of elements. They might imply units, but they must be consistent or somehow encoded by another property of the document. For PSI XML, this kind of design was superseded by these semantic parameters. This is not the case for mzXML where they do things like encode time in a string with the prefix "PT" and suffix "S" (for seconds) wrapped around a floating point number, where the XSD description of these attributes is "time duration type". That means that documents like that require special handling. I might address that for mzXML if I can find full documentation. Doing that generally is beyond the scope of the work I propose to do.

    As for mixing unit- and plain value types, it would make a cleaner API if everything were consistent. That could be handled in xml.XMLValueConverter where unit_info is just set to None since units cannot be inferred automatically. I'm tempted to ignore this case because "the user will know when to expect a unit and when not to", but I think that because I've had to study the schemas for the document types I use, and if I were to be exposed to a new document type tomorrow, without documentation about different values, I'd probably want to not guess at whether a value had a unit.

    userParam's type field is an inline XSD specification for that tag's value, since by definition, userParams cannot be pre-specified.

    <userParam name="[Thermo Trailer Extra]Monoisotopic M/Z:" value="810.41522216796875" type="xsd:float"/>
    

    cvParam's identity implies the data type in its CV definition:

    [Term]
    id: MS:1000041
    name: charge state
    def: "The charge state of the ion, single or multiple and positive or negatively charged." [PSI:MS]
    synonym: "z" EXACT []
    xref: value-type:xsd\:int "The allowed value-type for this CV term." # <-- data type
    is_a: MS:1000455 ! ion selection attribute
    

    As it happens, we don't use this information since the parsers don't access the CV to resolve terms, so we always make them float if numeric or str otherwise (snippet of _handle_param):

     if 'value' in element.attrib:
                try:
                    if element.attrib.get('type') in types:  # If the type is specified, as in userParam
                        value = types[element.attrib['type']](element.attrib['value'])
                    else: # Otherwise treat it like a float
                        value = float(element.attrib['value'])
                except ValueError:
                    value = element.attrib['value']  # Handle the case where it cannot be converted into a numeric value from a string by just leaving it as a string
                return {element.attrib['name']: value}
    
    <selectedIon>
      <cvParam cvRef="MS" accession="MS:1000744" name="selected ion m/z" value="1061.424072265625" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
      <cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="3"/> <!-- will be converted to float instead int -->
      <cvParam cvRef="MS" accession="MS:1000042" name="peak intensity" value="7.069659375e05" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
    </selectedIon>
    
  5. Log in to comment