More of a question than issue

Issue #19 resolved
Mateusz Łącki
created an issue

Hi,

First of all, this software is great, thanks. I am using it to access mzXml files, mainly like that:

with mzxml.read(path) as reader:
    k = 0
    for spectrum in reader:
        mz = spectrum['m/z array']
        intensity = spectrum['intensity array']
    yield mz, intensity

My question is: do you have a direct generator of (mz,intensity) tuples? I mean something like that:

with mzxml.read(path) as reader:
    k = 0
    for spectrum in reader:
        for mz, intensity in spectrum:
            yield mz, intensity, spectrum.run_numbers

I don't know how your parser works underneath Python, but I was wondering if you can save on RAM.

Best wishes!

Comments (7)

  1. Lev Levitsky repo owner

    Hi Mateusz,

    thank you for using Pyteomics and for your kind feedback.

    The m/z and intensities from each scan are decoded from binary as numpy arrays, rather than one by one, so no direct iterator is available. You can emulate it by using zip() on the two arrays, though it won't save any RAM.

    Best regards, Lev

  2. Lev Levitsky repo owner

    Do you mean to seamlessly iterate over multiple files, or to perform some kind of spectrum averaging?

    For the former, you can use mzxml.chain instead of mzxml.read. For the latter, no ready solution is available, but with numpy arrays it should not be too hard to do. Let me know if you have further questions.

  3. Mateusz Łącki reporter

    I meant more or less the latter, though I don't want to average the spectra but to put them atop of each other.

    I will give you an example of more or less what I want to achieve as an output.

    Let S1 = {(10.1, 20), (12.2, 200)} and S2 = {(12.25, 50 ), (13.25, 200)} be two very simple spectra. Tuples contain m/z ratio and intensity.

    I want to: 1. get rid of too small peaks (with height smaller than, say 40.0) 2. round the m/z of the remaining peaks to one significant digit 3. sum their intensities so as to obtain: S3 = {(12.2, 200+50=250), (13.25, 200)}

    Your software lets me iterate through S1, S2, ...

    In pure python (ok, using numpy), this can be done using:

    def threshold_round_and_aggregate_peaks(peaks,
                                            spectral_intensity_cut_off=0.0,
                                            precision_digits=2):
        """
        Apply thresholding on peaks,
        round their m/z values,
        and aggregate the result.
    
        Parameters
        =======
        peaks : iterable
            Iterates over tuples (m_over_z, intensity).
        spectral_intensity_cut_off : float
    
        precision_digits : float
            The number of digits after which the floats get rounded.
        Remarks
        =======
        This function is purely pythonic (if the peak iterable is).
        It is 8 times slower than the numpy version.
        """
        nice_spectrum = Counter()
        for mz, intensity in peaks:
            if intensity >= spectral_intensity_cut_off:
                nice_spectrum[round(mz, precision_digits)] += intensity
        return Spectrum(mass=np.array(list(nice_spectrum.keys())),
                        intensity=np.array(list(nice_spectrum.values())))
    

    That's ok, but since you already use numpy, it can be done four to six times faster, by calling read_spectrum_from_mzxml,

    Spectrum = namedtuple('Spectrum', 'mz intensity')
    
    def aggregate(keys, values=None):
        """Aggregate values with the same keys.
    
        Parameters
        ----------
        keys : array
            Keys, usually m/z values.
        values : array
            Values to aggregate, usually intensities.
    
        Returns
        -------
        out : tuple
            A tuple containing unique keys and aggregated values.
        """
        uniqueKeys, indices = np.unique(keys, return_inverse=True)
        return uniqueKeys, np.bincount(indices, weights=values)
    
    
    def stack_spectra(spec1, spec2):
        """Merge two spectra into one, aggregating out the common m/z values.
    
        Parameters
        ----------
        spec1 : tuple of two numpy arrays
            A mass spectrum:
            an array of m/z ratios and an array of corresponding intensities.
        spec2 : tuple of two numpy arrays
            A mass spectrum:
            an array of m/z ratios and an array of corresponding intensities.
        """
        masses_over_charge = np.concatenate((spec1[0], spec2[0]))
        intensities = np.concatenate((spec1[1], spec2[1]))
        return aggregate(masses_over_charge, intensities)
    
    
    def get_distributions_from_mzxml(path,
                                     spectral_intensity_cut_off=0.0,
                                     precision_digits=2):
        """
        Generate a sequence of rounded and trimmed spectra from
        individual runs of the instrument.
    
        Parameters
        ----------
        path : str
            Path to the mzXml file containing the mass spectrum.
        spectral_intensity_cut_off : float
            The cut off value for peak intensity.
        precision_digits : float
            The number of digits after which the floats get rounded.
        Returns
        -------
        out : generator
            Generates tuples of numpy arrays corresponding to different runs
            of the experimental spectrum.
        """
        with mzxml.read(path) as reader:
            for spectrum in reader:
                mzs = spectrum['m/z array']
    
                intensities = spectrum['intensity array']
    
                mzs = mzs[intensities >=
                          spectral_intensity_cut_off]
    
                mzs = np.round(mzs, precision_digits)
                intensities = intensities[intensities >=
                                          spectral_intensity_cut_off]
                yield mzs, intensities
    
    
    def read_spectrum_from_mzxml(path,
                                 spectral_intensity_cut_off=0.0,
                                 precision_digits=2):
        """
        Read spectrum form an mzXml and merge runs of the instrument.
        Parameters
        ----------
        path : str
            Path to the mzXml file containing the mass spectrum.
        spectral_intensity_cut_off : float
            The cut off value for peak intensity.
        precision_digits : float
            The number of digits after which the floats get rounded.
        Returns
        -------
        out : Spectrum
        """
        mz, intensity = reduce(
            stack_spectra,
            get_distributions_from_mzxml(path,
                                         spectral_intensity_cut_off,
                                         precision_digits))
        return Spectrum(mz=mz, intensity=intensity)
    

    I was wondering if this can be done even faster. Do you read spectra in Python, or do you have some C library underneath?

    If it's the latter, then it might be worthwile to write a generator in C and expose it to Python using CFFI.

    Sorry for the code that is not reduced for the post, but I guess you can understand what I want to do anyway :)

    Best wishes!

  4. Joshua Klein

    All low-level XML parsing is done using lxml, a Python binding for libxml2. Using the iterative parsing interface that library provides in Python, we recursively unpack XML subtrees into nested dict objects. One of those tags is the <peaks> tag which holds the m/z, intensity pairs for containing <scan> tag.

    If you're not interested in any metadata, and you're able to assume that every scan should be aggregated (e.g. no interleaved MS1 and MS2 scans), then you can use this technique to get access to a structured numpy array of (m/z, intensity) pairs:

    from pyteomics import mzxml
    from lxml import etree
    
    def iterpeaks(path):
        for event, tag in etree.iterparse(path):
            if tag.tag.endswith("peaks"):
                yield mzxml._decode_peaks(tag.attrib, tag.text)
            tag.clear()
    

    The structured array positions are pairs, but you can access the "m/z array" and "intensity array" dimensions by passing those string keys instead of numerical indices.

    This read all the peaks out of an mzXML file with 14600 scans in 6.4 seconds compared to 10.6 seconds. Most of this time is spent on decompression and base64 decoding, both of which are calling into modules implemented in C in the standard library.

  5. Log in to comment