More of a question than issue

Issue #19 resolved
Mateusz Łącki
created an issue

Hi,

First of all, this software is great, thanks. I am using it to access mzXML files, mainly like this:

```
with mzxml.read(path) as reader:
    for spectrum in reader:
        mz = spectrum['m/z array']
        intensity = spectrum['intensity array']
        yield mz, intensity
```

My question is: do you have a direct generator of (mz, intensity) tuples? I mean something like this:

```
with mzxml.read(path) as reader:
    for spectrum in reader:
        for mz, intensity in spectrum:
            yield mz, intensity, spectrum.run_numbers
```

I don't know how your parser works under the hood, but I was wondering whether a direct iterator could save some RAM.

Best wishes!

1. repo owner

Hi Mateusz,

thank you for using Pyteomics and for your kind feedback.

The m/z and intensities from each scan are decoded from binary as numpy arrays, rather than one by one, so no direct iterator is available. You can emulate it by using `zip()` on the two arrays, though it won't save any RAM.
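For illustration, here is what the `zip()` emulation looks like on two stand-in arrays (the values are made up; in practice they would come from `spectrum['m/z array']` and `spectrum['intensity array']`):

```python
import numpy as np

# Stand-ins for the decoded arrays of a single spectrum (made-up values).
mz = np.array([100.1, 200.2, 300.3])
intensity = np.array([10.0, 20.0, 30.0])

# zip() yields (m/z, intensity) pairs lazily, but both source arrays
# remain fully in memory, so nothing is saved RAM-wise.
pairs = list(zip(mz, intensity))
```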

Best regards, Lev

2. reporter

Do you have any out-of-the-box solution to aggregate the spectra from different runs?

3. repo owner

Do you mean to seamlessly iterate over multiple files, or to perform some kind of spectrum averaging?

For the former, you can use `mzxml.chain` instead of `mzxml.read`. For the latter, no ready solution is available, but with numpy arrays it should not be too hard to do. Let me know if you have further questions.
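As a tiny sketch of the averaging case (made-up values, and the spectra are assumed to already share a common m/z grid):

```python
import numpy as np

# Two spectra sampled on the same m/z grid (made-up values).
s1 = np.array([10.0, 40.0, 0.0])
s2 = np.array([30.0, 20.0, 10.0])

# With aligned arrays, averaging is a single numpy call.
avg = np.mean([s1, s2], axis=0)
```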

4. reporter

I meant more or less the latter, though I don't want to average the spectra but to stack them on top of each other.

I will give you an example of more or less what I want to achieve as an output.

Let S1 = {(10.1, 20), (12.2, 200)} and S2 = {(12.25, 50), (13.25, 200)} be two very simple spectra, where each tuple contains an m/z ratio and an intensity.

I want to:

1. get rid of peaks that are too small (with intensity below, say, 40.0),
2. round the m/z of the remaining peaks to one decimal digit,
3. sum the intensities of peaks that now share an m/z,

so as to obtain S3 = {(12.2, 200+50=250), (13.2, 200)}.

Your software lets me iterate through S1, S2, ...

In pure python (ok, using numpy), this can be done using:

```
from collections import Counter, namedtuple

import numpy as np

Spectrum = namedtuple('Spectrum', 'mz intensity')

def threshold_round_and_aggregate_peaks(peaks,
                                        spectral_intensity_cut_off=0.0,
                                        precision_digits=2):
    """Apply thresholding on peaks, round their m/z values,
    and aggregate the result.

    Parameters
    ----------
    peaks : iterable
        Iterates over tuples (m_over_z, intensity).
    spectral_intensity_cut_off : float
        The cut-off value for peak intensity.
    precision_digits : int
        The number of digits after which the floats get rounded.

    Remarks
    -------
    This function is purely pythonic (if the peak iterable is).
    It is 8 times slower than the numpy version.
    """
    nice_spectrum = Counter()
    for mz, intensity in peaks:
        if intensity >= spectral_intensity_cut_off:
            nice_spectrum[round(mz, precision_digits)] += intensity
    return Spectrum(mz=np.array(list(nice_spectrum.keys())),
                    intensity=np.array(list(nice_spectrum.values())))
```

That's ok, but since you already use numpy, it can be done four to six times faster with array operations, by calling `read_spectrum_from_mzxml` below:

```
from collections import namedtuple
from functools import reduce

import numpy as np
from pyteomics import mzxml

Spectrum = namedtuple('Spectrum', 'mz intensity')

def aggregate(keys, values=None):
    """Aggregate values with the same keys.

    Parameters
    ----------
    keys : array
        Keys, usually m/z values.
    values : array
        Values to aggregate, usually intensities.

    Returns
    -------
    out : tuple
        A tuple containing unique keys and aggregated values.
    """
    unique_keys, indices = np.unique(keys, return_inverse=True)
    return unique_keys, np.bincount(indices, weights=values)

def stack_spectra(spec1, spec2):
    """Merge two spectra into one, aggregating out the common m/z values.

    Parameters
    ----------
    spec1 : tuple of two numpy arrays
        A mass spectrum:
        an array of m/z ratios and an array of corresponding intensities.
    spec2 : tuple of two numpy arrays
        A mass spectrum:
        an array of m/z ratios and an array of corresponding intensities.
    """
    masses_over_charge = np.concatenate((spec1[0], spec2[0]))
    intensities = np.concatenate((spec1[1], spec2[1]))
    return aggregate(masses_over_charge, intensities)

def get_distributions_from_mzxml(path,
                                 spectral_intensity_cut_off=0.0,
                                 precision_digits=2):
    """Generate a sequence of rounded and trimmed spectra from
    individual runs of the instrument.

    Parameters
    ----------
    path : str
        Path to the mzXML file containing the mass spectrum.
    spectral_intensity_cut_off : float
        The cut-off value for peak intensity.
    precision_digits : int
        The number of digits after which the floats get rounded.

    Returns
    -------
    out : generator
        Generates tuples of numpy arrays corresponding to different runs
        of the experimental spectrum.
    """
    with mzxml.read(path) as reader:
        for spectrum in reader:
            mzs = spectrum['m/z array']
            intensities = spectrum['intensity array']
            above_cut_off = intensities >= spectral_intensity_cut_off
            mzs = np.round(mzs[above_cut_off], precision_digits)
            intensities = intensities[above_cut_off]
            yield mzs, intensities

def read_spectrum_from_mzxml(path,
                             spectral_intensity_cut_off=0.0,
                             precision_digits=2):
    """Read a spectrum from an mzXML file and merge runs of the instrument.

    Parameters
    ----------
    path : str
        Path to the mzXML file containing the mass spectrum.
    spectral_intensity_cut_off : float
        The cut-off value for peak intensity.
    precision_digits : int
        The number of digits after which the floats get rounded.

    Returns
    -------
    out : Spectrum
    """
    mz, intensity = reduce(
        stack_spectra,
        get_distributions_from_mzxml(path,
                                     spectral_intensity_cut_off,
                                     precision_digits))
    return Spectrum(mz=mz, intensity=intensity)
```
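For what it's worth, here is the `np.unique`/`np.bincount` idiom applied to the S1/S2 toy example from above, after thresholding at 40 and rounding to one decimal digit:

```python
import numpy as np

# Peaks surviving the threshold, with m/z already rounded (toy values).
mz = np.array([12.2, 12.2, 13.2])
intensity = np.array([200.0, 50.0, 200.0])

# np.unique collapses duplicate keys and reports, for every input key,
# the index of its unique value; bincount then sums intensities per index.
unique_mz, indices = np.unique(mz, return_inverse=True)
summed = np.bincount(indices, weights=intensity)
```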

I was wondering if this can be done even faster. Do you read spectra in Python, or do you have some C library underneath?

If it's the latter, then it might be worthwhile to write a generator in C and expose it to Python using CFFI.

Sorry for posting code that is not trimmed down, but I guess you can understand what I want to do anyway :)

Best wishes!

repo owner

All low-level XML parsing is done using `lxml`, a Python binding for `libxml2`. Using the iterative parsing interface that library provides in Python, we recursively unpack XML subtrees into nested `dict` objects. One of those tags is the `<peaks>` tag, which holds the (m/z, intensity) pairs for the containing `<scan>` tag.

If you're not interested in any metadata, and you're able to assume that every scan should be aggregated (e.g. no interleaved MS1 and MS2 scans), then you can use this technique to get access to a structured numpy array of (m/z, intensity) pairs:

```
from pyteomics import mzxml
from lxml import etree

def iterpeaks(path):
    for event, tag in etree.iterparse(path):
        if tag.tag.endswith("peaks"):
            yield mzxml._decode_peaks(tag.attrib, tag.text)
            tag.clear()
```

Each element of the structured array is an (m/z, intensity) pair, but you can access the "m/z array" and "intensity array" dimensions by passing those string keys instead of numerical indices.
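A minimal illustration of that field access (the dtype below just mimics the shape of the decoded array; the values are made up):

```python
import numpy as np

# A structured array shaped like the decoded peaks (made-up values).
peaks = np.array([(100.1, 10.0), (200.2, 20.0)],
                 dtype=[('m/z array', np.float64),
                        ('intensity array', np.float64)])

mzs = peaks['m/z array']      # field access by string key
first = peaks[0]              # row access by numerical index
```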

This reads all the peaks out of an mzXML file with 14600 scans in 6.4 seconds, compared to 10.6 seconds for the full parser. Most of that time is spent on decompression and base64 decoding, both of which call into modules implemented in C in the standard library.
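Those two steps can be sketched in isolation like this (a toy round trip; real mzXML payloads additionally record byte order and precision in the `<peaks>` attributes):

```python
import base64
import struct
import zlib

# Toy payload: big-endian 32-bit (m/z, intensity) pairs, zlib-compressed
# and base64-encoded, mimicking a compressed <peaks> element.
pairs = [(100.1, 10.0), (200.2, 20.0)]
raw = b''.join(struct.pack('>ff', mz, i) for mz, i in pairs)
encoded = base64.b64encode(zlib.compress(raw))

# Decoding reverses both steps; zlib and base64 are C-backed modules.
decoded = zlib.decompress(base64.b64decode(encoded))
out = [struct.unpack_from('>ff', decoded, 8 * k) for k in range(len(pairs))]
```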

5. reporter

Cool, this really works faster. I didn't notice before that you can get tuples into a numpy array.

Thanks!