Loading IdentiPy pep.xml files in PeptideShaker

Issue #6 new
Anonymous created an issue

Hi,

I'm trying to implement support for loading IdentiPy pep.xml files in PeptideShaker (http://compomics.github.io/projects/peptide-shaker.html), but I've come across a couple of issues with the pep.xml files that I hope you can take a look at?

1) The information about the PTMs is missing in the search_summary. We use this information to figure out which PTMs are fixed. Here's an example of how this is annotated in Comet (http://comet-ms.sourceforge.net):

<aminoacid_modification aminoacid="M" massdiff="15.994915" mass="147.035400" variable="Y" symbol="*"/>

<aminoacid_modification aminoacid="C" massdiff="57.021464" mass="160.030649" variable="N"/>

2) The name of the spectrum file should (according to the pep.xml schema) be written without the file extension, i.e. base_name="./my_spectra.pep.xml" should be changed to base_name="./my_spectra".

3) I don't see how to map back to the originating spectrum from the information provided in the spectrum_query tag. At first I assumed that spectrum="Spectrum_35528" referred to spectrum number 35528 in the mgf file used as input, but this number seems to go higher than the number of spectra in the mgf file. Would it be possible to instead include the spectrumNativeID attribute, referring to the spectrum title in the mgf file? This is how it is implemented in Comet, and it makes the mapping straightforward.

4) Have you considered introducing CV terms for the IdentiPy scores? These will for example be needed when converting the results to mzIdentML for submission to PRIDE (https://www.ebi.ac.uk/pride). Here's again an example from Comet: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1002252
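
Points 1 and 2 can be illustrated with a minimal sketch. It parses the two Comet lines quoted above, wrapped in a bare search_summary element, and separates fixed from variable PTMs using the variable attribute. Real pep.xml files declare an XML namespace, which is ignored here for brevity, and the helper names (split_modifications, strip_pepxml_suffix) are hypothetical and not part of IdentiPy or PeptideShaker.

```python
import xml.etree.ElementTree as ET

# The two Comet lines quoted above, wrapped in a bare element for illustration.
# Real pep.xml files are namespaced; this sketch ignores that.
SNIPPET = """
<search_summary>
  <aminoacid_modification aminoacid="M" massdiff="15.994915" mass="147.035400" variable="Y" symbol="*"/>
  <aminoacid_modification aminoacid="C" massdiff="57.021464" mass="160.030649" variable="N"/>
</search_summary>
"""


def split_modifications(summary):
    """Return (fixed, variable) lists of (aminoacid, massdiff) tuples."""
    fixed, variable = [], []
    for mod in summary.iter("aminoacid_modification"):
        entry = (mod.get("aminoacid"), float(mod.get("massdiff")))
        # variable="Y" marks a variable PTM; anything else is treated as fixed.
        if mod.get("variable") == "Y":
            variable.append(entry)
        else:
            fixed.append(entry)
    return fixed, variable


def strip_pepxml_suffix(base_name):
    """Hypothetical helper: drop a trailing '.pep.xml' from a base_name value."""
    suffix = ".pep.xml"
    if base_name.endswith(suffix):
        return base_name[:-len(suffix)]
    return base_name


summary = ET.fromstring(SNIPPET.strip())
fixed, variable = split_modifications(summary)
print("fixed:", fixed)
print("variable:", variable)
print(strip_pepxml_suffix("./my_spectra.pep.xml"))
```

Without the aminoacid_modification entries there is nothing for such a reader to iterate over, which is why their absence makes it impossible to tell fixed PTMs from variable ones.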

Hopefully these things should not be too hard to fix?

Please let me know if you need more details.

Best regards,

Harald

Comments (10)

  1. Lev Levitsky repo owner

    Hi Harald,

    thank you for reaching out! It's great that IdentiPy may get PeptideShaker support. We'll do our best to cooperate.

    We have addressed issues 1 and 2 on your list. Please let us know if it works for you as expected with the latest commit.

    As for 3, we will need additional information. The "spectrum" attribute in the pepXML file is set to the "title" from MGF or "id" from mzML file. If that's not the case for you, we'd be grateful for a sample input file to reproduce the problem.

    We'll look into point 4 as well. Thank you for bringing it to our attention.

  2. Harald

    Hi Lev,

    Thanks for getting back to me so quickly!

    The IdentiPy pep.xml file I was testing for import into PeptideShaker was in fact not created by me, but rather one I got from one of our users. You can see the corresponding PeptideShaker issue here: https://github.com/compomics/peptide-shaker/issues/309, where you will also find a link to the files I was given.

    Looking at it again, the problem with the spectrum mapping seems mainly to be due to me not being familiar with the format of the titles in the given mgf file. You are correct that the "spectrum" attribute matches the spectrum title from the mgf file. However, I always assumed that the "spectrum" attribute was more of an internal identifier inside the pep.xml file, and that it would therefore be more correct to use the "spectrumNativeID" attribute to annotate the name (or id if mzML) of the original spectrum? At least this would make the parsing of pep.xml files more consistent, and it should be straightforward to add?

    The good news is that after adding a quick hack to parse and use the "spectrum" attribute as if it was the "spectrumNativeID" attribute, I can now load the original pep.xml file into PeptideShaker.

    I will ask Chris to redo his IdentiPy search and will get back to you as soon as we have been able to verify that point 1 and 2 have been solved.

    Best regards,

    Harald
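
The "quick hack" described above could look roughly like the following sketch: prefer the "spectrumNativeID" attribute when it is present and fall back to "spectrum" otherwise. This is not PeptideShaker's actual code (PeptideShaker is written in Java). The function name spectrum_title is hypothetical, and the "q1" attribute value is made up for illustration.

```python
import xml.etree.ElementTree as ET


def spectrum_title(spectrum_query):
    """Prefer spectrumNativeID; fall back to the spectrum attribute."""
    native_id = spectrum_query.get("spectrumNativeID")
    if native_id is not None:
        return native_id
    return spectrum_query.get("spectrum")


# Illustrative elements only. "Spectrum_35528" is the title mentioned in the
# thread; "q1" is a made-up internal identifier.
old_style = ET.fromstring('<spectrum_query spectrum="Spectrum_35528"/>')
new_style = ET.fromstring(
    '<spectrum_query spectrum="q1" spectrumNativeID="Spectrum_35528"/>'
)

print(spectrum_title(old_style))
print(spectrum_title(new_style))
```

With this fallback, files that only carry the title in "spectrum" and files that also provide "spectrumNativeID" resolve to the same mgf title.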

  3. Harald

    Hi again,

    I've now tested the pep.xml file from the latest version of IdentiPy and can confirm that point 2 has been fixed.

    There is however an error in the way the PTMs are annotated as part of the search summary. The "mass" attribute should be the total mass of the residue after adding the PTM, while in the IdentiPy pep.xml file it is set to the monoisotopic mass of the amino acid.

    Here's a comparison between the same PTMs annotated in IdentiPy and Comet:

    IdentiPy:

    <aminoacid_modification aminoacid="C" massdiff="57.021464" mass="103.00919" variable="N"/>

    <aminoacid_modification aminoacid="K" massdiff="229.162932" mass="128.09496" variable="N"/>

    <terminal_modification terminus="n" massdiff="229.162932" mass="459.333689" variable="N"/>

    <aminoacid_modification aminoacid="M" massdiff="15.994915" mass="131.04049" variable="Y"/>

    <terminal_modification terminus="n" massdiff="42.01056" mass="272.181317" variable="Y"/>

    Comet:

    <aminoacid_modification aminoacid="C" massdiff="57.021464" mass="160.030649" variable="N"/>

    <aminoacid_modification aminoacid="K" massdiff="229.162932" mass="357.257895" variable="N"/>

    <terminal_modification terminus="N" massdiff="229.162932" mass="230.170757" variable="N" protein_terminus="N"/>

    <aminoacid_modification aminoacid="M" massdiff="15.994915" mass="147.035400" variable="Y" symbol="#"/>

    <terminal_modification terminus="N" massdiff="42.010565" mass="272.181322" variable="Y" protein_terminus="Y" symbol="*"/>

    Hopefully easy to fix?

    Best regards,

    Harald
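
The relationship described above (the "mass" attribute equals the residue mass plus "massdiff") can be checked against the numbers in the comparison. The sketch below takes the residue masses that IdentiPy wrote into "mass", adds the corresponding "massdiff", and compares the result with the value Comet reports. Only the three amino acid modifications are included, because the thread does not spell out how the reference mass for terminal modifications is defined. The function names are hypothetical.

```python
# (massdiff, "mass" attribute as written by IdentiPy), copied from the
# comparison above. IdentiPy's value is the unmodified residue mass.
IDENTIPY = {
    "C": (57.021464, 103.00919),
    "K": (229.162932, 128.09496),
    "M": (15.994915, 131.04049),
}

# "mass" attribute as written by Comet for the same modifications.
COMET = {"C": 160.030649, "K": 357.257895, "M": 147.035400}


def modified_residue_mass(residue_mass, massdiff):
    """Total mass of the residue after adding the PTM."""
    return residue_mass + massdiff


def deviations():
    """Difference between the recomputed mass and the value Comet reports."""
    return {
        aa: modified_residue_mass(residue_mass, massdiff) - COMET[aa]
        for aa, (massdiff, residue_mass) in IDENTIPY.items()
    }


for aa, delta in deviations().items():
    print(f"{aa}: recomputed mass differs from Comet by {delta:+.6f} Da")
```

For all three residues the recomputed value agrees with Comet's to within about 1e-5 Da, a difference that comes from rounding in the quoted masses.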

  4. Mark Ivanov

    Hi Harald,

    I've fixed the modification mass bug and added "spectrumNativeID" to pepXML.

    I've processed a similar file from Chris's study (but used mzML prepared with msconvert instead of mgf). The resulting pepXML can be found here: http://pubdata.theorchromo.ru/peptideshaker/

    The search options and FASTA file were taken from the link provided by Chris, except that the auto-tuning of search parameters was turned off. IdentiPy filters identifications for this tuning, but for that we need to add support for the PeptideShaker decoy patterns. I've looked through the FASTA file and found three types of decoy patterns:

    sp|W5XKT8_REVERSED|SACA6_HUMAN Sperm acrosome membrane-associated protein 6 OS=Homo sapiens OX=9606 GN=SPACA6 PE=1 SV=2-REVERSED

    tr|X6R8D5_REVERSED|X6R8D5_REVERSED_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=1 SV=2

    sp|CONT_CAS2_BOVIN_REVERSED|

    So, the common rule is splitting by "|" and adding the "_REVERSED" suffix to the second part, right?

    Regards, Mark
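
The rule proposed above can be written down as a short sketch: split the header on "|" and test whether the second field ends with "_REVERSED". It assumes UniProt-style headers like the three quoted examples and makes no claim about other header formats. The function name and the target header used for contrast are made up for illustration.

```python
def is_decoy_by_accession(header, tag="_REVERSED"):
    """Apply the proposed rule: the second '|'-separated field ends with the tag."""
    parts = header.split("|")
    return len(parts) > 1 and parts[1].endswith(tag)


# The three decoy headers quoted above.
DECOY_HEADERS = [
    "sp|W5XKT8_REVERSED|SACA6_HUMAN Sperm acrosome membrane-associated protein 6 "
    "OS=Homo sapiens OX=9606 GN=SPACA6 PE=1 SV=2-REVERSED",
    "tr|X6R8D5_REVERSED|X6R8D5_REVERSED_HUMAN Uncharacterized protein "
    "OS=Homo sapiens OX=9606 PE=1 SV=2",
    "sp|CONT_CAS2_BOVIN_REVERSED|",
]

# A made-up target header, for contrast only.
TARGET_HEADER = "sp|XXXXXX|EXAMPLE_HUMAN Example protein"

for header in DECOY_HEADERS + [TARGET_HEADER]:
    print(is_decoy_by_accession(header), header[:40])
```

A header without any "|" separator is reported as a target by this rule, so decoys in such databases would be missed.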

  5. Harald

    Hi Mark,

    Thanks for the update. I'm afraid we don't yet support mzML as input. I therefore cannot test your new pep.xml file in PeptideShaker. Would it be possible to recreate the file with mgf as input?

    So, the common rule is splitting by "|" and adding the "_REVERSED" suffix to the second part, right?

    Well, this depends heavily on the format of the FASTA header. You are correct in that "_REVERSED" is the suffix added to the accession number to indicate the decoys, but the accession number may not always be located at the same place.

    The two first examples above are both from UniProt, while the last is a modified UniProt entry as part of adding contaminants from cRAP (https://www.thegpm.org/crap). So here your suggested rule works. But sadly there is no universal format for the headers and thus there is no common rule for where the decoy tag will be located if using other databases.

    So I think the safest option would be to turn the auto-tuning of the search parameters off? This would also make the results from IdentiPy more directly comparable to the other search results supported by PeptideShaker, as most of these would come from SearchGUI (http://compomics.github.io/projects/searchgui.html) and all have the same (as far as possible) search settings.

    Best regards,

    Harald

  6. Mark Ivanov

    I've rerun the search using the mgf from Chris's GDrive and uploaded the results to the link above.

    As for the auto-tuning and fair comparison, that is debatable. For example, MS-GF+ chooses a scoring function on the fly depending on the dataset. I'm not sure that this is any fairer for such a comparison than our optimization. But anyway, the search above was done without auto-tuning.

    Mark

  7. Harald

    Hi Mark,

    I can now parse the mgf-based IdentiPy pep.xml file without any issues. The support for IdentiPy will be included in the next major release of PeptideShaker. Thanks for all the help!

    As for the debate about the fair comparison, I see your point. Is there any way to run the auto-tuning without relying on the detection of the decoys? There may be different kinds of FASTA headers inside the same FASTA file, so a single common rule for locating the accession numbers to check for the decoy tag will in many cases not work. Is there a way to make this more generic? Is it for example possible to auto-tune the parameters on the target database separately and then perform the search on the target-decoy database with these new parameters?

    BTW, have you considered adding the "search_engine_version" attribute to the "search_summary" in the pep.xml file? It makes it easier to know which version of an algorithm was used to generate the pep.xml file.

    Best regards,

    Harald

  8. Lev Levitsky repo owner

    I see two ways we can support the SearchGUI decoys in IdentiPy.

    Minimal support would be to let auto-tune recognize decoys generated by SearchGUI. This does not require us to replicate all of the corner cases of decoy generation and can be as simple as looking for REVERSED as a substring of the full unparsed FASTA header. Would that cover all cases? The exact substring to look for can be specified in settings.

    "Full" support would be to also allow IdentiPy to create decoys on-the-fly in the way that is recognized by PeptideShaker. Again, if we go for this option, we don't need to fully replicate the way SearchGUI handles different headers. We only need to make sure that PeptideShaker understands which entries are decoys. Do you think there is a universal way to achieve this?

  9. Harald

    Hi Lev,

    Minimal support would be to let auto-tune recognize decoys generated by SearchGUI. This does not require us to replicate all of the corner cases of decoy generation and can be as simple as looking for REVERSED as a substring of the full unparsed FASTA header. Would that cover all cases? The exact substring to look for can be specified in settings.

    I think this should work in almost all cases. The exception of course being if the user-provided FASTA file already contains headers including the decoy tag. This should be very rare though. Especially if also including the underscore, i.e. "_REVERSED". But there are no guarantees.

    Perhaps an extra test could be added to check whether this simple approach detects 50% decoys, and to provide a warning if it does not? This assumes a 50-50 split between targets and decoys, which I think is a fair assumption?

    "Full" support would be to also allow IdentiPy to create decoys on-the-fly in the way that is recognized by PeptideShaker. Again, if we go for this option, we don't need to fully replicate the way SearchGUI handles different headers. We only need to make sure that PeptideShaker understands which entries are decoys. Do you think there is a universal way to achieve this?

    Well, this is what we generally try to avoid by adding the decoy headers in SearchGUI, as the individual algorithms may all add/create decoys differently, and we need to be able to unify all the results when loading them into PeptideShaker. There may be ways around this, but it could involve major changes in the way we interact with the other algorithms supported in SearchGUI/PeptideShaker. So I'd rather avoid this if possible.

    Best regards,

    Harald
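
The two ideas discussed above, matching a configurable decoy tag anywhere in the unparsed header and warning when the decoy share is not close to 50%, can be sketched as follows. This is not IdentiPy's implementation. The function names, the default tag and the 5% tolerance are assumptions chosen for illustration.

```python
import warnings


def count_decoys(headers, tag="_REVERSED"):
    """Count headers containing the decoy tag anywhere in the unparsed string."""
    return sum(tag in header for header in headers)


def check_decoy_fraction(headers, tag="_REVERSED", expected=0.5, tolerance=0.05):
    """Return the decoy fraction; warn if it is far from the expected split.

    The expected 50-50 split follows the discussion above; the tolerance
    value is an arbitrary assumption.
    """
    headers = list(headers)
    if not headers:
        raise ValueError("no FASTA headers given")
    fraction = count_decoys(headers, tag) / len(headers)
    if abs(fraction - expected) > tolerance:
        warnings.warn(
            f"decoy tag {tag!r} matched {fraction:.1%} of the headers, "
            f"expected about {expected:.0%}"
        )
    return fraction


# Made-up headers for illustration: two targets and their two decoys.
EXAMPLE_HEADERS = [
    "sp|AAAAAA|PROTA_HUMAN Example protein A",
    "sp|AAAAAA_REVERSED|PROTA_HUMAN Example protein A-REVERSED",
    "generic_header_without_pipes protein B",
    "generic_header_without_pipes_REVERSED protein B",
]

print(check_decoy_fraction(EXAMPLE_HEADERS))
```

Because the tag is matched against the whole header, the check does not depend on where the accession number sits, and the warning covers the rare case of a user-provided FASTA file whose target headers already contain the tag.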
