File Formats

PTMOracle file

PTMOracle requires all protein data to be formatted as either an XML-based file or a tab-separated values (TSV) file.

For the XML-based file format, all data for a given protein node are enclosed within the <node> and </node> tags, with the node's unique identifier given as a key-value pair (the id attribute). Each item of protein data, such as the protein sequence, a PTM or a sequence annotation (including, but not limited to, domains, motifs and disordered regions), is enclosed within its own <property> and </property> tags; in the example, the <property> elements of a node are grouped inside <properties> and </properties> tags. The details of each item are represented either as key-value pairs (e.g. description and type) or in separate child tags (e.g. positionOnProtein, annotationStatus and additionalComments). A hypothetical example showing the XML-based format for 2 protein nodes is shown below:

#!xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PTMOracle source="Uniprot PTMs" version="1.0">
    <node id="P06777">
        <properties>
            <property description="ERCC4" type="Domain">
                <positionOnProtein startPos="824" endPos="966" residue="-"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Pfam</additionalComments>
            </property>
            <property description="Phosphoserine" type="PTM">
                <positionOnProtein startPos="1071" endPos="1071" residue="S"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Uniprot</additionalComments>
            </property>
            <property description="MSQLFYQGDSDDELQEELTRQTTQASQSSKIKNEDEPDDSNHLNEVENEDSKVLDDDAVLY" type="Sequence">
                <positionOnProtein startPos="1" endPos="61" residue="-"/>
                <annotationStatus>Good</annotationStatus>
            </property>
        </properties>
    </node>
    <node id="P32628">
        <properties>
            <property description="Phosphoserine" type="PTM">
                <positionOnProtein startPos="121" endPos="121" residue="S"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Uniprot</additionalComments>
            </property>
            <property description="ubiquitin" type="Domain">
                <positionOnProtein startPos="4" endPos="75" residue="-"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Pfam</additionalComments>
            </property>
            <property description="MVSLTFKNFKKEKVPLDLEPSNTILETKTKLAQSISCEESQIKLIYSGKVLQDSKTVSECGLK" type="Sequence">
                <positionOnProtein startPos="1" endPos="63" residue="-"/>
                <additionalComments>Uniprot</additionalComments>
            </property>
        </properties>
    </node>
</PTMOracle>
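
The XML layout above can be read with a few lines of Python. The following is a minimal sketch using only the standard library, not part of PTMOracle itself: the function name `read_ptmoracle_xml` and the dictionary keys are hypothetical, and it assumes that `annotationStatus` and `additionalComments` may be absent (as in the Sequence properties of the example), in which case an empty string is stored.

```python
import xml.etree.ElementTree as ET

# First node of the hypothetical example above, copied verbatim.
EXAMPLE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PTMOracle source="Uniprot PTMs" version="1.0">
    <node id="P06777">
        <properties>
            <property description="ERCC4" type="Domain">
                <positionOnProtein startPos="824" endPos="966" residue="-"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Pfam</additionalComments>
            </property>
            <property description="Phosphoserine" type="PTM">
                <positionOnProtein startPos="1071" endPos="1071" residue="S"/>
                <annotationStatus>Good</annotationStatus>
                <additionalComments>Uniprot</additionalComments>
            </property>
            <property description="MSQLFYQGDSDDELQEELTRQTTQASQSSKIKNEDEPDDSNHLNEVENEDSKVLDDDAVLY" type="Sequence">
                <positionOnProtein startPos="1" endPos="61" residue="-"/>
                <annotationStatus>Good</annotationStatus>
            </property>
        </properties>
    </node>
</PTMOracle>
"""


def read_ptmoracle_xml(xml_bytes):
    """Return {node id: [annotation dict, ...]} (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    nodes = {}
    for node in root.iter("node"):
        annotations = []
        for prop in node.iter("property"):
            pos = prop.find("positionOnProtein")
            annotations.append({
                # Key-value pairs on the <property> tag itself
                "type": prop.get("type"),
                "description": prop.get("description"),
                # Key-value pairs on the <positionOnProtein> child tag
                "start": int(pos.get("startPos")),
                "end": int(pos.get("endPos")),
                "residue": pos.get("residue"),
                # Child tags assumed optional; missing ones become ""
                "status": prop.findtext("annotationStatus", default=""),
                "comments": prop.findtext("additionalComments", default=""),
            })
        nodes[node.get("id")] = annotations
    return nodes


nodes = read_ptmoracle_xml(EXAMPLE_XML.encode("utf-8"))
for annotation in nodes["P06777"]:
    print(annotation["type"], annotation["start"], annotation["end"])
```

Passing the document as bytes lets the parser honour the encoding named in the XML declaration.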

For the TSV file, each row corresponds to an individual protein annotation, such as a PTM, domain, motif or disordered region associated with a given protein node. Briefly, the first column contains the identifier of the protein node, whereas the remaining 7 columns contain the details of the annotation (e.g. its start and end positions with respect to the protein sequence, and the amino acid residue). In the example, these 7 columns are, in order: annotation type, description, start position, end position, residue, annotation status and additional comments. A hypothetical example showing the TSV file format for the same 2 protein nodes is shown below:

#!TSV
P06777  Domain  ERCC4   824 966 -   Good    Pfam
P06777  PTM Phosphoserine   1071    1071    S   Good    Uniprot
P06777  Sequence    MSQLFYQGDSDDELQEELTRQTTQASQSSKIKNEDEPDDSNHLNEVENEDSKVLDDDAVLY   1   61  -   Good    
P32628  PTM Phosphoserine   121 121 S   Good    Uniprot
P32628  Domain  ubiquitin   4   75  -   Good    Pfam
P32628  Sequence    MVSLTFKNFKKEKVPLDLEPSNTILETKTKLAQSISCEESQIKLIYSGKVLQDSKTVSECGLK 1   63  -       Uniprot
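
The TSV rows above can be read in a similar way. This is a minimal sketch using Python's standard `csv` module, not part of PTMOracle itself: the function name `read_ptmoracle_tsv` and the column labels in `COLUMNS` are hypothetical names chosen for illustration. It assumes the file has no header row (as in the example), that fields are separated by real tab characters, and that empty status or comment fields are allowed.

```python
import csv
import io

# Hypothetical labels; the column order follows the example rows above.
COLUMNS = ["node_id", "type", "description", "start", "end",
           "residue", "status", "comments"]

# Three rows from the example above, written with explicit tab separators.
EXAMPLE_TSV = (
    "P06777\tDomain\tERCC4\t824\t966\t-\tGood\tPfam\n"
    "P06777\tPTM\tPhosphoserine\t1071\t1071\tS\tGood\tUniprot\n"
    "P32628\tSequence\tMVSLTFKNFKKEKVPLDLEPSNTILETKTKLAQSISCEESQIKLIYSGKVLQDSKTVSECGLK"
    "\t1\t63\t-\t\tUniprot\n"
)


def read_ptmoracle_tsv(handle):
    """Return one dict per annotation row (hypothetical helper)."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so that missing trailing columns become ""
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


rows = read_ptmoracle_tsv(io.StringIO(EXAMPLE_TSV))
for row in rows:
    print(row["node_id"], row["type"], row["start"], row["end"])
```

For a file on disk, the same function can be given an open file handle, for example `open(path, newline="")`, where `path` is whatever location the TSV file was saved to.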
