File Formats
Contents:
PTMOracle file
PTMOracle requires all protein data to be formatted into either an XML-based or tab-separated (TSV) file.
For the XML-based file format, all protein data for a given protein node is enclosed within the <node> and </node> tags, with the unique identifier given as a key-value pair (the id attribute). Protein data such as the protein sequence, PTMs and sequence annotations (including, but not limited to, domains, motifs and disordered regions) are each enclosed by <property> and </property> tags. The details of each annotation are represented either as key-value pairs (e.g. description and type) or in separate tags (e.g. positionOnProtein, annotationStatus and additionalComments). A hypothetical example showing the XML-based format for 2 protein nodes is shown below:
    #!xml
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <PTMOracle source="Uniprot PTMs" version="1.0">
      <node id="P06777">
        <properties>
          <property description="ERCC4" type="Domain">
            <positionOnProtein startPos="824" endPos="966" residue="-"/>
            <annotationStatus>Good</annotationStatus>
            <additionalComments>Pfam</additionalComments>
          </property>
          <property description="Phosphoserine" type="PTM">
            <positionOnProtein startPos="1071" endPos="1071" residue="S"/>
            <annotationStatus>Good</annotationStatus>
            <additionalComments>Uniprot</additionalComments>
          </property>
          <property description="MSQLFYQGDSDDELQEELTRQTTQASQSSKIKNEDEPDDSNHLNEVENEDSKVLDDDAVLY" type="Sequence">
            <positionOnProtein startPos="1" endPos="61" residue="-"/>
            <annotationStatus>Good</annotationStatus>
          </property>
        </properties>
      </node>
      <node id="P32628">
        <properties>
          <property description="Phosphoserine" type="PTM">
            <positionOnProtein startPos="121" endPos="121" residue="S"/>
            <annotationStatus>Good</annotationStatus>
            <additionalComments>Uniprot</additionalComments>
          </property>
          <property description="ubiquitin" type="Domain">
            <positionOnProtein startPos="4" endPos="75" residue="-"/>
            <annotationStatus>Good</annotationStatus>
            <additionalComments>Pfam</additionalComments>
          </property>
          <property description="MVSLTFKNFKKEKVPLDLEPSNTILETKTKLAQSISCEESQIKLIYSGKVLQDSKTVSECGLK" type="Sequence">
            <positionOnProtein startPos="1" endPos="63" residue="-"/>
            <additionalComments>Uniprot</additionalComments>
          </property>
        </properties>
      </node>
    </PTMOracle>
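
To check that a file follows the XML layout described above before loading it, the structure can be walked with a few lines of Python. This is only a minimal sketch using the standard library. It is not PTMOracle's own parser, and the function name `read_annotations` and the dictionary keys are hypothetical choices made for illustration. The embedded sample is a trimmed copy of the first node from the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the hypothetical example (one node, two properties).
EXAMPLE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PTMOracle source="Uniprot PTMs" version="1.0">
  <node id="P06777">
    <properties>
      <property description="ERCC4" type="Domain">
        <positionOnProtein startPos="824" endPos="966" residue="-"/>
        <annotationStatus>Good</annotationStatus>
        <additionalComments>Pfam</additionalComments>
      </property>
      <property description="Phosphoserine" type="PTM">
        <positionOnProtein startPos="1071" endPos="1071" residue="S"/>
        <annotationStatus>Good</annotationStatus>
        <additionalComments>Uniprot</additionalComments>
      </property>
    </properties>
  </node>
</PTMOracle>
"""


def read_annotations(xml_text):
    """Return one dict per <property>, tagged with its parent node id.

    Hypothetical helper: the key names are illustrative only.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for node in root.iter("node"):
        for prop in node.iter("property"):
            pos = prop.find("positionOnProtein")
            rows.append({
                "id": node.get("id"),
                "type": prop.get("type"),
                "description": prop.get("description"),
                "start": int(pos.get("startPos")),
                "end": int(pos.get("endPos")),
                "residue": pos.get("residue"),
                # Optional tags: fall back to an empty string when absent.
                "status": prop.findtext("annotationStatus", default=""),
                "comments": prop.findtext("additionalComments", default=""),
            })
    return rows


if __name__ == "__main__":
    for row in read_annotations(EXAMPLE_XML):
        print(row)
```

Optional tags such as annotationStatus and additionalComments are read with a default, because the example shows that either one may be omitted.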
For the TSV file, each row corresponds to an individual protein annotation, such as a PTM, domain, motif or disordered region associated with a given protein node. Briefly, the first column contains the identifier for the protein node, whereas the last 7 columns contain the details of the annotation. In the example below these are, in order, the annotation type, the description, the start and end positions of the annotation with respect to the protein sequence, the amino acid residue, the annotation status and any additional comments. A hypothetical example showing the TSV file format for the same 2 protein nodes is shown below:
    #!TSV
    P06777	Domain	ERCC4	824	966	-	Good	Pfam
    P06777	PTM	Phosphoserine	1071	1071	S	Good	Uniprot
    P06777	Sequence	MSQLFYQGDSDDELQEELTRQTTQASQSSKIKNEDEPDDSNHLNEVENEDSKVLDDDAVLY	1	61	-	Good
    P32628	PTM	Phosphoserine	121	121	S	Good	Uniprot
    P32628	Domain	ubiquitin	4	75	-	Good	Pfam
    P32628	Sequence	MVSLTFKNFKKEKVPLDLEPSNTILETKTKLAQSISCEESQIKLIYSGKVLQDSKTVSECGLK	1	63	-		Uniprot
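
The row layout can be checked in the same way with Python's `csv` module. The sketch below is an illustration rather than PTMOracle's own reader. The column names in `COLUMNS` and the function name `read_tsv` are hypothetical, and the column order is an assumption inferred from the example rows. Rows with fewer than 8 fields are padded with empty strings, on the assumption that trailing columns may be left out.

```python
import csv
import io

# Assumed column order, inferred from the example rows (names are hypothetical).
COLUMNS = ["id", "type", "description", "start", "end",
           "residue", "status", "comments"]

# Three rows taken from the hypothetical example.
EXAMPLE_TSV = (
    "P06777\tDomain\tERCC4\t824\t966\t-\tGood\tPfam\n"
    "P06777\tPTM\tPhosphoserine\t1071\t1071\tS\tGood\tUniprot\n"
    "P32628\tDomain\tubiquitin\t4\t75\t-\tGood\tPfam\n"
)


def read_tsv(text):
    """Parse tab-separated annotation rows into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so that every column name gets a value.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_tsv(EXAMPLE_TSV):
        print(row)
```

Because both formats carry the same fields, a reader like this yields the same kind of record as one built for the XML layout, which makes it straightforward to compare the two representations of a protein node.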