Overview of Molecule curation
Textpresso aided molecule curation
Extract drug terms and associated sentences from papers, and automate the population of postgres with the extracted information. Extract drug terms using the following rules:
- only from sections: results, conclusion, discussion, non-sectioned
- use filters: -review[type]-anesthetized-anaesthetized-anesthetised-anaesthetised-immobilized-immobilised-receptor[sentence]
- Download drug (CHEMICAL) list from http://ctd.mdibl.org/downloads/;jsessionid=EB1439FEB0CFB6FBF5977BB176141A41#allchems
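The extraction rules above can be sketched as a sentence-level filter. This is only an illustration, not Textpresso's query engine: the function name `keep_sentence`, the section labels, and the reading of the filter string (skip review papers, skip any sentence containing one of the listed words) are assumptions.

```python
import re

# Sections from which drug terms may be extracted (taken from the rules above).
ALLOWED_SECTIONS = {"results", "conclusion", "discussion", "non-sectioned"}

# Words that disqualify a sentence (assumed reading of the filter string).
EXCLUDED_WORDS = {
    "anesthetized", "anaesthetized", "anesthetised", "anaesthetised",
    "immobilized", "immobilised", "receptor",
}


def keep_sentence(sentence, section, paper_type):
    """Return True if a sentence passes the paper-type, section, and word filters."""
    if paper_type.lower() == "review":
        return False
    if section.lower() not in ALLOWED_SECTIONS:
        return False
    # Whole-word comparison only: "receptors" would not match "receptor".
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return not (words & EXCLUDED_WORDS)
```

Whether the real filters match whole words or substrings is not stated here, so the matching rule may need adjusting.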
Using a list of terms from the CTD, use Textpresso to flag papers containing the term or a synonym of the term, and extract the relevant sentence.
- Map the drug term to a MeSH ID (need to use both column 1 and column 7 terms from the TSV file).
  - If no match, create a list of drug terms with no MeSH ID.
- Match the MeSH ID with the Molecule postgres MeSH ID.
  - If no match, enter:
    - MeSH ID -> MeSH/CTD or default
    - Textpresso term name -> Public Name
    - Textpresso term name in a different case -> synonym
    - curator -> Michael
  - If a MeSH ID match is found in postgres:
    - Match the drug (case sensitive) with the Postgres Public name; if no match, enter the term as a synonym (pipe separated).
    - Match the WBPaperID with the WBPaper entries associated with the molecule term; if no match, check the WBPaper false field; if no match in WBPaper false, enter the paperID.
- Enter sentence(s) in the Textpresso sentence field (need to create).
- For each matched MeSH ID, enter synonyms in the synonym field.
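The matching rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual population script: a plain dictionary stands in for the postgres Molecule tables, the record keys and the function name `record_hit` are hypothetical, and skipping the sentence for papers already listed in WBPaper false is one possible reading of the rules.

```python
def record_hit(molecules, unmatched, term, mesh_id, paper_id, sentence,
               curator="Michael"):
    """Apply the matching rules to one Textpresso hit.

    `molecules` maps MeSH ID -> record dict (a stand-in for postgres);
    `unmatched` collects drug terms that have no MeSH ID.
    """
    if mesh_id is None:
        unmatched.append(term)
        return None

    record = molecules.get(mesh_id)
    if record is None:
        # No MeSH ID match in postgres: create a new entry.
        record = {
            "mesh_id": mesh_id,
            "public_name": term,
            "synonyms": [],
            "papers": [],
            "papers_false": [],
            "sentences": [],
            "curator": curator,
        }
        molecules[mesh_id] = record
    elif term != record["public_name"] and term not in record["synonyms"]:
        # Case-sensitive mismatch with the public name: keep as a synonym.
        record["synonyms"].append(term)

    if paper_id in record["papers_false"]:
        # Assumption: papers flagged as false positives are left untouched.
        return record
    if paper_id not in record["papers"]:
        record["papers"].append(paper_id)
    record["sentences"].append(sentence)
    return record
```

Synonyms are kept as a list here; joining them with `"|".join(record["synonyms"])` gives the pipe-separated form described above.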
Change to the OA ->add WBPaper false field
Change to the Molecule dumper ->Don't dump Michael curator fields
cron job to grab new ids/terms from ctd?
Problem for some drugs:
- levamisole used for slide mounting -> actually this can probably be OK
- sucrose, as in sucrose flotation and sucrose gradient: can we create an exclusion list?
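One way an exclusion list could work is sketched below. The `EXCLUSIONS` table and the function name `is_excluded` are hypothetical; the two sucrose phrases are the ones named above, and any further entries would need curator input.

```python
import re

# Hypothetical exclusion list: term -> phrases in which the term names a
# technique rather than a drug treatment.
EXCLUSIONS = {
    "sucrose": ["sucrose flotation", "sucrose gradient"],
}


def is_excluded(term, sentence):
    """Return True if no mention of `term` remains once excluded phrases are removed."""
    text = sentence.lower()
    term = term.lower()
    for phrase in EXCLUSIONS.get(term, []):
        text = text.replace(phrase, " ")
    return re.search(r"\b" + re.escape(term) + r"\b", text) is None
```

A sentence that mentions the term both inside and outside an excluded phrase is kept, since the second mention may still describe a treatment.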
- pgid -autoassigned
- WBPaper -multiontology, does not need to be filled in
- Public name -big text field, common name used in paper '''make required'''
- Synonyms -big text field, any other names associated with the molecule
- Molecule use -big text field, enter its use, make sure to have a source noted in WBPaper or include a pers comm. in the field
- MeSH/CTD or default -molecules in acedb should be stored with their MeSH UID.
- when no MeSH ID is available, a WBMol ID is assigned (based on the pgid), which will eventually get replaced by the MeSH ID when available. Therefore periodic checks of the WBMolecules against other databases must be scheduled.
- Initially these WBMolIDs were only to be kept internally in Postgres but with the addition of SMMID's to our curation pipeline, molecules with links to SMMID will be collected that do not have MeSH IDs available and will need to be pushed to the website without them.
- Also, molecules entered through XREFs are not suppressed if they only have a WBMol ID, so these molecules will be on the site in relation to these other classes.
- When MeSH IDs are available, WBMol ID's need to be kept as synonyms so the molecule to phenotype mapping can be maintained.
- These names are used by scripts to construct URLs for the MeSH and CTD database links. While WBMol IDs were suppressed, URLs to these databases were not pushed out for them. Now that WBMol IDs are no longer suppressed, incorrect URLs will be made and will have to be suppressed somehow. See jobs for molecule dumper on this page: molecule .ace dumper change made 7/7/11
- If a molecule needs to be created, it is given a WBMol ID, which will need to be used as a synonym once a bona fide MeSH ID is available, e.g. move the WBMol ID from the default field to the synonym field. Talk to the web team about only displaying the public name.
- CasRN -ID text field; assigned by CAS, can be used in multiple chemical databases including ChEMIDplus. A ChEMIDplus db line is generated using the CasRN in the .ace file.
- ChEBI_ID -text field; ID for ChEBI db
- Kegg compound -text field; accession number for KEGG compound db
- SMMID -text field; to link to SMMID DB -also need to change the dumper
- Curator - single ontology
- Remark -big text, not dumped
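The placeholder ID handling described in the MeSH/CTD or default notes above can be sketched as follows. The ID format produced by `wbmol_id`, the record keys, and both function names are hypothetical; only the rule itself (assign from the pgid, then keep the old ID as a synonym once a MeSH ID exists) comes from the notes.

```python
def wbmol_id(pgid):
    """Build a placeholder ID from the pgid (the exact format is an assumption)."""
    return "WBMol:%08d" % pgid


def promote_to_mesh(record, mesh_id):
    """Replace a placeholder WBMol ID with a MeSH ID, keeping the old ID as a synonym."""
    old_id = record["name"]
    if old_id.startswith("WBMol") and old_id not in record["synonyms"]:
        record["synonyms"].append(old_id)
    record["name"] = mesh_id
    return record
```

Keeping the old ID in the synonym list is what lets existing molecule-to-phenotype annotations still resolve after the rename.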
Two files need to be supplied for each upload:
- Molecule.ace made from karen/Molecule/dump_molecule_ace.pl
- database.ace static .ace also in karen/Molecule : this file contains all the database info for the chemical/small molecule database links, it needs to be manually edited whenever there are changes to the databases associated with molecule data.
During each upload a molecule.ace file will be made in citace by Wen. This file will contain all the molecule cross references from within the RNAi and Variation Phenotype curation, merging them with the molecule data from the molecule list.
What we mean by small molecule
- metabolite (primary and secondary)
- monomers or very small oligomers of nucleic acids, proteins, and polysaccharides
- "Large collections of small molecules (molecular weight about 600 or less), of similar or diverse nature which are used for high-throughput screening analysis of the gene function, protein interaction, cellular processing, biochemical pathways, or other chemical interactions." (from nlm.nih.gov and wikipedia)
- metabolites: precursors, intermediates, or end products of a metabolic pathway
- monomeric or very small oligomeric nucleic acids (not RNAi primers), e.g. ATP, ADP, cAMP, GTP, trinucleotide repeats??
- minerals, ions, salts
 ?Molecule Name ?Text
           Public_name ?Text
           Synonym ?Text
           DB_info Database ?Database ?Database_field ?Accession_number
           Gene_regulation Gene_regulator ?Gene_regulation XREF Molecule_regulator
           Affects_phenotype_of Variation ?Variation #Evidence
                                Strain ?Strain #Evidence
                                Transgene ?Transgene #Evidence
                                RNAi ?RNAi #Evidence

Corresponding changes in touched models

 ?Phenotype_info Affected_by Molecule ?Molecule #Evidence

 ?Gene_regulation Regulator Molecule_regulator ?Molecule XREF Gene_regulator #Boolean
- Name-> MeSH UID
- when no MeSH ID is available, a WBMol ID is assigned, which is based on the pgid.
- Name changes will also need to be conveyed to other curators so they can update the name in their curation pipelines and retain the cross-links from their small molecule curation attributes.
- Public name -> common name in elegans literature
- Synonym -> other names, how do we mine these from other DBs?
- DB_info -> links to the entity in other databases; add the following databases to database.ace
Molecules will be linked to genes based on their influence on gene activity altered by variation, overexpression, and RNAi-based knockdown.
Molecules will also be linked to genes through their influence on gene activity directly through gene regulation interactions.
Molecule IDs will be provided, when available, for the following databases:
- Database "NLM_MeSH" "UID"
- Database "CTD" "ChemicalID"
- Database "ChemIDplus" ''using the CasRN''
- Database "ChEBI" "CHEBI_ID"
- Database "KEGG COMPOUND" "ACCESSION_NUMBER"
- Database "SMMID DB" ''will need to add a field to capture their ID''
Initially, we will be using MeSH UIDs, assigned by the NLM, as IDs for the molecules in our database. Because the NLM has more comprehensive coverage of molecules and is more stably funded, it was thought to be a good starting point for this project. The list we are starting with is a pared-down list of molecules from the NLM, created by the Comparative Toxicogenomics Database (CTD); it contains over 130,000 terms. For each term, this list contains a term name, CTD ID, MeSH UID, and, where available, a CAS Registry Number. Using the CasRNs, we extracted the ChEBI ID from the Chemical Entities of Biological Interest (ChEBI) entity list, where one existed, along with any KEGG Compound accession number.
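The CasRN-based lookup described above amounts to a join on the CAS Registry Number. The sketch below assumes the CTD terms and the ChEBI and KEGG lookups have already been loaded into Python structures; the function name, the dictionary keys, and the lookup tables are hypothetical and do not reflect the actual file formats.

```python
def add_cross_references(ctd_terms, chebi_by_casrn, kegg_by_casrn):
    """Attach ChEBI and KEGG IDs to CTD terms that carry a CAS Registry Number."""
    merged = []
    for term in ctd_terms:
        row = dict(term)  # copy so the input rows are not modified
        casrn = term.get("casrn")
        # Terms without a CasRN cannot be joined and get no cross-references.
        row["chebi_id"] = chebi_by_casrn.get(casrn) if casrn else None
        row["kegg_id"] = kegg_by_casrn.get(casrn) if casrn else None
        merged.append(row)
    return merged
```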
A sample molecule.ace record:
Molecule : "C009687"
Database "NLM_MeSH" "UID" "C009687"
Database "CTD" "ChemicalID" "C009687"
Database "ChemIDplus" "19545-26-7"
Database "ChEBI" "CHEBI_ID" "52289"
Database "KEGG COMPOUND" "ACCESSION_NUMBER" "C15181"
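A record like the sample above could be produced from a molecule's stored fields as sketched here. This is not the logic of dump_molecule_ace.pl: the function name and record keys are hypothetical, and skipping the MeSH and CTD lines for names that start with "WBMol" is only one possible way to avoid the incorrect URLs noted earlier.

```python
def molecule_ace(record):
    """Format one molecule as an .ace record, following the sample layout."""
    name = record["name"]
    lines = ['Molecule : "%s"' % name]
    # Assumption: names starting with "WBMol" are placeholders, not MeSH UIDs,
    # so no MeSH or CTD lines are written for them.
    if not name.startswith("WBMol"):
        lines.append('Database "NLM_MeSH" "UID" "%s"' % name)
        lines.append('Database "CTD" "ChemicalID" "%s"' % name)
    if record.get("casrn"):
        lines.append('Database "ChemIDplus" "%s"' % record["casrn"])
    if record.get("chebi_id"):
        lines.append('Database "ChEBI" "CHEBI_ID" "%s"' % record["chebi_id"])
    if record.get("kegg_id"):
        lines.append(
            'Database "KEGG COMPOUND" "ACCESSION_NUMBER" "%s"' % record["kegg_id"]
        )
    return "\n".join(lines)
```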
To make a working list of reference molecules for the various curation efforts, we used Textpresso to scan for all terms on the list that have been published in the C. elegans corpus. The resulting list has fewer than 6000 terms. The terms that have been identified in the corpus are available here:
http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/ ''This is a directory of files of terms based on the number of times the term appears in the corpus.''
http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/all ''This is a list of all terms from the previous files concatenated into one.''
This last file is being used as a starting file for molecule look-up by WB curators.
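The frequency files were produced with Textpresso; the idea behind them can be illustrated with a plain-Python count over a list of sentences. The function name and the whole-term, case-insensitive matching rule are assumptions and do not describe how Textpresso itself matches terms.

```python
import re
from collections import Counter


def term_frequencies(terms, sentences):
    """Count case-insensitive whole-term occurrences of each term in the sentences."""
    lowered = [s.lower() for s in sentences]
    counts = Counter()
    for term in terms:
        # Lookarounds keep "sucrose" from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)")
        n = sum(len(pattern.findall(s)) for s in lowered)
        if n:
            counts[term] = n  # terms never seen in the corpus are dropped
    return counts
```

Calling `most_common()` on the result lists terms from most to least frequent, which mirrors the by-frequency grouping of the files above.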
Caveats and notes:
- The list is now small enough that if we wanted to load it into WB at least we know that every term has some relevance to the literature (although unverified).
- The list is small enough to be amenable to editing through ontology editors like OBOedit (even though it is not an ontology).
- We do not have definitions of the terms, nor are the terms arranged in any hierarchical manner; however other databases do, and we provide links to those websites if an ID is available.
- Terms and synonyms of terms will be added as needed. This curation effort still needs to be worked out; ideally the list will be incorporated as a selection list for whatever curation tool a curator is using.
[[Category:Curation]] [[Category:User Guide]]