Overview of Molecule curation

molecule .ace dumper
Molecule on Caltech WBWiki

Molecule Curation

Textpresso aided molecule curation

Extract drug terms and their associated sentences from papers, and automate populating Postgres with the extracted information. Extract drug terms using the following rules:

Using a list of terms from the CTD, use Textpresso to flag papers containing a term or a synonym of a term, and extract the relevant sentence(s).
Map each drug term to a MeSH ID (need to use both the column 1 and column 7 terms from the tsp).
	if no match, add the term to a list of drug terms with no MeSH ID
Match the MeSH ID against the MeSH IDs in the Molecule Postgres tables
	if no match, enter MeSH ID ->MeSH/CTD or default
		Textpresso term name ->Public name
		Textpresso term name in a different case ->Synonym
		curator ->Michael
If a MeSH ID match is found in Postgres
	Match the drug term (case sensitive) against the Postgres Public name
		if no match, enter the term as a synonym (pipe separated)
	Match the WBPaper ID against the WBPaper entries associated with the molecule term
		if no match, check the WBPaper false field
			if no match in WBPaper false, enter the paper ID
	Enter the sentence(s) in the Textpresso sentence field (need to create)
For each matched MeSH ID, enter synonyms in the synonym field
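
The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual population script: the `MoleculeRow` class, the `apply_hit` function, and the in-memory `rows` dictionary are hypothetical stand-ins for the Postgres tables, and the term-to-MeSH lookup is assumed to be keyed on lower-cased term names. The sketch also assumes that a newly created entry receives the paper ID and sentence of the hit that created it, which the rules do not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class MoleculeRow:
    """Hypothetical in-memory stand-in for one molecule entry in Postgres."""
    mesh_id: str                                      # MeSH/CTD or default field
    public_name: str
    synonyms: list = field(default_factory=list)
    papers: set = field(default_factory=set)
    papers_false: set = field(default_factory=set)    # proposed WBPaper false field
    sentences: list = field(default_factory=list)     # proposed Textpresso sentence field
    curator: str = ""


def apply_hit(rows, no_mesh, term_to_mesh, term, paper_id, sentence):
    """Apply one Textpresso hit (term, paper, sentence) to the rows dict."""
    mesh_id = term_to_mesh.get(term.lower())  # assumption: lookup keyed on lower case
    if mesh_id is None:
        # No MeSH ID: remember the term for later review.
        if term not in no_mesh:
            no_mesh.append(term)
        return None
    row = rows.get(mesh_id)
    if row is None:
        # MeSH ID not yet in the molecule table: create a new entry.
        row = MoleculeRow(mesh_id=mesh_id, public_name=term, curator="Michael")
        rows[mesh_id] = row
    elif term != row.public_name and term not in row.synonyms:
        # Case-sensitive mismatch with the public name: store as a synonym.
        row.synonyms.append(term)
    # Only add the paper if it is neither linked nor marked as a false positive.
    if paper_id not in row.papers and paper_id not in row.papers_false:
        row.papers.add(paper_id)
    row.sentences.append(sentence)
    return row


def synonyms_field(row):
    """Pipe-separated synonym string, as the rules describe for storage."""
    return "|".join(row.synonyms)
```

Papers listed in the WBPaper false field are never re-added, so a curator's rejection of a paper survives later automated runs.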
 

Change to the OA ->add a WBPaper false field
Change to the Molecule dumper ->don't dump Michael curator fields
Cron job to grab new IDs/terms from the CTD?

Problem for some drugs:
levamisole for slide mounting -> actually this can probably be ok
sucrose, as in sucrose flotation and sucrose gradient -> can we create an exclusion list?
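
One possible shape for such an exclusion list is sketched below. The `EXCLUDED_PHRASES` table and the `keep_sentence` function are hypothetical names, and the only phrases listed are the two sucrose examples mentioned above. A sentence is kept only if the term still occurs after the excluded phrases have been blanked out.

```python
import re

# Hypothetical exclusion list: term -> phrases that are method mentions,
# not drug treatments. Only the examples named in these notes are included.
EXCLUDED_PHRASES = {
    "sucrose": ["sucrose flotation", "sucrose gradient"],
}


def keep_sentence(term, sentence, excluded=EXCLUDED_PHRASES):
    """Return True if the term occurs outside of any excluded phrase."""
    text = sentence.lower()
    for phrase in excluded.get(term.lower(), []):
        text = text.replace(phrase, " ")
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text) is not None
```

A sentence that mentions both a sucrose gradient and a genuine sucrose treatment is still kept, because the second mention survives the blanking step.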

Molecule OA

One tab

  • pgid -autoassigned
  • WBPaper -multiontology, does not need to be filled in
  • Public name -big text field, common name used in paper '''make required'''
  • Synonyms -big text field, any other names associated with the molecule
  • Molecule use -big text field, enter its use, make sure to have a source noted in WBPaper or include a pers comm. in the field
  • MeSH/CTD or default -molecules in acedb should be stored with their MeSH UID.
    • when no MeSH ID is available, a WBMol ID is assigned (based on the pgid), which will eventually be replaced by the MeSH ID when one becomes available. Periodic checks of the WBMolecules against other databases must therefore be scheduled.
    • Initially these WBMol IDs were to be kept only internally in Postgres, but with the addition of SMMIDs to our curation pipeline, we will collect molecules that link to an SMMID but have no MeSH ID available; these will need to be pushed to the website without one.
    • Also, molecules entered through XREFs are not suppressed if they only have a WBMol ID, so these molecules will be on the site in relation to these other classes.
    • When MeSH IDs become available, WBMol IDs need to be kept as synonyms so the molecule-to-phenotype mapping can be maintained.
    • These names are scripted to be used in URL construction for the MeSH and CTD database links. While WBMol IDs were suppressed, URLs to these databases were not pushed out for them. Now that WBMol IDs are not suppressed, incorrect URLs will be made and will have to be suppressed somehow. See jobs for molecule dumper on this page: molecule .ace dumper change made 7/7/11
  • If a molecule needs to be created it is given a WBMol ID, which will need to be kept as a synonym once a bona fide MeSH ID is available, e.g. move the WBMol ID from the default field to the synonym field. Talk to the web team about only displaying the public name.
  • CasRN -ID text field; assigned by CAS, can be used in multiple chemical databases including ChemIDplus. A ChemIDplus db line is generated using the CasRN in the .ace file.
  • ChEBI_ID -text field; ID for ChEBI db
  • Kegg compound -text field; accession number for KEGG compound db
  • SMMID -text field; to link to SMMID DB -also need to change the dumper
  • Curator - single ontology
  • Remark -big text, not dumped
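
The step of moving a WBMol ID from the default field to the synonym field once a MeSH ID becomes available could look roughly like the sketch below. The record layout (a dict with `name` and `synonyms` keys), the function name `promote_to_mesh`, and the `WBMol:00000123` ID format in the usage note are assumptions made for illustration only.

```python
def promote_to_mesh(record, mesh_id):
    """Return a copy of the record with the MeSH ID as its name.

    If the old name was a WBMol ID, it is kept as a synonym so that
    existing molecule-to-phenotype links can still be resolved.
    """
    updated = dict(record)
    updated["synonyms"] = list(record.get("synonyms", []))
    old_name = record["name"]
    if old_name != mesh_id:
        if old_name.startswith("WBMol") and old_name not in updated["synonyms"]:
            updated["synonyms"].append(old_name)
        updated["name"] = mesh_id
    return updated
```

For example, promoting a record named `WBMol:00000123` to `C009687` yields a record named `C009687` whose synonyms include `WBMol:00000123`. The input record is left unchanged.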

Molecule Upload

Two files need to be supplied for each upload:

  • Molecule.ace made from karen/Molecule/dump_molecule_ace.pl
  • database.ace, a static .ace file also in karen/Molecule: this file contains all the database info for the chemical/small molecule database links. It needs to be manually edited whenever there are changes to the databases associated with molecule data.

During each upload a molecule.ace file will be made in citace by Wen. This file will contain all the molecule cross references from within the RNAi and Variation Phenotype curation, merging them with the molecule data from the molecule list.

What we mean by small molecule

  • drug
  • metabolite (primary and secondary)
  • monomers or very small oligomers of nucleic acids, proteins, and polysaccharides
  • "Large collections of small molecules (molecular weight about 600 or less), of similar or diverse nature which are used for high-throughput screening analysis of the gene function, protein interaction, cellular processing, biochemical pathways, or other chemical interactions." (from nlm.nih.gov and wikipedia)

Approved model

?Molecule

  • metabolites: precursors, intermediates, or end products of a metabolic pathway
  • monomeric or very small oligomeric nucleic acids (not RNAi primers), e.g. ATP, ADP, cAMP, GTP, trinucleotide repeats??
  • chemicals/drugs
  • minerals, ions, salts
?Molecule  Name ?Text
           Public_name ?Text
           Synonym ?Text
           DB_info Database ?Database ?Database_field ?Accession_number
           Gene_regulation Gene_regulator ?Gene_regulation XREF Molecule_regulator
           Affects_phenotype_of  Variation ?Variation #Evidence
                                 Strain ?Strain #Evidence
                                 Transgene ?Transgene #Evidence
                                 RNAi ?RNAi #Evidence

Corresponding changes in touched models
 ?Phenotype_info    Affected_by  Molecule  ?Molecule    #Evidence
 ?Gene_regulation  Regulator Molecule_regulator   ?Molecule  XREF  Gene_regulator  #Boolean 

Model elements

  • Name-> MeSH UID
    • when no MeSH ID is available, a WBMol ID is assigned, which is based on the pgid.
    • Changes in names will also need to be conveyed to other curators so they can update the name in their curation pipelines and retain the cross-links from their small molecule curation attributes.
  • Public name -> common name in C. elegans literature
  • Synonym -> other names, how do we mine these from other DBs?
  • DB_info -> links to the entity in other databases; add the following databases to database.ace

Molecule Curation

Drug-phenotype curation

Molecules will be linked to genes based on their influence on gene activity altered by variation, overexpression, and RNAi-based knockdown.

Drug-gene interactions

Molecules will also be linked to genes through their influence on gene activity directly through gene regulation interactions.

Molecule databases

Molecule IDs will be provided, when available, for the following databases:

  • Database "NLM_MeSH" "UID"
  • Database "CTD" "ChemicalID"
  • Database "ChemIDplus" ''using the CasRN''
  • Database "ChEBI" "CHEBI_ID"
  • Database "KEGG COMPOUND" "ACCESSION_NUMBER"
  • Database "SMMID DB" ''will need to add a field to capture their ID''

Molecule list

Initially, we will be using MeSH UIDs, assigned by the NLM, as IDs for the molecules in our database. Because the NLM's coverage of molecules is more comprehensive and its funding is more stable, this source was thought to be a good starting point for this project. The list we are starting with is a pared-down list of NLM molecules created by the Comparative Toxicogenomics Database (CTD); it contains over 130,000 terms. For each term, the list gives a term name, CTD ID, MeSH UID and, where available, a CAS Registry Number (CasRN). Using the CasRNs, we extracted the ChEBI ID from the Chemical Entities of Biological Interest (ChEBI) database entity list, where one existed, along with any KEGG Compound accession number.
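
The CasRN-based lookup described above amounts to a join on the CAS Registry Number. A minimal sketch follows, assuming the CTD list and the ChEBI/KEGG cross-references have already been loaded into Python structures. The key names (`cas_rn`, `chebi_id`, `kegg_id`) and the function `attach_xrefs` are hypothetical, and the real file formats are not reproduced here.

```python
def attach_xrefs(ctd_rows, chebi_by_cas, kegg_by_cas):
    """Add ChEBI and KEGG IDs to CTD rows by joining on the CasRN.

    ctd_rows: list of dicts, each with an optional "cas_rn" key.
    chebi_by_cas / kegg_by_cas: dicts mapping CasRN -> external ID.
    Rows without a CasRN, or without a match, get None for that ID.
    """
    merged = []
    for row in ctd_rows:
        cas = row.get("cas_rn")
        out = dict(row)
        out["chebi_id"] = chebi_by_cas.get(cas) if cas else None
        out["kegg_id"] = kegg_by_cas.get(cas) if cas else None
        merged.append(out)
    return merged
```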

A sample molecule.ace record:
Molecule : "C009687"
Public_name "wortmannin"
Database "NLM_MeSH" "UID" "C009687"
Database "CTD" "ChemicalID" "C009687"
Database "ChemIDplus" "19545-26-7"
Database "ChEBI" "CHEBI_ID" "52289"
Database "KEGG COMPOUND" "ACCESSION_NUMBER" "C15181"
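
A record of this shape could be generated along the lines of the sketch below. This is not the actual dump_molecule_ace.pl logic: the function `ace_record` and its parameters are hypothetical, and skipping the NLM_MeSH and CTD lines when the name is a WBMol ID is one assumed way to avoid constructing incorrect URLs for molecules that have no MeSH UID.

```python
def ace_record(name, public_name, cas_rn=None, chebi_id=None, kegg_id=None):
    """Format one molecule as .ace text, following the sample record."""
    lines = [f'Molecule : "{name}"', f'Public_name "{public_name}"']
    # Assumption: names starting with "WBMol" are internal IDs, so no
    # MeSH or CTD database lines are written for them.
    if not name.startswith("WBMol"):
        lines.append(f'Database "NLM_MeSH" "UID" "{name}"')
        lines.append(f'Database "CTD" "ChemicalID" "{name}"')
    if cas_rn:
        lines.append(f'Database "ChemIDplus" "{cas_rn}"')
    if chebi_id:
        lines.append(f'Database "ChEBI" "CHEBI_ID" "{chebi_id}"')
    if kegg_id:
        lines.append(f'Database "KEGG COMPOUND" "ACCESSION_NUMBER" "{kegg_id}"')
    return "\n".join(lines)


print(ace_record("C009687", "wortmannin", "19545-26-7", "52289", "C15181"))
```

Called with the wortmannin values, the function reproduces the sample record line for line.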

To make a working list of reference molecules for the various curation efforts, we used Textpresso to scan for all terms on the list that have been published in the C. elegans corpus. The resulting list has fewer than 6,000 terms. The terms that have been identified in the corpus are available here:
http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/ ''This is a directory of files of terms based on the number of times the term appears in the corpus.''
and here:
http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/all ''This is a list of all terms from the previous files concatenated into one.''
This last file is being used as a starting file for molecule look-up by WB curators.
Caveats and notes:

  • The list is now small enough to load into WB if we wanted to, and we know that every term has at least some relevance to the literature (although unverified).
  • The list is small enough to be amenable to editing through ontology editors like OBO-Edit (even though it is not an ontology).
  • We do not have definitions of the terms, nor are the terms arranged in any hierarchical manner. Other databases do have these, however, and we provide links to those websites if an ID is available.
  • Terms and synonyms of terms will be added as needed. This curation effort still needs to be worked out; ideally the list will be incorporated as a selection list for whatever curation tool a curator is using.

[[Category:Curation]] [[Category:User Guide]]
