gene_gene_interaction OA

Interaction OA documentation on the WormBase wiki: http://wiki.wormbase.org/index.php/Gene_Interaction#OA_interface

A central place for curating gene_gene_interaction data, including manual curation from papers and Textpresso sentences. Gary will get interaction IDs for his RNAi objects, Karen for her phenotype curation, and Xiaodong is the point person for working with Juancarlos.

some important scripts

parse_ace_interaction_phenote.pl reads in data from WS220 once (when making the OA) and the interaction data from Gary and Chris (created by Igor's script) repeatedly (monthly)

parse_ace_interaction_oa.pl is the modified version that writes OA-style data

move_ggi_to_int.pl gets Xiaodong's textpresso ggi data into the phenote table (when making the OA)

oa_interaction_dumper is run manually each month for the citace upload

populate_textpresso_ggi_to_OA.pl is run manually. Not ready yet; first clean up the data on mangolassi at /home/acedb/xiaodong/textpresso_ggi/20110106/35225-35725.txt

assign_interaction_ids.pl is a cronjob that assigns WBInteraction IDs to OA objects daily at 4 am

replaceDeadGenesFromInteraction.pl (done on 12/15/2011) reads an input file (27155_interaction.ace or 31465_interaction.ace), replaces dead genes with the gene they were merged into, suppresses objects with completely dead genes, writes an output file (27155_interactionReplaced.ace or 31465_interactionReplaced.ace), and prints the completely dead genes to the screen. The script needs to be run for each upload. Send the output files to Wen for upload.

  • the original file, LargeScaleInteraction.ace, which I got from Wen, actually contains only one large-scale paper, WBPaper00027155, so I changed the file name to 27155_interaction.ace (cp LargeScaleInteraction.ace 27155_interaction.ace)
  • run the script twice, once per input file: 27155_interaction.ace and 31465_interaction.ace. I have to modify the script each time to change the input and output file names
  • send the two output files, 27155_interactionReplaced.ace and 31465_interactionReplaced.ace, with every upload
  • J added two zeros to the interaction IDs in the two *Replace.ace files, and made copies of the two original files, with 7-digit IDs, in files *Replace.ace.7digit. Chris will use the two *Replace.ace files for converting to the new interaction format
  • for future uploads, the newly formatted files that Chris generates for the WS231 upload will be used as input files for the dead gene check. The file is located on tazendra: /home/acedb/xiaodong/oa_interactions_dumper/Large_scale_interactions_WS231.ace
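
The dead-gene replacement described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual replaceDeadGenesFromInteraction.pl: the function name, both lookup tables, and the gene IDs in them are made up, and it assumes that objects in the .ace file are separated by blank lines.

```python
import re

# Hypothetical lookup tables with made-up IDs; the real script would take
# gene status from the WormBase gene data.
MERGED_INTO = {"WBGene00000001": "WBGene00000009"}  # dead gene -> gene it was merged into
COMPLETELY_DEAD = {"WBGene00000003"}                # dead, with no replacement

GENE_RE = re.compile(r"WBGene\d{8}")


def replace_dead_genes(ace_text):
    """Return (rewritten .ace text, sorted list of completely dead genes seen)."""
    kept, dead_seen = [], set()
    # Assumption: objects are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", ace_text.strip()):
        genes = set(GENE_RE.findall(paragraph))
        dead = genes & COMPLETELY_DEAD
        if dead:
            dead_seen |= dead
            continue  # suppress the whole object
        # Swap every merged (dead) gene for its replacement; leave others alone.
        kept.append(GENE_RE.sub(lambda m: MERGED_INTO.get(m.group(0), m.group(0)), paragraph))
    return "\n\n".join(kept) + "\n", sorted(dead_seen)
```

The second return value corresponds to the list of completely dead genes that the script prints to the screen.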

overall plan

  • first, fix andrei's non-directionality in postgres. done on tazendra -J
  • then write a dumper to test it looks okay without IDs.
    • dumper written, doesn't dump variation nor transgene, not tested by requester -J
    • I saw both variation and transgene in OA (sandbox) when I query under Karen's name. Why are they not dumped? -X
    • Only data with Interaction IDs are dumped -J
    • I see. So basically, at this moment, only data from Andrei got dumped, and I don't believe he has transgene/variation in his data. -X
    • tried to dump .ace in sandbox after J repopulated OA this morning (11/10) and tested in acedb. They are good! -X (sandbox)
  • then add textpresso data and test with dumper
    • transferred textpresso data to OA on sandbox, need to do in live site later -J
    • I think now you can transfer my textpresso data to OA please. -X
    • done (11/10/2010) -- J
    • they look good in phenote/OA. I can't test them with dumper now, since they won't get dumped without IDs. -X
    • You should query some through the phenote/OA, add some fake names, and test the dumper / data -- J
    • I tested the dumper and read in acedb with fake ids. They are good! Thanks! -X
    • Great ! -- J (sandbox)
  • then add WS220 .ace data and test with dumper.
    • shall we do this step now? -X
    • We have gone over what the script does with Juancarlos, Gary, and Xiaodong. Run on sandbox, read into postgres. I have checked the data we currently have in phenote/OA and tested the dumper for .ace. They look good. Those objects without either geneone or genetwo have either a transgene or a variation instead, so when dumped to .ace they get the right genes in the .ace file. The transgene dump currently has some issue (as I wrote in the 'How OA dumper works' section). We will talk more on Monday. -X
    • I think I fixed it -- J
    • yes, it's fixed. Thank you! -X
  • then assign IDs to everything without an ID It's time to assign ID now? 11/15 -X assigned IDs on sandbox - J
  • then move phenote -> OA we should be here then. -X 11/23 done 2010 11 29 -- J
  • write OA - shall we now move to this step? -X we haven't yet moved the data from phenote to OA format, just checked that all the data was in shape to be moved -- J OA should work now, except for the textpresso-based new sentences pipeline. we need to talk about this -- J Sure, we can talk anytime you want. we have some notes in this wiki. see 'OA function questions' below. -X oh, great, thanks -J
  • test OA OA looks nice! thanks! I can't tell any obvious wrong now. I will test more and also have Karen and Gary try to enter some mock objects. -X
  • edit and test dumper dumper edited to work with OA format -- J I tested the dumper and read .ace file into acedb. everything went in smoothly. thanks. -X
  • edit OA to deal with textpresso gene gene interaction data ? -- J
  • X moves geneextra to proper effector/effected location and delete 5 extra tables (maybe keep treatment)

OA interface

  • Tab 1
    • PGID
    • Interaction ID generated automatically per entry. Decided to make it an ontology field. -X 12/02
      • need details on what to autocomplete on, what to show in editor / table / term info -- J
      • no other OA config is using this ID. It can autocomplete on numbers. I would like the curator's name shown in term info. -X 12/13
      • Done 12/13, but be aware that if you edit that ID to be anything else, you won't be able to bring it back, because it will no longer be an allowed ID in the ontology. So if you want to blank a duplicate so it gets a new ID from the cronjob, that's good. If you mistakenly make a typo and assign a correct ID's value to some other ID, you will _not_ be able to bring it back (because it's an ontology) without going to postgres directly and editing the int_name and int_name_hst tables by pgid (in postgres called joinkey). You'll have to note the pgid and then manually change it in postgres. If you were testing the data for WBInteraction0352015 and WBInteraction0352014 by manually assigning those IDs, you can no longer do it, and you should instead make them blank and use the script that assigns new IDs to entries that are blank. -- J
      • Some more thoughts. We're getting the autocomplete from the objects that exist, which are in the int_name table; like other OAs, it only autocompletes on what is. We could in this case autocomplete on what could be, meaning the interaction_ticket table, int_index. In that case it would autocomplete on any ticket that was created, but you wouldn't know what's in it, or who has curated it, just that it was created at some point and it may have existed in the past and been deleted, or been checked out and will exist in the future. Part of the reason this is different is that we're dealing with objects that may or may not exist. Another reason is that in stuff like paper, it's not stored in OA format, and the pgid is the same as the object ID; here the pgid is essentially random to allow for one object existing in multiple rows, so two different pgids could refer to any given object ID. At the moment interaction doesn't need this, and we worked at it so that it was set up in a way that it would work out okay this way, but we don't know if it will stay that way (?). It might also be some hassle to parse everything so the pgids and the object IDs match up. I'm not sure it'd be worth it. Essentially, it's about deciding whether a "valid" Interaction ID is one that exists in the OA (like it is now) or one that was created with the interaction ticket form -- J
    • Non_directional toggle. Off (default) means the interaction is directional and involves effector and effected parties. On (the color changes on click) means it is non_directional.
    • Interaction Type dropdown list with the 11 types shown in the .ace template
    • Effector Gene autocomplete WBGene, multiontology. corresponding to interactor in .ace file. order does not matter when dump to interactors
    • Effector Variation WBGene, WBVar, autocomplete multiontology on variation, store in separate lines ->.ace, Interactor "WBGene" Variation "WBVar". use name server to map variation to gene, or the file Karen gave you to map variation to gene for variation OA.
    • Effector Transgene_Name autocomplete name, ontology
    • Effector Transgene_Gene autocomplete WBGene, multi-ontology, ->.ace, Interactor "WBGene" Transgene "id". In case of multi genes, WBGene is followed by same transgene id.
    • Effector Other Type dropdown list with 'Chemical' and 'Transgene'
    • Effector Other normal text field
    • Effected Gene autocomplete WBGene, multiontology, corresponding to interactor in .ace file. order does not matter when dump to interactors
    • Effected Variation WBGene, WBVar, multiontology, autocomplete on Variation, store in separate lines ->.ace, Interactor "WBGene" Variation "WBVar". use name server to map variation to gene, or the file Karen gave you to map variation to gene for variation OA.
    • Effected Transgene_Name ontology, autocomplete on name
      • Transgene object names ?
      • Yes, transgene object name, e.g. iaIs3. -X
    • Effected Transgene_Gene multi-ontology, autocomplete WBGene -> .ace, Interactor "WBGene" Transgene "id". In case of multiple genes, each WBGene is followed by the same transgene id.
      • One WBGene for each .ace line ? Make sure you really want it this way, we can go with product/promoter if that's what you want, just make sure it's what you want. It matters having extra fields and scrolling and so forth. You'll see when the text fields become multi-ontology and ontology. TODO: when this is confirmed as final, recreate the tables and rename them -- J
      • Yes, one WBGene per .ace line. In case of multiple genes, they are followed by the same transgene name. -X
    • Effected Other Type dropdown list with 'Chemical' and 'Transgene'
    • Effected Other normal text field
    • Note: Gene, Variation, and Transgene_Gene all refer to different genes. There is no pairing problem.
  • Tab 2
    • Curator dropdown list
    • Paper ontology
    • Person multiontology -X
    • RNAi ID free text field
    • Phenotype multiontology
    • Remark big text
    • Sentence ID sentence shows in term info
    • False Positive toggle. If the sentence is a false positive (it contains no interaction info), the entry will not get an ID and will not be dumped.

.ace template for dumping

  • Interaction : ""
  • Interactor "WBGene" Variation ""
  • Interactor "WBGene" Transgene ""
  • Interactor "WBGene"
  • Interaction_type Genetic Effector ""
  • Interaction_type Genetic Effected ""
  • Interaction_type Genetic Non_directional ""
  • Interaction_type Genetic Interaction_RNAi ""
  • Interaction_type Genetic Interaction_phenotype ""
  • Interaction_type Regulatory Effector ""
  • Interaction_type Regulatory Effected ""
  • Interaction_type Regulatory Non_directional ""
  • Interaction_type Regulatory Interaction_RNAi ""
  • Interaction_type Regulatory Interaction_phenotype ""
  • Interaction_type No_interaction Effector ""
  • Interaction_type No_interaction Effected ""
  • Interaction_type No_interaction Non_directional ""
  • Interaction_type No_interaction Interaction_RNAi ""
  • Interaction_type No_interaction Interaction_phenotype ""
  • Interaction_type Predicted_interaction Effector ""
  • Interaction_type Predicted_interaction Effected ""
  • Interaction_type Predicted_interaction Non_directional ""
  • Interaction_type Predicted_interaction Interaction_RNAi ""
  • Interaction_type Predicted_interaction Interaction_phenotype ""
  • Interaction_type Physical_interaction Effector ""
  • Interaction_type Physical_interaction Effected ""
  • Interaction_type Physical_interaction Non_directional ""
  • Interaction_type Physical_interaction Interaction_RNAi ""
  • Interaction_type Physical_interaction Interaction_phenotype ""
  • Interaction_type Suppression Effector ""
  • Interaction_type Suppression Effected ""
  • Interaction_type Suppression Interaction_RNAi ""
  • Interaction_type Suppression Interaction_phenotype ""
  • Interaction_type Enhancement Effector ""
  • Interaction_type Enhancement Effected ""
  • Interaction_type Enhancement Interaction_RNAi ""
  • Interaction_type Enhancement Interaction_phenotype ""
  • Interaction_type Synthetic Non_directional ""
  • Interaction_type Synthetic Interaction_RNAi ""
  • Interaction_type Synthetic Interaction_phenotype ""
  • Interaction_type Epistasis Effector ""
  • Interaction_type Epistasis Effected ""
  • Interaction_type Epistasis Interaction_RNAi ""
  • Interaction_type Epistasis Interaction_phenotype ""
  • Interaction_type Mutual_enhancement Non_directional ""
  • Interaction_type Mutual_enhancement Interaction_RNAi ""
  • Interaction_type Mutual_enhancement Interaction_phenotype ""
  • Interaction_type Mutual_suppression Non_directional ""
  • Interaction_type Mutual_suppression Interaction_RNAi ""
  • Interaction_type Mutual_suppression Interaction_phenotype ""
  • Paper ""
  • Remark ""

Getting Data into OA Step by Step

give directionality in int_nondirectional to all values that have a type in int_type tables

  • checked that all entries with a curator have a type and viceversa done -X
  • checked that all types are valid from list of 11 done -X
  • assigned type as appropriate with timestamp matching the original time the type was entered into postgres. If you want a current timestamp instead, let me know and I'll change it. How shall I check on this? -X You don't check it, it's just if you want the timestamp to have today's timestamp (the timestamp of when we add the data, not literally today) instead of the timestamp of when type went into postgres. You can only see it by doing a psql query from the command prompt or the referenceform.cgi Or it would matter if you later want me to query postgres for some data and you'll have some timestamp restrictions on it, or care when something went into postgres -J OK, got you. I think original time is OK. -X
  • always assign int_nondirectional if there's a type, writing a blank as opposed to NULL or not writing anything.
    • what do you mean here? A type can be directional if it is 'enhancement, suppression...etc' -X
    • If there's no type we don't write to the nondirectional table, we write to the error file. If there's a type, we write 'Non_directional' or '' (a blank), as appropriate -J
    • OK. You are right -X
    • parser on tazendra at
/home/postgres/work/pgpopulation/interaction/20101006_OA_newtables/populate_int_nondirectional.pl
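
The rule above (a typed entry always gets a value, either 'Non_directional' or a blank; an untyped entry goes to the error file) can be sketched in Python as below. This is an illustration only, not the contents of populate_int_nondirectional.pl. The split of the 11 types into two groups is borrowed from the move_ggi_to_int.pl description elsewhere on this page, and whether this script uses the same split is an assumption; the function names and the row format are hypothetical.

```python
# Grouping borrowed from the move_ggi_to_int.pl notes (an assumption for this script).
NON_DIRECTIONAL_TYPES = {
    "Genetic", "No_interaction", "Predicted_interaction", "Physical_interaction",
    "Synthetic", "Mutual_enhancement", "Mutual_suppression",
}
DIRECTIONAL_TYPES = {"Regulatory", "Suppression", "Enhancement", "Epistasis"}


def nondirectional_value(int_type):
    """Value to store in int_nondirectional for a given interaction type."""
    if int_type in NON_DIRECTIONAL_TYPES:
        return "Non_directional"
    if int_type in DIRECTIONAL_TYPES:
        return ""  # an explicit blank, not NULL and not a missing row
    raise ValueError(f"not one of the 11 valid types: {int_type!r}")


def populate(rows):
    """rows: iterable of (pgid, type or None). Returns (values by pgid, error pgids)."""
    values, errors = {}, []
    for pgid, int_type in rows:
        if not int_type:
            errors.append(pgid)  # no type: nothing written, reported as an error
            continue
        values[pgid] = nondirectional_value(int_type)
    return values, errors
```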

parsing WS220 interaction objects into postgres, as well as .ace files from Gary and Chris:

OBSOLETE script for parsing .ace file

located:/home/postgres/work/pgpopulation/interaction/20101004_xiaodong_start/parse_ace_interaction.pl

OBSOLETE using Wen's interaction .ace source file

located on tazendra at : /home/postgres/work/pgpopulation/interaction/20101004_xiaodong_start/WS220Interaction_SmallScale.ace /home/postgres/work/pgpopulation/interaction/20101004_xiaodong_start/WS220Interaction_To_Read.ace

parsing script is on mangolassi

  • /home/postgres/work/pgpopulation/interaction/20101110_ace_to_int holds the script parse_ace_interaction_phenote.pl
  • /home/postgres/work/pgpopulation/interaction/20101205_ace_to_OA/parse_ace_interaction_oa.pl is the modified version for writing OA-style data. usage: ./parse_ace_interaction_oa.pl <filename> WBPerson<your_number> -- J
  • the source file is WS220Interaction_To_Read.ace for the _phenote script, and the <filename> argument for the _oa script
  • the curator argument is checked for correct WBPerson format, but not for being a valid curator person.
  • errors are written to parse_ace_interaction_phenote.err for the phenote script and parse_ace_interaction_oa.err for the oa script -- J (there are no errors as of 11/17/10). Note: if an object (paragraph) has an error, the entire object is skipped. If there is an error when you read the file in, just correct the error in the original file and re-run the script.

special note for future getting Gary and Chris data into OA

  • Reading file created by Igor's script into aceDB
    • you can use the empty database by ssh -X citpub@spica.caltech.edu
    • then cd CitaceMirror
    • then type 'ts' to launch an empty acedb
  • Dumping no-worry .ace file
  • Then parse into OA

What does script (parse_ace_interaction_phenote.pl / parse_ace_interaction_oa.pl) do

  • get the latest pgid from the highest joinkey::integer from int_curator
  • remember the old objects names from int_name
  • read the source file (gary's new interaction objects) paragraph by paragraph
  • it matches the names with /Interaction : \"WBInteraction(\d+)\"/. If a paragraph doesn't match, it is skipped; if the name is already in the list, it is skipped. It removes all the backslashes (\). If it matches 'Non_directional' in the paragraph, the interaction is considered non_directional.
    • TODO: interaction data from Gary and Chris in .ace format will sometimes have 'split' interaction data for the same object, that is, the interactors will be split into different paragraphs for the same WBInteraction id. See Gary for info if necessary.
    • Done: When reading entries, read them into a hash, with the first line being a key, and every other line being filtered as a key; then put all entries back together for the rest of the script to loop entry by entry. Tried to test this, but my data is no good, so only got errors, please try with real data, although script doesn't write to postgres yet for testing. -- J
    • tried using test_sample.ace containing Karen's objects with transgenes, and Gary's objects with split paragraphs. It ran fine and gave no errors (after removing comment lines). -X 12/07
  • for directionality : Effector-s are considered geneone and Effected-s are considered genetwo. If it has either, it is considered directional.
  • If both directionality and non-directionality coexist it's an error. If neither exists, it's an error.
  • If there's no geneone, look for Non_directional genes. Puts the first in geneone, the rest in genetwo. If there are fewer than 2 genes, it's an error.
  • concatenate geneone and genetwo with a |, but TODO for OA format convert this to "," format DONE, also checks against gin_wbgene and gives an error if it doesn't match -- J
  • For any paragraph, only one of the following interaction Types can be specified; if no interaction type or multiple types are specified in a paragraph, an error will be given : Genetic No_interaction Predicted_interaction Physical_interaction Synthetic Mutual_enhancement Mutual_suppression Regulatory Suppression Enhancement Epistasis
  • The entry for Interaction_phenotype must adhere to the following "WBPhenotype:#". If a : is missing, it will be added. If there aren't exactly 7 digits it will give an error. It will filter out duplicates. It will concatenate with a | TODO replace | with "," when OA
  • Only one Interaction_RNAi can be specified per paragraph. If more than one is given, an error will occur. (it filters duplicates, allows zero or one)
  • For Remark, filter duplicate and concatenate multiple remarks with two spaces.
  • for Paper, filter duplicates and give error if more than one.
  • It searches through all geneone-s, looking for "<geneone>"<spaces>Variation<spaces>"<something>", and that something becomes variation one TODO make it be in WBVar format when OA DONE checks against ids in obo_name_app_variation, if not, gives an error. Multiple variations for one gene is an error. Different variations for different genes is okay. Joining those with <comma><space> TODO make it "," when OA DONE -- J
  • Transgene tag is almost the same as Variation. TODO see Variation. Same as variation but getting names from trp_name -- J
    • change the parser so that it accepts not only the transgene name but also the transgene gene that precedes the transgene name; this transgene gene will be in double quotes. In addition, there could be multiple transgene genes associated with one transgene name. If this is the case, then all transgene genes should be associated with that transgene name. This should not be a problem, because each transgene gene will be treated as a separate interactor (with the same transgene name). This will be very rare. An example of a transgene gene that is not being parsed correctly is found in WBInteraction0009222, where only the transgene name is shown in OA. -X
    • both the _phenote and the _oa parsers now populate transgene genes, but none of the data I have has multiple genes to test that aspect works -- J
    • Gary made a bogus object (WBInteraction0352014) with two genes associated with the same transgene name. See the 'important scripts' section above. It was read into OA fine by 'parse_ace_interaction_oa.pl' and dumped correctly by 'oa_interaction_dumper'. I duplicated the object, changing it to 'enhancement' (WBInteraction0352015). It got dumped fine as well. -X 12/13
  • genetwo is the same as geneone for variation and transgene TODO repeat DONE -- J
  • We're getting rid of geneextra / variationextra / transgeneextra, not dealing with it here.
  • X and Gary don't expect to see any #Evidence, so getting rid of checks for Evidence, which contained checks for Person_evidence, Curator_confirmed, and "other".
  • get next highest pgid (from before from int_curator, add 1 to it) and make that the pgid / joinkey.
  • Curator and Non_directional will be written every time (whether Non_directional or <blank>). Any of the other data will be added if there's any data. TODO when changing this script for RNAi make it take WBPerson from command line to be the curator. done -- J
  • Note that transgeneonegene and transgenetwogene are not being read in from .ace file -- J
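
A few of the steps above (merging 'split' paragraphs, the geneone / genetwo rules, and the phenotype ID check) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Perl parser itself: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and the gene lines are assumed to contain `Effector "WBGene..."`, `Effected "WBGene..."` or `Non_directional "WBGene..."`.

```python
import re

NAME_RE = re.compile(r'Interaction : "WBInteraction(\d+)"')


def read_entries(ace_text, known_names=frozenset()):
    """Group .ace lines by interaction name, merging 'split' paragraphs."""
    entries = {}
    # Remove backslashes, then split on blank lines (assumed paragraph separator).
    for paragraph in re.split(r"\n\s*\n", ace_text.replace("\\", "").strip()):
        lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
        match = NAME_RE.match(lines[0]) if lines else None
        if not match:
            continue  # not an interaction paragraph: skipped
        name = "WBInteraction" + match.group(1)
        if name in known_names:
            continue  # name already exists (int_name): skipped
        merged = entries.setdefault(name, {})
        for ln in lines[1:]:
            merged[ln] = True  # dict keys filter duplicate lines and keep order
    return {name: list(merged) for name, merged in entries.items()}


def split_genes(lines):
    """Return (geneone, genetwo, nondirectional) for one merged entry."""
    text = "\n".join(lines)
    effectors = re.findall(r'Effector "(WBGene\d+)"', text)
    effecteds = re.findall(r'Effected "(WBGene\d+)"', text)
    nondir = re.findall(r'Non_directional "(WBGene\d+)"', text)
    directional = bool(effectors or effecteds)
    if directional and nondir:
        raise ValueError("both directional and non-directional genes")
    if not directional and not nondir:
        raise ValueError("neither directional nor non-directional genes")
    if nondir:
        if len(nondir) < 2:
            raise ValueError("fewer than 2 Non_directional genes")
        return [nondir[0]], nondir[1:], True  # first gene -> geneone, rest -> genetwo
    return effectors, effecteds, False


def normalize_phenotype(value):
    """Add a missing ':' and require exactly 7 digits."""
    match = re.fullmatch(r"WBPhenotype:?(\d+)", value)
    if not match or len(match.group(1)) != 7:
        raise ValueError(f"bad phenotype ID: {value!r}")
    return "WBPhenotype:" + match.group(1)
```

Using a dict per interaction name mirrors the "read them into a hash" approach in the notes: the header line is the key, and repeated lines from split paragraphs collapse into one.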

fixing andrei's missing genetwo

  • file to be fixed is located in mangolassi:

/home/acedb/xiaodong/WS220Interaction_To_Read.ace don't use this file. nothing is going to be fixed in this file. J will suppress errors related to these objects in err file. -X

  • there are 75 objects missing genetwo: Andrei's from pgid 9210 to 9283, and one object at pgid 10123.
    • Gary and X are fixing them manually in the source file mentioned above, by adding 'Interactor "WBGene" Enhancement Effected "WBGene"'.
    • Gary and Xiaodong looked into these objects. The effected interactor is actually a transgene with a human gene. Since there is no C. elegans gene matching FTDP-12, we decided to dump these objects as they are, without giving errors in the error file. These are WBInteraction0050670 to WBInteraction0050743. -X
    • X confirms that the .ace looks okay, J is suppressing genetwo errors if the interaction ID is in that range in the .ace dumper -- J
  • Andrei's objects with the human gene will get fixed manually when the OA goes live. The human gene FTDP12 (transgene) will be filled in in the 'Effector Other' field. -X 12/16
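
The range-based suppression of genetwo errors could look something like the Python sketch below. The ID range comes from the notes above; the function name and the way the dumper would call it are assumptions.

```python
import re

# WBInteraction0050670 to WBInteraction0050743: Andrei's objects with the human gene.
SUPPRESS_FIRST, SUPPRESS_LAST = 50670, 50743


def should_report_missing_genetwo(interaction_id):
    """True if a missing-genetwo error for this ID should go to the error file."""
    match = re.fullmatch(r"WBInteraction(\d+)", interaction_id)
    if not match:
        return True  # unexpected ID format: do not hide the error
    return not (SUPPRESS_FIRST <= int(match.group(1)) <= SUPPRESS_LAST)
```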

brief q&a

  1. Q: there are 1290 RNAi-based interaction objects that do not exist in the phenote table and will be read in. Why not into the OA table directly? A: Because there'd be no way to check the data until the OA is ready, which we could do first, but then we'd be rewriting the Andrei parser for OA data instead of using the existing one for Phenote data, and check it with Phenote or the OA setup to look at Phenote data. We could do that; should we switch everything over first, write the OA, then write the parsers afterward ? Also we'd have to write the obsolete object checker twice, once in converting from phenote to OA, and once in parsing the data directly to OA, instead of writing it once in phenote to OA if data goes through the phenote stage first -- J
  2. I think some objects in current OA (andrei's data) are in WS220 file as well. will they be duplicated? or they overwrite those objects in OA? andrei's data in WS220 might have updated paperID which are invalid papers in 'err.out.20101110'. -- X The script compares all the names in the int_name table with the names in the .ace file. If any names are the same, it skips that .ace paragraph. -J does that mean the objects with updated correct paperIDs won't get read in from .ace to OA, because they have same object names (WBInteractionID) and will get skipped to read? That's fine. I can remember to fix those invalid paperIDs when OA go live later. -X You fixed those in the ToRead.ace file instead of on the postgres table ? Any data in the .ace with an existing name already in the tables will not get read, so if you fixed the .ace for something that won't get read, it won't get read -- J I didn't fix those paperIDs in To_read file. The paperIDs might have been updated by Wen through some scripts automatically each time. I understand those objects won't get read because they have same object IDs as those in postgres table. -X If they were fixed before she dumped it, then those entries would have ended up the To_Read.ace, but you're right you haven't done any manual fixes on those -- J If any entries in the .ace file don't have a matching name, it will get created as a new entry, unless you tell me a way to check the file's entries with the postgres tables to declare they're the same and skipping them. I can't think of any now. -X Off the top of my head, we could skip all entries that have that specific remark and not read those in. actually, shouldn't we do the opposite way, read in those objects with that specific remarks from WS220, since they have updated valid paper ids to replace those same objects but with obsolete invalid paper IDs in phenote/OA now? 
-X So you want to delete all the old data that exists in the postgres tables from before for names that match in the .ace file ? -J to be more specific, I want to delete all objects with the remark of 'Interaction data was extracted by a curator from sentences enriched by Textpresso. The interaction was attributed to the paper(s) from which it was extracted.' in postgress table and replace them by the same objects in WS220_To_Read.ace. since these same objects in WS220 have updated valid paperIDs. Does that make sense? -X I think so. I still don't know about what specifically you'd like to delete and I'm concerned there's some data that we wouldn't be dealing with. If this is just to avoid fixing the obsolete papers, are there a lot of papers ? I don't know what the ramifications of deleting the PGIDs for all those tables is. In theory it should work, but I felt the same way when you caught that some entries were getting deleted that shouldn't have been. Do you want to just query them through the phenote-like OA and delete them through that, and hope that works out ? -- J TODO: invalid papers will be fixed when OA is live.-X I saw this but figured you wanted it bold to remember later -- J yes, you are right. -X
  3. Q:One thing would be that if there are some fields in the OA that are not in the .ace file, should we delete those as well, or only the data that is being overwritten ? If the former, couldn't we lose data, if the latter, wouldn't we have mixed data ? -- J I think we decided to keep those fields but do not read in OA now. -X I don't understand what you mean by "do not read in OA now" -- J there are five extra tables. extragene and treatments will be mapped to proper table later. extraevidence, extravariation, and extratransgene have no data in them, will be deleted.
  4. Q: The script does a lot of stuff, did we ever sit down and talk about what it did ? If not, we should. I feel like we did, but I could be confusing this with something I did with Karen, or with you about a different datatype. We should go over it because we've changed the way we're dealing with transgene and variation Genes, and I don't know if stuff still makes sense. It would also be good if you document it so that if any entries look wrong (hopefully when you do a .ace dump after reading this data in) you can track why it's not read correctly and let me know how to fix it. Should we talk after the meeting tomorrow ? -- J Yes, we can talk about it tomorrow after meeting. -X sounds good -- J
  5. Q: Gary and Chris's data is going to be in .ace format to keep reading in in batches, right ? If so we'd use this same script since the input would be in the same format. - J in this case, we should change the script back as it is after finishing this particular reading, right? - X We're only now writing the script how should it be different for WS220 and Gary's data ? I'd have thought they were the same since it's all in .ace format, and the script would have to deal with it all the same way - J the script 'parse_ace_interaction_phenote.pl' will be used once for reading WS220_To_Read file into OA, and the modified parse_ace_interaction_oa.pl will be used monthly to get Gary and Chris's data in to OA.

get xiaodong's textpresso data into phenote table

Here's what the move_ggi_to_int.pl parsing script is doing

  • getting data from ggi_gene_gene_interaction
  • convert locus / sequence to wbgene by stripping spaces and comparing to gin_sequence, gin_synonyms, gin_locus (errors if no match)
  • prepending "xiaodong001 : " to sentence ID (this might get replaced later when we figure out how to deal with sentence IDs)
  • getting the paper by capturing match of sentence ID to /WBPaper\d+/ comparing that to pap_status = 'valid' and to pap_identifier to get appropriate paper (if this is not what we want, let me know) (error if no match)
  • match type to Regulatory Suppression Enhancement Epistasis for not-Non_directional, Genetic No_interaction Predicted_interaction Physical_interaction Synthetic Mutual_enhancement Mutual_suppression for Non_directional. (error if no match).
  • always write curator (WBPerson1760) geneone genetwo sentid type nondirectional (even if blank, as opposed to leaving no entry nor entering NULL) paper
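
The gene, sentence ID, and paper steps above can be sketched in Python as follows. This is an illustration rather than move_ggi_to_int.pl itself: `name_to_wbgene` stands in for lookups against gin_sequence, gin_synonyms and gin_locus, `valid_papers` stands in for the pap_status / pap_identifier check, and all function names are hypothetical.

```python
import re

PAPER_RE = re.compile(r"WBPaper\d+")
SENTENCE_PREFIX = "xiaodong001 : "
CURATOR = "WBPerson1760"


def to_wbgene(name, name_to_wbgene):
    """Map a locus / sequence name to a WBGene ID after stripping spaces."""
    key = name.replace(" ", "")
    if key not in name_to_wbgene:
        raise ValueError(f"no WBGene match for {name!r}")
    return name_to_wbgene[key]


def paper_for_sentence(sentence_id, valid_papers):
    """Extract the WBPaper ID embedded in the sentence ID and check it is valid."""
    match = PAPER_RE.search(sentence_id)
    if not match or match.group(0) not in valid_papers:
        raise ValueError(f"no valid paper for sentence {sentence_id!r}")
    return match.group(0)


def build_row(sentence_id, gene_one, gene_two, name_to_wbgene, valid_papers):
    """Assemble the fields written for one textpresso sentence (type omitted here)."""
    return {
        "curator": CURATOR,
        "geneone": to_wbgene(gene_one, name_to_wbgene),
        "genetwo": to_wbgene(gene_two, name_to_wbgene),
        "sentid": SENTENCE_PREFIX + sentence_id,
        "paper": paper_for_sentence(sentence_id, valid_papers),
    }
```

Raising an error on any failed lookup matches the "error if no match" behaviour noted for each step.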

parser is on mangolassi

/home/postgres/work/pgpopulation/interaction/20101005_ggi_to_int/move_ggi_to_int.pl

transfer data from phenote table to OA

parser script is on mangolassi

  • /home/postgres/work/pgpopulation/interaction/20101117_phenote_to_OA/interactionPhenoteToOA.pl
  • errors are in file: /home/postgres/work/pgpopulation/interaction/20101117_phenote_to_OA/errors_interactionPhenoteToOA
  • ran on mangolassi on 2010 11 29 -- J
  • great job mapping all those entries. The errorfile now only has 4 entries : -- J thanks for your help. Karen fixed in phenote -X
    • 8501 Genetic, nondirectional
    • 8519 Genetic, nondirectional
    • WBGene00000398 -> WBGene00001475
    • WBGene00020676 -> valid gene in wormbase
  • invalid gene mappings -X (11/17)
    • WBGene00003004 -> this is a valid wormbase gene. I don't know why your script keeps picking it out.
    • WBGene00001219 -> WBGene00003025
    • WBGene00018832 -> WBGene00003929
    • WBGene00004799 -> valid gene in wormbase
    • WBGene00001254 -> WBGene00015981
    • WBGene00007037 -> WBGene00002148
    • WBGene00007040 -> WBGene00000445
    • WBGene00004289 -> WBGene00018285
    • WBGene00004763 -> WBGene00001258
    • WBGene00015314 -> WBGene00002717
    • WBGene00003871 -> WBGene00005744
    • WBGene00009376 -> WBGene00009375
    • WBGene00020734 -> WBGene00044623
    • WBGene00003819 -> WBGene00002889
    • WBGene00016498 -> WBGene00003242
  • bad paper mappings -X 11/18
    • WBPaper00005944 -> WBPaper00005822
    • WBPaper00006145 -> WBPaper00005909
    • WBPaper00006518 -> WBPaper00013396
    • WBPaper00013354 -> WBPaper00006377
    • WBPaper00013357 -> WBPaper00006391
    • WBPaper00013358 -> WBPaper00006388
    • WBPaper00013392 -> WBPaper00024188
    • WBPaper00013397 -> WBPaper00006519
    • WBPaper00013417 -> WBPaper00024228
    • WBPaper00013428 -> WBPaper00024234
    • WBPaper00013429 -> WBPaper00024194
    • WBPaper00013436 -> WBPaper00024207
    • WBPaper00013437 -> WBPaper00024206
    • WBPaper00013438 -> WBPaper00024218
    • WBPaper00013446 -> WBPaper00024213
    • WBPaper00013459 -> WBPaper00024262
    • WBPaper00013464 -> WBPaper00024423
    • WBPaper00013501 -> WBPaper00024307
    • WBPaper00013512 -> WBPaper00024212
    • WBPaper00013518 -> WBPaper00024430
    • WBPaper00013525 -> WBPaper00024211
    • WBPaper00023885 -> WBPaper00024303
    • WBPaper00005694 -> WBPaper00013312
    • WBPaper00006287 -> WBPaper00006247
    • WBPaper00013431 -> WBPaper00024210
    • WBPaper00024499 -> WBPaper00024985
    • WBPaper00024384 -> WBPaper00024898
    • WBPaper00024371 -> WBPaper00024474
    • WBPaper00023910 -> WBPaper00024301
    • WBPaper00024936 -> WBPaper00024670
    • WBPaper00013519 -> WBPaper00024321
    • WBPaper00000962 -> WBPaper00000880
    • WBPaper00024282 -> WBPaper00024451
    • WBPaper00023886 -> WBPaper00024450
    • WBPaper00024938 -> WBPaper00024542
    • WBPaper00025027 -> WBPaper00025132
    • WBPaper00025042 -> WBPaper00025164
    • WBPaper00024702 -> WBPaper00025147
    • WBPaper00024499 -> WBPaper00024985
    • WBPaper00024500 -> WBPaper00024986
    • WBPaper00024928 -> WBPaper00025140
    • WBPaper00024410 -> WBPaper00024876
    • WBPaper00024935 -> WBPaper00025001
    • WBPaper00024332 -> WBPaper00013507
    • WBPaper00024963 -> WBPaper00025151
    • WBPaper00025060 -> WBPaper00025138
    • WBPaper00024565 -> WBPaper00024891
    • WBPaper00024701 -> WBPaper00025148
    • WBPaper00024969 -> WBPaper00025114
    • WBPaper00024948 -> WBPaper00024886
    • WBPaper00025015 -> WBPaper00025135
    • WBPaper00013460 -> WBPaper00024263
    • WBPaper00024384 -> WBPaper00024898
    • WBPaper00024362 -> WBPaper00024532
    • WBPaper00013460 -> WBPaper00024263
    • WBPaper00024677 -> WBPaper00025000
    • WBPaper00025043 -> WBPaper00026596
  • bad variation mappings -X 11/18
    • nDp2 -> not a variation anymore, this is a rearrangement -> I fixed this entry in phenote, removed nDp2 from the variation field and added the gene cbp-1 as the interactor.
    • e63 -> WBVar00296783 (just created)
    • n676 n930 -> WBVar00089665
    • n2245 -> WBVar00296784
    • sc768 -> WBVar00296785
  • There are 22 objects in the original phenote data with a timestamp like "" -O ""2004-03-25_16:58:25_ck1 dangling at the end of the remarks. This caused a wrong-tag message when read into acedb. X deleted the double quotes in tazendra and left the -O stuff in the remarks. X tried the same with two objects (pgid 8119 and 8120) in sandbox and redumped the .ace file; they looked weird but didn't cause problems anymore. Will let J leave the -O stuff in remarks. -X 11/22 After fixing (deleting quotes in remarks), J repopulated mangolassi from tazendra, populated ggi + ws220, assigned IDs. -11/22 night. X tried the dumper .ace again. All went well except pgid 8299 (WBInteraction0052313), which gave an error message for an unknown tag. Checked and found, again, an extra "" in the remark. Deleted it in both sandbox and tazendra and reran the dumper. Everything went through acedb perfectly. -X 11/23 great !

Parser script does:

  • get genes from gin_wbgene, transgenes from trp_name, variations from obo_name_app_variation, phenotypes from obo_name_app_term, person IDs from two_standardname, paper IDs from pap_status
  • These are the 11 valid types : Genetic Regulatory No_interaction Predicted_interaction Physical_interaction Suppression Enhancement Synthetic Epistasis Mutual_enhancement Mutual_suppression
  • checks int_name for <start>WBInteraction<7digits><end>
  • int_nondirectional for <blank> or Non_directional
  • int_type for any of 11 types above
  • int_rnai for <start>WBRNAi<8digits><end>
  • int_phenotype splits on | and checks each term vs list above
  • int_transgeneone int_transgenetwo splits on | and checks each term vs list above
  • int_geneone int_transgeneonegene int_genetwo int_transgenetwogene splits on | and checks each term vs list above
  • int_variationone int_variationtwo splits on <comma><space> and for each term it matches against IDs from list above, or if not against names from list above
  • int_paper against list above
  • int_person against list above
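
The checks listed above can be sketched roughly as follows. This is an illustration in Python, not the actual Perl parser: the function name `check_row`, the `known` dictionary and its keys are hypothetical, and treating blank fields as "nothing to check" is an assumption.

```python
import re

# The 11 valid interaction types listed above.
VALID_TYPES = {
    "Genetic", "Regulatory", "No_interaction", "Predicted_interaction",
    "Physical_interaction", "Suppression", "Enhancement", "Synthetic",
    "Epistasis", "Mutual_enhancement", "Mutual_suppression",
}
INTERACTION_ID = re.compile(r"WBInteraction\d{7}")
RNAI_ID = re.compile(r"WBRNAi\d{8}")

# Fields that hold pipe-separated terms, and the (hypothetical) lookup set for each.
PIPE_FIELDS = {
    "int_phenotype": "phenotypes",
    "int_transgeneone": "transgenes", "int_transgenetwo": "transgenes",
    "int_geneone": "genes", "int_transgeneonegene": "genes",
    "int_genetwo": "genes", "int_transgenetwogene": "genes",
}


def check_row(row, known):
    """Return a list of error strings for one row (dict of int_* fields)."""
    errors = []

    def value(field):
        return (row.get(field) or "").strip()

    # Assumption: a blank field is skipped rather than reported.
    if value("int_name") and not INTERACTION_ID.fullmatch(value("int_name")):
        errors.append("int_name: not WBInteraction + 7 digits")
    if value("int_nondirectional") not in ("", "Non_directional"):
        errors.append("int_nondirectional: must be blank or Non_directional")
    if value("int_type") and value("int_type") not in VALID_TYPES:
        errors.append("int_type: not one of the 11 valid types")
    if value("int_rnai") and not RNAI_ID.fullmatch(value("int_rnai")):
        errors.append("int_rnai: not WBRNAi + 8 digits")
    for field, family in PIPE_FIELDS.items():
        for term in filter(None, value(field).split("|")):
            if term not in known[family]:
                errors.append(f"{field}: unknown term {term}")
    # Variations are comma-space separated; match IDs first, then names.
    for field in ("int_variationone", "int_variationtwo"):
        for term in filter(None, value(field).split(", ")):
            if term not in known["variation_ids"] and term not in known["variation_names"]:
                errors.append(f"{field}: unknown variation {term}")
    for field, family in (("int_paper", "papers"), ("int_person", "persons")):
        if value(field) and value(field) not in known[family]:
            errors.append(f"{field}: unknown value {value(field)}")
    return errors
```

In this sketch each lookup set would be filled from the postgres tables named in the first bullet (gin_wbgene, trp_name, and so on) before any rows are checked.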

things need to be done later

  • re-match obsolete genes in Andrei's data
  • delete 5 extra tables : (backed up on tazendra at /home/postgres/work/pgpopulation/interaction/20101111_backup_int_treatment_int_geneextra/ for geneextra and treatment, other tables never had any data.)
    • geneextra (12)
    • variationextra (0)
    • transgeneextra (0)
    • otherevi (0)
    • treatment (84)
  • get rid of treatment table, which we don't need.
  • get rid of transgeneextra, variationextra, otherevi, which have no data.
  • X will enter the 12 geneextra after the OA is live.

assign IDs

to xiaodong's objects and to phenotype-based interaction objects without IDs from current pool

  • here's the query to see the 531 existing objects without names, if you want to check that they should all get names / aren't duplicates of something else : SELECT * FROM int_curator WHERE joinkey NOT IN (SELECT joinkey FROM int_name); -- J on sandbox ran /home/postgres/work/pgpopulation/interaction/20101116_assignIDs/assignIDs.pl to assign IDs to 932 objects. TODO on tazendra: run this when going live. X, you should check that the dumper works okay, and that you can query / change stuff in the OA and it still dumps okay. The OA still doesn't assign new IDs. That should be all the objects without object names, but if you find any, please let me know. -- J checked. it's cool. -X
  • ids can be queried by 'SELECT * FROM int_name WHERE int_timestamp > '2010-11-16';' in sandbox. eg. pgid 7990 -> WBInteraction0051961

to future objects

  • a daily cronjob assigns WBInteraction IDs to objects in the OA when all of the following hold:
  • 1. Curator exists and it's anyone besides Arun
  • 3. Interaction ID field is BLANK
  • 4. Interaction Type field OR Non_directional field has a value
  • 5. There's (an Effector Gene OR an Effector Variation OR an Effector Transgene) AND (an Effected Gene OR an Effected Transgene OR an Effected Variation)
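
A minimal sketch of the eligibility test described by the conditions above, written in Python for illustration only (the real cronjob is the Perl script assign_interaction_ids.pl). The field names follow the int_ tables mentioned on this page; the function name and the `excluded_curators` parameter are hypothetical, and the curator ID to exclude for Arun is not given here, so it has to be passed in.

```python
def needs_interaction_id(row, excluded_curators):
    """True if a row (dict of int_* fields) should be given a new WBInteraction ID."""

    def filled(field):
        return bool((row.get(field) or "").strip())

    # 1. A curator exists and is not in the excluded set.
    curator_ok = filled("int_curator") and row["int_curator"] not in excluded_curators
    # 3. The Interaction ID field is blank.
    id_blank = not filled("int_name")
    # 4. Interaction Type OR Non_directional has a value.
    typed = filled("int_type") or filled("int_nondirectional")
    # 5. Any effector (gene, variation, transgene) AND any effected.
    effector = any(filled(f) for f in ("int_geneone", "int_variationone", "int_transgeneone"))
    effected = any(filled(f) for f in ("int_genetwo", "int_variationtwo", "int_transgenetwo"))
    return curator_ok and id_blank and typed and effector and effected
```

Mixed combinations, such as an effector gene paired with an effected variation, pass condition 5 in this sketch.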

( OBSOLETE

  • 5.a There's an Effector Gene AND an Effected Gene OR
  • 5.b There's an Effector Variation Name AND an Effected Variation Name OR
  • 5.c There's an Effector Transgene AND an Effected Transgene
    • sorry, I have to make a change. Regarding 5 above, it can be any combination of Effector Gene/Variation/Transgene_Name and Effected Gene/Variation/Transgene_Name. That is, as long as there is a gene, variation, or transgene_name in effector, and a gene, variation, or transgene_name in effected, the entry should be assigned an ID. -X 12/08
    • it should be 'any of the 3 tor + any of the 3 ted', meaning '(tor variation OR tor transgene OR tor gene) AND (ted variation OR ted transgene OR ted gene)'. -X changed -- J )

when using OA for curating new object

  1. if "new" is clicked, it will not create nor assign an Interaction ID.
  2. If "duplicate" is clicked it will not create an Interaction ID, but if there was one already, it will duplicate it. So if you want a new one created, you should either do it yourself through interaction_ticket.cgi (the most unlikely case), or delete the duplicated ID and wait for the cronjob to create and assign a new one.

script is on mangolassi at /home/acedb/xiaodong/assigning_interaction_ids/assign_interaction_ids.pl TODO on tazendra: copy the cronjob that assigns interaction IDs (/home/acedb/xiaodong/assigning_interaction_ids/assign_interaction_ids.pl) and add it to the crontab as: 0 4 * * * /home/acedb/xiaodong/assigning_interaction_ids/assign_interaction_ids.pl

start ticket issuer from 0500001 for new entries

  • don't fill the gaps
  • all 932 objects without ids will get id starting from 0500001
  • TODO on tazendra: set the value of int_index to 0500000 so the next one is 0500001, and update interaction_ticket.cgi -- J
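
As a small illustration of the numbering rule above, here is a Python sketch of a ticket issuer that only ever increments a stored counter, so gaps below it are never filled. The function name is hypothetical; the actual counter lives in int_index and is used by interaction_ticket.cgi.

```python
def next_interaction_id(last_index):
    """Return the new counter value and the WBInteraction ID built from it.

    The counter only moves forward, so unused numbers below it are never reused.
    """
    new_index = last_index + 1
    # IDs are WBInteraction followed by 7 zero-padded digits.
    return new_index, f"WBInteraction{new_index:07d}"
```

Starting the counter at 500000 makes the first issued ID WBInteraction0500001.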

assign IDs for interaction OA and gene regulation OA on the fly 3/23/2012

  • assign the IDs directly from the code in the OA, which means the OA code and the ticket form code are separate, but serving the same function. So if we ever change the way IDs are assigned on either the ticket form or the OA, we need to change the other form as well.

How the OA dumper works

What dumper does

  • For all distinct names in int_name, get all joinkeys. For each of those joinkeys :
  • If Non_directional tag is on, dump with Non_directional tag geneone, genetwo
  • If Non_directional tag is off, dump as Effector geneone, Effected genetwo.
  • If there are genes, dump as :
    • Interactor <gene>
  • If there are Variations, convert to WBGene, and dump as :
    • Interactor <gene> Variation <Variation object>
  • For all the <gene> objects above from the gene or converted transgene / variation, dump as :
    • Interaction_type <type> Effector|Effected|Non_directional <gene>
  • For dead genes: when dumping the 'Interactor <gene>' and 'Interaction_type <type> Effector|Effected|Non_directional <gene>' lines, a gene found in %ginDeadMap is replaced, and for a gene found in %ginDead the dumper adds Remark\t\"$g1 is killed or retired\"\n -12/15/2012 X
  • For all the int_rnai : Interaction_type <type> Interaction_RNAi <rnai> Does this make sense as a subsection of <gene>, or is it unrelated ? -J it is unrelated to gene -X
  • For all the int_phenotype : Interaction_type <type> Interaction_phenotype <phenotype> Same it is unrelated to gene. -X
  • For variation, If a variation exists and has no gene mapping, you could get the error twice. 1- variation doesn't map to gene. 2- if there's no geneone nor transgeneonegene, you get the error again for not having a geneone. then it does the same for genetwo. -X
    • get variation->wbgene mappings from term info, if multiple genes use all wbgenes. Data comes from postgres table obo_data_app_variation, which gets updated by Karen once a month when new WS releases.
    • for a variation that can't be mapped to any WBGene, suppress the whole object until a WBGene is found. This happens because these variations may not yet be associated with a WBGene by Mary Ann. These can be found in the error file as '9958 has invalid effected variation WBVar00090923, WBVar00296473 no WBGene mapping'. -X
    • for variation, if no corresponding WBGene, the variation will be attached in the remark field as 'Additional/Effector/Effected interactor allele <variation>'. and it will write an error message. -X 12/16
  • For transgene, there is no need for any mapping. Use 'Effector/Effected Transgene Gene' (entered by the curator in the OA field) for the 'Effector/Effected' tag in .ace, like 'Effector <WBGene>'. There is no need to map Transgene Name to WBGene. For interactors in .ace, use the same <WBGene> as the effector/effected <WBGene>, followed by the Transgene Name, as "<WBGene>"<spaces>Transgene<spaces>"<Transgene_Name>". If you take a look at pgid 10217, you may understand what I mean. This is a fake object that I made very complicated to try to cover all situations for testing. That's great ! -J In the 'Effector Transgene Gene' field, I entered 'WBGene00003024'. This is the gene that, when dumped, I want to go in the Effector tag, as 'suppression Effector "WBGene00003024" ', and in the interactor tag, as 'Interactor "WBGene00003024" Transgene "iaIs32"'. Instead, your current dumper (in mangolassi, /home/acedb/xiaodong/oa_interactions_dumper) dumps 'suppression Effector "WBGene00001851"' and 'Interactor "WBGene00001851" Transgene "iaIs32"', which mapped iaIs with gene WBGene00001851 and used it for the effector and interactor genes. -X Got it, should be fixed now. When we'd changed the .ace parser, we never fixed the dumper, so that's a great catch -- J It is fixed now. I checked the dumper, transgenes are dumped correctly now. thanks! -X Effector/Effected Transgene_gene should be considered as geneone and genetwo. As long as there are Transgene Genes, the dumper should not complain about a missing geneone/two. -X
    • For the Effector/Effected Other Type and Effector/Effected Other (text) fields, the fields will be attached to the remark as "Additional/Effector/Effected interactor other type (chemical/transgene) text". "Additional" is chosen if the object is non-directional, and Effector/Effected is chosen if directionality is applied from the otherone or the othertwo. The Other Type is optional. -X 12/16
    • write to the errorfile if there is Effector/Effected 'Transgene Name' but no Effector/Effected 'Transgene Gene', and vice versa. Plus, if there's a transgenegene, but not a transgenename, it skips writing any transgene (nor its gene) data to the .ace file. -X
  • It looks for geneone, transgeneonegene, and the genes that map to a given variation one. If it has any of those, there's no error. if there's none of those, there's an error. It does the same for genetwo. -X
  • Also dump paper (check paper is valid, write to errorfile if not). -X
  • Also dump remark
  • object ID comes from int_name field.
  • brief q&a
    • Where to get transgene/variation to wbgene mappings ?-J
      • transgenes can be from transgene OA, and variation should be from the same source as phenotype OA -X
    • Transgene OA fields Name and Gene, or also Driven_by_Gene or any other fields ? Where in the Phenotype OA for variation ? -J
    • transgene OA field Name. Phenotype OA has a 'Variation' field.-X
    • I understand the transgene and variation part for the transgene OA and the phenotype OA, but I don't know which fields to get the WBGenes from, to get the mappings of Transgene->WBGene and Variation->WBGene, see previous comment about Gene / Driven by Gene / ? for transgene -J
    • now I know what you mean. In interaction OA, please add two new fields 'driven_by_gene' and 'gene' under Effector_Transgene and Effected_Transgene respectively. Now don't worry about mapping effector_ and effected_transgene to gene; you will use 'driven_by_gene' or 'gene' to map the WBGene. Curators will be responsible for filling in either the 'driven_by_gene' or the 'gene' field. Usually only one field is filled; if both are filled, they are the same gene. Don't worry about it. The curator makes the call of what to fill in these fields after reading the information in the term info box for the transgene. You just use whichever field is filled to map the WBGene. It might be confusing. I will explain it better when you are around.
    • In interaction OA, there should be THREE fields related to Effector/Effected transgene: Transgene, which is for the transgene id and is now missing in the OA; Transgene_Promoter, which is a WBGene matching Driven By Gene in the transgene OA and is now called 'Transgene Drive by Gene' in interaction OA; and Transgene_Product, which is a WBGene matching Gene in the transgene OA and is now called 'Transgene Gene' in interaction OA. Currently, interaction OA in sandbox has only the latter two fields. Could you please add back the first, the Effector/Effected Transgene field? -X issues are resolved. refer to another wiki: http://wiki.wormbase.org/index.php/Gene_Interaction and OA request in bitbucket discussion. -X Please clarify the previous description since it doesn't really match the OA. I can change the name of the fields, but it's not clear in this wiki what they should be as there's discussion. Should probably move stuff from this wiki to other wiki, or from other wiki to this wiki. Link between wikis is not very prominent in the middle of a q&a. I'm not doing any work on this now since I can't tell if I should do anything. Let me know what I should do -- J I have copied the following section 'Tab 1' from the other wiki, and much of our related discussion was going on in bitbucket OA requests issue #7. I listed the fields that need to be autocompleted as well. Please make the OA accordingly and let me know. I will test the OA then. -X Are the fields right now labeled as you want them and with the data that you want ? If not, then please point out what's not correct. We can't make the fields autocomplete until we switch from Phenote format to OA format. According to the overall plan at the top, right now we're on the steps of testing dumps ; then adding .ace data and testing dump ; then assigning IDs (and testing dumps) ; and only then can we change from Phenote format to OA format, at which point we rewrite the OA from being a viewer of what's in postgres to a proper OA-data editor. -- J The fields should be labeled and ordered as they are in the following paragraph Tab1. I agree, and think they are, if they're not, tell me what's wrong -- J You should go ahead adding .ace data (by that, do you mean phenote .ace data?) and testing dump (by that, do you mean test .ace dump for future citace upload?) - X I mean the stuff under the section for "parsing WS220 interaction objects into postgres" Did we ever read that data in, or did we just keep going back and forth about it ? We haven't touched the parser since Oct 5, but the To_Read file was changed on Oct 28, what's been happening ? -J we have read the WS220 in, and fixed the errors back and forth a few times, and also on Oct 28, I think, we fixed errors after I came back from my trip. we should read in again now. -X perhaps we read them to find errors, but have not written them into postgres ? If they were in postgres you should be able to find them querying the phenote-like OA, did you ever do this ? -- J I was wrong. maybe we never got WS220 in. anyway, we have data from phenote and my textpresso data in now as of 11/10 morning, and are going to read in WS220 now. -X
    • I've written the .ace dumper, but it's up to you to test it. I'm referring to the section under overall plan, if it's not clear you should either clarify it there, or have an expanded section where it's clear. The .ace dumper is mentioned further down, search for use_package.pl . -J I will test the dumper again after you read in WS220. - X
    • When you tell me that the data in postgres is okay (you can query it through the phenote-like OA) and the dumper is okay, we can move on to the assigning IDs section. The WS220 .ace -> phenote script is on tazendra. The ggi -> phenote script is on mangolassi. We should decide if we want them both on live or sandbox. -- J
  • Tab 1
    • PGID
    • Interaction ID generated automatically per entry
    • Non_directional toggle off (default), means effected/effector directional, on (color change by click) means non_directional. actually, by default it's off, meaning it's directional. if you want it "on" by default, we should switch the tag to be Directional, then it would be off by default meaning it's Non_directional - J yes, you are right. I changed the text above. -X
    • Interaction Type dropdown list with the 11 types shown in the .ace template
    • Effector Gene autocomplete WBGene, multiontology. corresponding to interactor in .ace file. order does not matter when dump to interactors
    • Effector Variation WBGene, WBVar, autocomplete multiontology on variation, store in separate lines ->.ace, Interactor "WBGene" Variation "WBVar". use name server to map variation to gene, or the file Karen gave you to map variation to gene for variation OA.
    • Effector Transgene_Name autocomplete name, ontology
    • Effector Transgene_Gene autocomplete WBGene, multi-ontology, ->.ace, Interactor "WBGene" Transgene "id". In case of multi genes, WBGene is followed by same transgene id.
    • Effected Gene autocomplete WBGene, multiontology, corresponding to interactor in .ace file. order does not matter when dump to interactors Note that the values now are Effector first, Effected second. Earlier you said this was okay, as it's the way that the data was populated before, gene1 / gene2 being what the data had before, and gene1 was effector and gene2 was effected -J have changed the order in this wiki. -X
    • Effected Variation WBGene, WBVar, multiontology, autocomplete on Variation, store in separate lines ->.ace, Interactor "WBGene" Variation "WBVar". use name server to map variation to gene, or the file Karen gave you to map variation to gene for variation OA.
    • Effected Transgene_Name ontology, autocomplete name Transgene object names ? yes, transgene object name, eg iaIs3. -X
    • Effected Transgene_Gene multi-ontology, autocomplete WBGene->.ace, Interactor "WBGene" Transgene "id". In case of multi genes, WBGene is followed by same transgene id. One wbgene for each .ace line ? Make sure you really want it this way, we can go with product/promoter if that's what you want, just make sure it's what you want. It matters having extra fields and scrolling and so forth. You'll see when the text fields become multi-ontology and ontology. DONE: when this is confirmed as final, recreate the tables and rename them -- J renamed the transgene-gene tables int_transgeneonegene int_transgenetwogene -- J Yes, one WBGene per .ace line. In case of multiple genes, they are followed by same transgene name. -X
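
The gene and variation rules listed under "What dumper does" can be sketched as follows. This is a simplified Python illustration, not the real get_interaction_ace.pm: the function and parameter names are hypothetical, transgenes, RNAi, phenotypes and papers are left out, the line layout follows the .ace entries quoted on this page (type tag followed by Effector / Effected / Non_directional), and an unmapped variation is handled by the later rule (remark plus error) rather than by suppressing the object.

```python
def dump_interaction(obj, var_to_genes, dead_map=None, dead=None):
    """Return (.ace lines, error messages) for one interaction object."""
    dead_map = dead_map or {}   # dead gene -> gene it was merged into
    dead = dead or set()        # genes that are dead with no replacement
    lines = [f'Interaction : "{obj["name"]}"']
    errors = []
    itype = obj["type"]
    nondirectional = obj.get("nondirectional") == "Non_directional"

    def resolve(gene):
        # Replace merged genes; flag fully dead genes in a Remark.
        gene = dead_map.get(gene, gene)
        if gene in dead:
            lines.append(f'Remark\t"{gene} is killed or retired"')
        return gene

    for role, side in (("Effector", "one"), ("Effected", "two")):
        tag = "Non_directional" if nondirectional else role
        genes = [resolve(g) for g in obj.get(f"gene{side}", [])]
        for gene in genes:
            lines.append(f'Interactor "{gene}"')
        for var in obj.get(f"variation{side}", []):
            mapped = var_to_genes.get(var, [])
            if not mapped:
                # No WBGene for this variation: keep it in a remark and log it.
                label = "Additional" if nondirectional else role
                lines.append(f'Remark\t"{label} interactor allele {var}"')
                errors.append(f'{obj["pgid"]} has variation {var} with no WBGene mapping')
            for gene in mapped:  # a variation may map to several genes; use all
                gene = resolve(gene)
                lines.append(f'Interactor "{gene}" Variation "{var}"')
                genes.append(gene)
        for gene in genes:
            lines.append(f'{itype} {tag} "{gene}"')
    return lines, errors
```

Whether a fully dead gene should still be written to the Interactor and type lines is not settled in the notes above; this sketch keeps the line and only adds the remark.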

Note: Gene, Variation, and Transgene_Gene all refer to different genes. There is no pairing problem.

For variation mapping to WBGene, in phenotype OA, when you enter a variation, the WBGene shows up in term info; you can use the same mapping for interaction OA. -X I don't understand this. If the .ace file is going to dump them in the same tag, why split it up into gene and driven_by_gene ? this is a little complicated. A transgene is constructed from two parts: driven_by_gene, usually the promoter of a gene, followed by gene, usually a gene product, GFP or other markers. An interaction can happen between one gene product and another gene promoter (driven_by_gene) or gene product (gene) in a transgene. Curators have to make the decision after reading more information, and fill in one of the boxes if driven_by_gene is different from the gene product (gene), or both boxes if they are the same. You will use whichever box is filled to map the WBGene. -X Hm, I think we should talk in person tomorrow after the meeting, since I think we're changing a big piece of how this works, is that okay ? --J sure. we will talk then. -X Also, you never answered about whether we need effector/effected for variation and transgene, since they don't reflect in the .ace file (see below in Q&A) -- J sorry, I didn't look carefully earlier. answered in Q&A now.-X Also, not really for variation, if you see WBVar00000085 for ad1110, it has multiple genes -- J in this case, you will have to dump all associated genes in interactors, effector/effected. can you?-X sure, this seems odd to me, but okay -- J

OA dumper script is located on sandbox

/home/postgres/work/citace_upload/interaction/20101024_oa/get_interaction_ace.pm

call by running : /home/acedb/xiaodong/oa_interactions_dumper/use_package.pl

it creates two files: 'err.out.20101117' and 'interaction.ace.20101117'

Note : there are many papers already invalid. You can find what they should be by going to the paper editor and querying for the 8 digits under identifier. Then when the OA is live you can change the paper IDs.-J will do. thanks. -X There are 166 invalid papers in the file 'err.out.20101024', in the same location as the dumper. To run the dumper, go to /home/acedb/xiaodong/oa_interaction_dumper and type ./use_package.pl; the output file will be named 'interaction.ace.DATE_OF_RUNNING'. The dumper needs to be fixed by deleting the 'id' tag. -X okay, I've removed it for now, but it'll be harder for you to find the pgid to tell me which entry is wrong if we need to fix something about it. just query it through the OA, I guess --J

This will take some time to go back and forth with, and testing the dumper works, so I'll wait until after the Japan meeting so that I can have it fresh in my mind as we go back and forth with it. --sure, you can have a break when I am away. -X done

created int_transgenedrivenone and int_transgenedriventwo for transgenes driven by genes. Note that there is no data in these tables since all transgenes have gone into the transgeneone and transgenetwo tables. To test this, add some data in the sandbox.

Dumper now dumps variations and transgenes, but all objects with this data don't have Interaction IDs, so have added fake IDs to pgid 7990 and 8024. Note that 8024 doesn't have a genetwo, which seems wrong, is it ? Karen deleted object 8024 with only one gene. I also tested dumped .ace file. It was read into Acedb smoothly without errors. -X

done: the data for the variations is variation names instead of WBVar IDs so have temporarily made it work off of names, but we should change the data to WBVar IDs later. -- J

To test more entries, query by variation / transgene, add Interaction IDs and test the dumper. Also add transgene driven by gene and test those. I will ask Karen to test on this, since she uses transgenes for interaction the most. -X

Questions and Answers

Please highlight new questions as they're made, and un-highlight the questions as you answer them.

.ace dump to phenote tables questions

  • Q 1 : What to do with the errors from "parse_ace_interaction_phenote.err".

Interaction : "WBInteraction0051152" and Interaction : "WBInteraction0051544" both have only 1 Interactor Gene because the RNAi used is against the same gene as the variation. For example, in one case daf-5 RNAi was used in the daf-5 background. So when the original .ace file was written for AceDB it said Interactor "WBGene00001865" Interactor "WBGene00001865"; when read in and dumped, only Interactor "WBGene00001865" comes out, as they are the same. I looked up both these papers and in both cases these are not really interactions, so both "WBInteraction0051152" and "WBInteraction0051544" can be deleted.

Going forward, I don't know if we will have cases where the RNAi and the background are the same in terms of interactions, but I assume the OA will be able to deal with this. (Curators: sometimes people do RNAi of a gene in the background of that gene, for example daf-5 RNAi in a specific daf-5 background, but this is usually done to see the strength of the allele and has nothing to do with interactions.)

J questions :

Q : found a typo in the script, Epistatis should be Epistasis. Andrei's data didn't have that entry, but there are entries through phenote (probably ?) that say Epistatis, while the model says Epistasis. Should I change the entries already in postgres to Epistasis ? I've changed the parsing-in script in case I forget later.

A : changed to Epistasis in parser and andrei int_ data

Q : Entries like this one are not getting the genes from the Non_directional subtag, it's populating the geneone, genetwo, geneextra from the Interactors.

  • Interaction : "WBInteraction0009116"
  • Interactor "WBGene00007103"
  • Interactor "WBGene00003055" Variation "WBVar00143019"
  • Epistasis Interaction_RNAi "WBRNAi00077096"
  • Epistasis Non_directional "WBGene00007103"
  • Epistasis Non_directional "WBGene00003055"
  • Epistasis Interaction_phenotype "WBPhenotype:0000022"

A : change parser to look at effector/effected/non_directional for genes, and look at Interactor only for Variation / Transgene.

Q : Also, I don't think we have the directionality marked anywhere in postgres, do we ? Can you see it anywhere in the OA ? If so, does that matter for the dumper, or can we infer it from something ? If not, many entries in postgres are not going to have directionality dumping out. I think Andrei wanted to infer it from the 'type', but I think you guys said that wasn't good.

A : for interaction data already in postgres, set the non-directional tag to OFF (directional) if the type is one of :

  • Epistasis
  • Suppression
  • Enhancement
  • Regulatory

Q : Andrei also has Regulatory in this list. Do we want that ?

A : yes, we will want 1440 regulatory interactions currently in OA table (phenote table). All his regulatory objects are directional (effector/effected). so you should set the non-directional tag to OFF (directional) if the type is 'regulatory' as well. (got it, will keep -- J)
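
A minimal Python sketch of this rule for the data already in postgres. The function name is hypothetical, and marking every other type as Non_directional is an assumption; the answers above only say which types are set to OFF (directional).

```python
# Types whose existing entries are treated as directional (effector/effected).
DIRECTIONAL_TYPES = {"Epistasis", "Suppression", "Enhancement", "Regulatory"}


def nondirectional_value(interaction_type):
    """Value to store in int_nondirectional for an existing entry.

    Blank means the tag is OFF (directional). Assumption: any type outside
    DIRECTIONAL_TYPES is stored as Non_directional.
    """
    return "" if interaction_type in DIRECTIONAL_TYPES else "Non_directional"
```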

Q : This is what is in int_geneextra :

  • 6670 | WBGene00023498 | 2008-08-29 13:36:07.218582-07
  • 6671 | WBGene00023498 | 2008-08-29 13:36:07.277077-07
  • 6672 | WBGene00023498 | 2008-08-29 13:36:07.334121-07
  • 6673 | WBGene00023498 | 2008-08-29 13:36:07.394131-07
  • 6674 | WBGene00023498 | 2008-08-29 13:36:07.463157-07
  • 6675 | WBGene00023498 | 2008-08-29 13:36:07.543139-07
  • 6676 | WBGene00023498 | 2008-08-29 13:36:07.619084-07
  • 7379 | WBGene00023498 | 2008-08-29 13:36:43.627074-07
  • 7420 | WBGene00006870 | 2008-08-29 13:36:44.373073-07
  • 7421 | WBGene00006870 | 2008-08-29 13:36:44.431462-07
  • 7987 | WBGene00023498 | 2008-08-29 13:37:08.815725-07
  • 7989 | WBGene00004224 | 2008-08-29 13:37:08.92307-07

A : X will deal with these manually when OA is live.

change parser of .ace -> phenote tables. We need to put those as multiple genes into either geneone or genetwo as appropriate (phenote format with pipes, because the phenote to OA script has to check if those are valid WBGenes anyway)

Done for Effector / Effected but not for interactor. Use Effector / Effected / Non_directional for geneone, genetwo (for nondirectional entries with multiple genes, put the first in geneone and the rest in genetwo). Use Interactor for getting Variations and Transgenes.
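
The parsing rule just described can be sketched as below. This is a Python illustration rather than the actual parse_ace_interaction_phenote.pl; the function name and the returned keys are hypothetical, and the regular expressions assume lines shaped like the WBInteraction0009116 entry quoted earlier.

```python
import re

# <type> Effector|Effected|Non_directional "<WBGene>"
TAGGED = re.compile(r'^\S+\s+(Effector|Effected|Non_directional)\s+"(WBGene\d+)"')
# Interactor "<WBGene>" Variation|Transgene "<object>"
INTERACTOR = re.compile(r'^Interactor\s+"WBGene\d+"\s+(Variation|Transgene)\s+"([^"]+)"')


def parse_entry(lines):
    """Collect phenote-style fields from the lines of one .ace interaction entry."""
    genes = {"Effector": [], "Effected": [], "Non_directional": []}
    extras = {"Variation": [], "Transgene": []}
    for line in lines:
        line = line.strip()
        tagged = TAGGED.match(line)
        if tagged:
            genes[tagged.group(1)].append(tagged.group(2))
            continue
        interactor = INTERACTOR.match(line)
        if interactor:
            extras[interactor.group(1)].append(interactor.group(2))
    nondir = genes["Non_directional"]
    return {
        # Multiple genes are stored phenote-style, separated by pipes.
        "geneone": "|".join(genes["Effector"] + nondir[:1]),
        "genetwo": "|".join(genes["Effected"] + nondir[1:]),
        "nondirectional": "Non_directional" if nondir else "",
        "variations": extras["Variation"],
        "transgenes": extras["Transgene"],
    }
```

Plain `Interactor "<WBGene>"` lines without a Variation or Transgene are deliberately ignored here, since the genes come from the Effector / Effected / Non_directional lines.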

Q : There are entries with Variation tag, but no data, please fix if needed, or let me know if they're okay. (the variation tag is being ignored for those)

  • Interactor "WBGene00003970" Variation
  • Interactor "WBGene00018572" Variation
  • Interactor "WBGene00001515" Variation
  • Interactor "WBGene00001515" Variation
  • Interactor "WBGene00001515" Variation

A : --Juancarlos has fixed these both in 'To_read.ace' and 'WS220_smallscale.ace' source files. -X

Q : changed parser to set directionality as Non_directional if it has that tag, and <blank> if it has effector / effected (tables are created, they always populate one value or the other). Some entries have both types, see .err file on tazendra.

A : Changed manually into both To_Read and SmallScale files.

Q : Have changed parser to use Effector | Effected | Non_directional for genes, and to get Var and Transgene from Interactor. Have put the first nondirectional gene in geneone and the rest in genetwo. There's a new error in the .err file about one entry that has two effected and no effector. I don't know why it didn't come up before. It comes from RNAi.

A : Gary is out of office today. I will write him a message to have him check on this object and get back to you.

Great, thanks. From his email I've changed that entry to be Suppression Non_directional but re-asked to confirm and emailed him the full entry in case it needs to be revisited. The parser now has no errors and seems ready to read data into postgres. We can do it now, or when you come back and have more time to check the entries, just let me know. The .ace dumper is not ready, so we can't test the dump results. -J

Q : Does it make sense to have pairs of variation one/two transgene one/two if in the dump they only come out as Interactors instead of effector/effected ?

A : yes. in case of non_directional interaction, no effector/effected dumped, only interactors. variation/transgene is always associated with interactors, it doesn't need to come with effector/effected. -X

  • this doesn't make sense to me, if you're saying they don't need to come with tor/ted, why is the answer yes, to whether we should have one/two pairs ? -- J
  • Maybe I misunderstood your question at first. we don't need them in one/two pairs for interactors. -X got it -J
  • NOTE: now that we have a transgenedrivenone and transgenedriventwo (for effector / effected transgenes driven by gene as opposed to just gene), that all the old transgeneone / transgenetwo data was in that set of tables, and none was in driven_by_gene. TODO: Karen will deal with these objects with transgenes manually later when OA is alive. -X I see this but imagine you want it still bold to remember she has to do it ? - J yes, you are right. I wanna remind myself to let her do it when OA can be used. -X

textpresso -> phenote tables questions

Q : Parsing ggi to int (textpresso based to phenote format), I see the data looks like this :

  • joinkey | ggi_paper_sentence | ggi_gene_one | ggi_gene_two | ggi_interaction | ggi_timestamp
  • 1 | WBPaper00028425 : 244 | age-1 | daf-2 | Regulatory | 2009-10-05 12:08:56.777921-07
  • 1 | WBPaper00028425 : 244 | daf-2 | age-1 | Regulatory | 2009-10-05 12:08:56.806375-07

I imagine that means that it's two separate interaction objects from the same sentence, where both genes regulate each other ?

A : yes, each entry is its own interaction object and should get a unique ID.

Q : If that's so, are there any entries where there would be multiple effector or effected genes, and if so, would they be in the same cell (like age-1, age-2 ; this is not what the data reflects), or would they be in separate lines, and if they're in separate lines, how do I tell those apart from entries with the same paper_sentence / joinkey that should be separate interactions ?

A : There are not, all entries are in a single row.

Q : There are 819 entries with No_interaction, in this case, we don't record anything, correct ?

A : The form mistakenly didn't allow genes with No_interaction, so we have 819 entries that should have a gene but don't. We'll ignore them now, but in the OA in the future we'll store genes for No_interaction.

Q : We only care about the WBPaper, not the sentence because the sentence number could change on a textpresso re-markup of the corpus, right ?

A : For Xiaodong, store the paper, for Michael store the full sentence. Probably by making the sentence ID a combination of filename - WBPaper - sentence number, then having a file with that filename with WBPaper - sentence number - sentence data. Revisit this when writing OA to see if it makes more sense to store sentences in postgres. For Xiaodong, show sentence in Term Info. In dataTable either show sentence ID or sentence, either is good.

Q : If a paper is a supplement (e.g. WBPaper00035201.sup.1 : 23 ) we just store the paper right ? (it's an ontology, we can't store that it's in the supplement unless we put it a remark field or something)

A : store paper under paper field, keep whatever matches in the sentence ID field int_sentid

Q : Just to make sure, none of these entries have an object also created from RNAi or some other source, right ? I mean, we're not creating duplicate objects in doing this ?

A : correct.

Q : Again, there's no place that has marked Non_directional or the lack of it, does this matter ?

A : same as .ace stuff, if Epistatic, Suppression, Enhancement, then make it <blank> else make it Non_directional.

Q : Parsing ggi -> int tables, have found some errors, see them in mangolassi at /home/postgres/work/pgpopulation/interaction/20101005_ggi_to_int/move_ggi_to_int.err

genes that don't match may need an a or b after the sequence.

A: two genes in err file have the correct sequence names in wormbase:

F55F8.2 -- WBGene00018890

M01E5.5 -- WBGene00006595

Q : I've manually added those but still have 5 other errors

A : some of genes ARE in wormbase, I wonder why your script didn't match them:

  • T27A3.1 -- WBGene00020838
  • ZC97.1 -- WBGene00022516
  • F11A3.2 -- WBGene00008670
  • Y67D2.1 -- WBGene00022051
  • sprgenes, should be spr-2 -- WBGene00005007 it's weird why 'sprgenes' got picked up from the beginning. -X
  • odd, I've added manual exceptions to those for now. might be that the sandbox is not up-to-date, but since this data set is static, it's okay with the manual exceptions --J Thank you. -X

Q : Some papers don't match because they have been merged into other papers, I'm using the paper they've been merged into, but if you'd rather get an error message here, let me know.

A: thanks. I don't need an error message as long as they get merged into new paper ID automatically.

Q : There are two types not accounted for "Interaction" and "Other_Genetic"

A : they are vague type anyway. please remove those data. thanks. we won't have them in new OA.

Done, I'm excluding all entries where the type is either No_interaction, Other_Genetic, or Interaction. Note that this is probably what made the errors above go down to 5

Q : Here's what the move_ggi_to_int.pl parsing script is doing, please note what it does and appropriately move to its own wiki page or somewhere else in this wiki

  • getting data from ggi_gene_gene_interaction
  • convert locus / sequence to wbgene by stripping spaces and comparing to gin_sequence, gin_synonyms, gin_locus (errors if no match)
  • prepending "xiaodong001 : " to sentence ID (this might get replaced later when we figure out how to deal with sentence IDs).
  • getting the paper by capturing match of sentence ID to /WBPaper\d+/ comparing that to pap_status = 'valid' and to pap_identifier to get appropriate paper (if this is not what we want, let me know) (error if no match)
  • match type to Regulatory Suppression Enhancement Epistasis for not-Non_directional, Genetic No_interaction Predicted_interaction Physical_interaction Synthetic Mutual_enhancement Mutual_suppression for Non_directional. (error if no match).
  • always write curator (WBPerson1760) geneone genetwo sentid type nondirectional (even if blank, as opposed to leaving no entry nor entering NULL) paper
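The paper and type steps listed above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl of move_ggi_to_int.pl: the function names are hypothetical, and the follow-up lookup of the captured paper against pap_status / pap_identifier is left out.

```python
import re

# Types the list above treats as directional (nondirectional stays blank).
DIRECTIONAL = {"Regulatory", "Suppression", "Enhancement", "Epistasis"}

# Types the list above treats as Non_directional.
NON_DIRECTIONAL = {
    "Genetic", "No_interaction", "Predicted_interaction",
    "Physical_interaction", "Synthetic",
    "Mutual_enhancement", "Mutual_suppression",
}


def paper_from_sentence_id(sentence_id):
    """Capture the WBPaper ID from a sentence ID, or None if absent."""
    match = re.search(r"WBPaper\d+", sentence_id)
    return match.group(0) if match else None


def nondirectional_value(interaction_type):
    """Return the value for the nondirectional field for a given type.

    Blank for directional types, 'Non_directional' otherwise;
    an unknown type is an error, as in the parser description.
    """
    if interaction_type in DIRECTIONAL:
        return ""
    if interaction_type in NON_DIRECTIONAL:
        return "Non_directional"
    raise ValueError(f"unknown interaction type: {interaction_type}")
```

For a supplement sentence ID such as "WBPaper00035201.sup.1 : 23" the capture keeps only the paper part, which matches the decision to store the paper in the paper field and the full match in the sentence ID field.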

If we want anything else parsed in from textpresso-based gene-gene interactions, please let me know.

A : I have copied what parser is doing to 'Getting data into OA step by step' section too. parser looks good to me. One thing is that we decided to ignore the entries with 'No_interaction' type in current data set. see Q&A above.

That's great (feel free to delete this comment, maybe we need a V : verified tag after answers ?)

Q : In transferring the data, I'm using the timestamp the data was originally created in postgres. If you'd rather have the current timestamp instead, let me know (for querying or looking at stuff later)

A : I think we now agree to use the timestamp of when data got into postgres (see above, first section in 'Getting data into OA step by step')

  • transferred textpresso ggi_ data to int_ tables for the OA in the sandbox. used /home/postgres/work/pgpopulation/interaction/20101005_ggi_to_int/move_ggi_to_int.pl but will still need to do in live site, if everything looks okay.
    • genes showed up in numbers only without prefix 'WBGene', see pgid 8533, 8534, and 8535. Fixed - J
    • there is nothing shown in 'Remark' field. The remark is supposed to be 'Interaction data was extracted by a curator from sentences enriched by Textpresso. The interaction was attributed to the paper(s) from which it was extracted.'. I don't know which and where is the .ace source file you used to transfer data. The Remark field should be transferred as well into postgres table and then to OA. -X Fixed. Although it occurred to me that you might want a slightly different remark so that it's later easier to tell apart your entries from Andrei's. Or maybe not, up to you - J It is such a good idea. I would like the remark to be "Interaction data was extracted by Xiaodong Wang from sentences enriched by Textpresso. The interaction was attributed to the paper(s) from which it was extracted." -X Done --J

OA function questions regarding to ggi_textpresso pipeline

Q : Will you always go to the next textpresso sentence, or will you want to skip around by querying for a specific ID. If you want to query by ID, how should the ``next sentence'' button work, I mean, how do we determine what the ``next sentence'' is ?

A : Probably (revisit when writing code if it doesn't make sense) store sentences in some obo table (or maybe some generic table since we need to store both sentence_itself and list of genes that matched) for ggi_textpresso_sentences. Maybe store genes in one table and sentences themselves in a different table. For now show both sets in terminfo, but in the future show sentences in term info, and make extra fields textpresso-geneOne and textpresso-geneTwo, which are dropdowns that have only the genes that matched in textpresso. When one of those is selected it doesn't populate textpresso-geneOne/Two, it populates geneone or genetwo directly. So X will have to look at the dataTable to make sure the genes in geneone / genetwo are correct (move the columns so they're side by side and easier to look at). Do mapping of textpresso-matched values to WBGene at the moment of curation, in case the WBGene becomes obsolete or the locus is remapped to a different WBGene in between getting sentences from textpresso, and the sentence being curated by X. In those cases show a message that there was no match to a WBGene, and X will look up the correct gene and manually enter it into geneone or genetwo.

joinkey - data - timestamp format name WBPaper00001234 s123 data <whatever is in the sentence> X only wants to see the sentences, no mark up of genes, no capture of genes, so it can go in generic obo table.

Q: We need to at least talk about how to extract the sentences, I don't remember this. Also, was this a one-time thing ? Then do it again later on ?

A: Last batch of sentences were extracted in October, 2009. The sourcefile is at: /home/postgres/work/pgpopulation/genegeneinteraction/20091002-xiaodong/ggi_20091002 Basically, I do ggi_textpresso sentences by batch. so now I would like to get sentences from next paper after the last paper in 20091002 file to update. Once I finish this batch, I will get new batch again later on. -X You mean we should only use results from papers with ID higher than the ones we've seen, as opposed to look at all the paper IDs we've used and ignore only those ? -- J yes. only use results from paper with IDs later than the previous batch. -X
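The batch rule agreed above (only use results from papers with IDs later than the previous batch) could be expressed along these lines. This is a hedged Python sketch with hypothetical function names; it assumes each sentence ID contains a WBPaper ID whose numeric part can be compared, and it is not the script actually used for the batches.

```python
import re


def paper_number(sentence_id):
    """Return the numeric part of the WBPaper ID in a sentence ID, or None."""
    match = re.search(r"WBPaper(\d+)", sentence_id)
    return int(match.group(1)) if match else None


def filter_new_batch(sentence_ids, previous_ids):
    """Keep only sentence IDs from papers later than any paper in the previous batch."""
    seen = [paper_number(s) for s in previous_ids]
    highest = max((n for n in seen if n is not None), default=0)
    # Sentence IDs with no recognisable WBPaper ID are dropped.
    return [s for s in sentence_ids if (paper_number(s) or 0) > highest]
```

Note that this compares against the highest paper ID seen, not against the full list of papers already used, which is the choice X confirmed.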

Q: What do we store, I thought Michael said something about storing something about the version because the markup can change ? or is the date good enough, or having the paperID + full sentence results good enough ?

  • Michael is only interested in my results on if the textpresso extracted sentences are true or false positive. -X
  • Ok, so we'll store the WBPaper, the sentence number, and the sentence itself, like we have in the file. -- J
  • For me, that's all I need. Do you also need to store the results (true or false) for Michael? -X
  • I don't need anything =) If Michael wants something, he should tell you about it, so you can tell me what to do, so you should ask him to make sure everyone's getting what they need from the OA before we go live, although preferable before we work on what it does -- J
  • I talked to Michael. I don't think we need to set up anything special for him, as long as he knows how to query and sort out the results (sentences, true, false) in postgres later. -X 12/02

Q: do we color markup the tagged words ? like gene_celegans one color, regulation another color, some other tags ?

  • that will be nice. but you don't have to. -X
  • Please let me know which tags exist, and what colors to make them (I imagine it's more than just the ones in that example, but I don't recall) -- J
  • gene_celegans ( but not limited to celegans, genes can be from other species as well, right? as long as they are WBGenes), all 11 types we have in OA type field, including 'genetic', 'regulatory', 'no_interaction', 'physical_interaction', 'suppression', 'enhancement', 'synthetic', 'mutual_suppression', 'epistasis', 'mutual_enhancement'. -X
  • do you mean that each of those 11 words are xml tags like <enhancement>some word</enhancement> ? I need to know which xml tags to do anything with, otherwise I'll just display things however it looks in the text file, which the browser will probably suppress because it gets rid of xml markup -- J
  • actually, textpresso markup only has these 3 : gene_celegans, regulation, association. let me know what color you want each of those in the term info -- J
  • gene_celegans in RED, regulation in BLUE, and association in Yellow, please. -X
  • Note that these colours replace the xml tags before populating the term info table, so once the colour is set, it will be harder to change the colour in the future. The yellow is hard to see, but if you're okay with it, cool. -- J
  • Yellow is indeed hard to see. Could you please change it to dark GREEN?
  • Since we don't agree on what we see on pgid 10253 I don't want to repopulate until we work that out -- J
  • Besides, I don't think now in mangolassi, <association> is colored. See pgid 10253, 'C.elegans', which I believe is not an association lexicon, is marked. I don't see anything else in other sentences is marked yellow. Could you please check? -X 12/06
  • Here's what I see for that pgid, I don't see C.elegans at all.
  • it was my bad. I looked at it again, it's 'complex' but not 'C.elegans'. I think you marked it in the right color (yellow). it's just hard to tell the word. Please change the 'association' to dark GREEN. -X
  • done -- J (you can look at the source file at /home/postgres/work/pgpopulation/genegeneinteraction/20101130-xiaodong/new_ggi_20101130 and see if it's correct there but wrong on the Term Info, if it's wrong on the file then it's an extraction issue so check Arun's future batch)

  • sentence ID : WBPaper00035228.sup.1 : 3
  • sentence data : D . L . Updike and S . Strome 3 SI TABLE S1 RNAi of SF3b Complex Gene control SF3b5 / 10 SF3b125 SF3b3 / 130 SF3b2 / 145 SF3b14b SF3b1 / 155 Sf3b4 / 49 no homolog C46F11 . 4 phi-6 W03F9 . 10 phf-5 phi-11 sap-49 JA : T08A11 . 2 JA : C08B11 . 5 100 % 100 % yes yes JA : C46F11 . 4 JA : K02F2 . 3 8 % 98 % no yes Worm Homolog RNAi Clone empty vector Embryonic Lethality n = 50 6 % PGL-1 Phenotype no 4 SI D . L . Updike and S . Strome TABLE S2 RNAi of Nuclear Pore Components Target Diffuse PGL-1 in Embryonic Germline Cells ?
  • Gain-of-function <gene_celegans>let-60</gene_celegans> mutations <regulation>extend</regulation> the maximum lifespan of <gene_celegans>daf-2</gene_celegans> mutant animals , but it is not known whether this occurs in a <gene_celegans>daf-16</gene_celegans> dependent manner ( Nanji et al , 2005 ) .

Q: do you want data from supplements ? yes. -X

Q: do we want a sentID field that if you query on blank gives you the next sentence, instead of a ``Next Sentence'' button ? (it would be easier for me, I think, but if it will be confusing for you that you may click it when you don't mean to and get a sentence when you don't want to, that's not good) sentID field only is OK with me. I will remember to get next sentence by query on blank. -X that's great ! I'll work off of that then -- J

Q: what's the workflow then ? You either click the query on the field, where it's blank, so you get the next new sentence in the Term Info, then from it you figure out the paper ID, gene one, gene two, and type ; then you enter them into the appropriate OA fields ? sounds right. -X

Q: Do you still want the extracted genes listed, I'm not sure this makes sense since you can't pick them in a dropdown like before. but we can color mark them so you can see them to type in the autocomplete. I won't need genes list. I will choose the genes myself. color mark is helpful.-X Then I need to know what xml tags should become which colors -- J gene_celegans in RED, regulation in BLUE, and association in Yellow, please. -X

Q: Alternatively, if you want to query out a sentence you've already curated, you have to type into the sentence ID field. And this would be an autocomplete ontology field (on sentence ID, meaning WBPaper# : sentence_number) ? I didn't get you here. this is an alternative of what? -X Sorry, workflow alternative. If you don't query on blank, you'd query on a paper-sentence ID -- J

not a question : you can see all the sentences extracted now on tazendra at : /home/postgres/work/pgpopulation/genegeneinteraction/20101130-xiaodong/ggi_20101130 thanks. -X

1554 results were old WBPapers, 779 are new filtered results. now on mangolassi at /home/postgres/work/pgpopulation/genegeneinteraction/20101130-xiaodong/out -- J the results don't look right. only 779 new results is far fewer than expected. maybe due to the textpresso sectioning. X is discussing with Arun. -X 12/3

a lot of sentences don't look so good.. I don't know if it's within the standard false positive stuff. I just used the script we used to get the ggi_20091002 batch. many sentences in ggi_20091002 batch don't look good either. I think I should talk to textpresso people to have them improve the text mining or matching or else. I don't think it's your script problem. -X sounds good, I just wanted to make sure they were stuff you'd be okay working with (if only to mark as false positive). I've parsed on mangolassi, as mentioned above -- J

  • NOTE : after looking at the code, I no longer think this is a good idea, but I'm leaving it up because we talked about it, and we could potentially go back to this. We'll get textpresso sentences, and append them to two tables that store textpresso-ggi data, and which will be used to display term info and what the next sentence to display should be. maybe a ggi_sentid and a ggi_info table. data being sentid - "1", "WBPaper12345678 s123", info = "1", "The something <gene_celegans>pie-1</gene_celegans> <regulation>regulates</regulation> something else". The sentence ID field should store a ggi_id instead of the sentence ID, and use the ggi_sentid table to map the ggi_id to the sentence ID. The ggi_id is a numeric sequence of the sentences in the order they should be queried. Then when clicking query against blank, look at the highest ggi_id stored in int_sentid, and get the data for the next ggi_id. When querying against an int_id, show the corresponding ggi_info in the Term Info. X, essentially this means that what we're currently calling a sentence ID (WBPaper12345678 s123) is not going to be stored in the interaction table, we'll create and store a mapping of ggi_id-s which will be the sequential IDs so we know what the highest curated sentence is. The ggi IDs will be like the WBGene ID, and the sentence ID will be like the locus. So when you autocomplete in the editor it would say "1 ( WBPaper12345678 s123 ) " for the first sentence with ggi_id 1. Maybe we can put the data in a obo tables instead of special tables, but I'm not sure since the query blank would have a special non-generic function. -- J

I'm having a lot of issues with this right now. can't query on blank. it generally doesn't make sense to, so it's not allowed. we can probably work around that, but maybe we could have one of the ontology values be "new" instead, and query on that ? The problem then becomes that "new" would get assigned to the field instead of the actual ggi_id, which is wrong. Also if you curate an entry, you'd have to press the reset button each time to clear the table row selection and query for another new sentence, otherwise you'd be assigning "new" to the sentence you just curated, so you'd have to press reset for each sentence, then type "new" then click query, which is a lot of steps.

query buttons query data that is already existing. For NBP entries we deal with that by pre-populating the NBP table, then Karen queries by curator. Maybe we could do something like that instead ? Enter all the textpresso sentences into the interaction OA tables with curator being Arun (that's what Karen does for NBP), then you can query for 10 (or however many you want) Arun entries to get the most recent Arun sentences, which you can then curate and assign the curator to you. We agreed to do this NBP way. -X 12/02

if we do the above, you'd be getting sentences in reverse order, getting the most recent first, but we could deal with that by entering the sentences in backwards order, so that the newer papers go in first, so when you query them in backwards timestamp, you get the oldest papers first. Then we'd have to not enter more sentences until you were done with all of them. -J If I can, I would prefer the sentences in old-paper-first order. That is, I would like to curate sentences from old papers first. I would not mind that no more new sentences can be entered until I finish (as you stated in wiki), since I am doing the curation by batch anyway (twice a year). In other words, I will finish one batch (papers from 6 months) first and then do the next batch anyway. -X done -- J

a downside of doing it like this is that all the sentences would have an interaction ID assigned straightaway, and you'd have to remove it when you said it was a FALSE POSITIVE. probably we'd need a NODUMP button and set all new sentences to NODUMP, or change the dumper to not dump entries with curator == Arun (well, Michael, probably, Arun has nothing to do with this). the interaction ID would then be "lost" as a false positive that never gets dumped. Maybe this isn't bad if we don't care about a bunch of holes in the IDs anyway, and there's lots of IDs to use, and there aren't huge amounts of false positives from textpresso (hopefully just some 1000s per year) to treat the textpresso data the same way as the NBP data, meaning it gets assigned in the OA tables, is non-editable, and queryable by a fake curator. And it would be easier to just not assign an ID, that is leave that field blank. Then assign IDs to all 'not false positive / not flagged for NODUMP' character lines by the script before the .ace dump. And these IDs need to be assigned to the character line in postgres at some point. "we'll go with no IDs at all, and have a webform to assign IDs automatically when loaded, that will be called by a cronjob everyday, but could be called by a curator if they really wanted one right away. -J" -X
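As a rough illustration of the "no IDs at first, assign later" approach described above, the assignment step might look like the following Python sketch. Everything here is an assumption for illustration: the row dictionaries, the "name" and "nodump" keys, the function name, and the ID format (the WBInteraction prefix followed by a seven-digit counter). It is not the code of assign_interaction_ids.pl.

```python
def assign_ids(rows, last_index):
    """Assign sequential interaction IDs to rows that still lack one.

    rows:       list of dicts with an optional 'name' (interaction ID)
                and an optional 'nodump' flag (hypothetical layout).
    last_index: highest counter value already used.
    Returns the new highest counter value.
    """
    for row in rows:
        # Skip rows that already have an ID, and rows flagged as
        # false positive / NODUMP, so no ID is spent on them.
        if row.get("name") or row.get("nodump"):
            continue
        last_index += 1
        row["name"] = f"WBInteraction{last_index:07d}"
    return last_index
```

Because flagged rows never receive an ID, this variant avoids the holes in the ID sequence that assigning IDs straightaway would produce.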

I think this method of prepopulating works better with the way that queries work, meaning that they query on something that already exists. This should work with the current OA code architecture (I should have realized earlier that the other method wouldn't work), in that it would be more like the NBP phenotype stuff. Actually, a plus side is that we could prepopulate all the matching WBGenes into the effector and effected fields, but you might not want that because it might be more work to remove a lot of genes from both fields than to just add the genes you want when you want to. I don't need you to prepopulate the WBGenes into the effector/effected fields. I would rather do it manually. -X

With the sentences being already in the curation tables, we can still load the sentences in the term info and show/store just the sentence ID (WBPaper12345678 s123) in the int_sentid table. Then we can put the sentenceID - sentence itself info directly in obo tables instead of special tables, because we don't need to track sentence order because we get new sentences by querying by timestamp on a curator. Having thought of this, I don't see how we could justify the massive work and complexity of doing it the other way, but we should talk about this -- J Arun is helping to extract sentences using textpresso. He will send xml results to J. At the same time, J is using 20101130 out results to work with other stuff. -X 12/02

Q : There's 394 entries from 2009-10-05 to 2010-08-17 (Query is SELECT * FROM int_sentid; Data looks like : 8533 | xiaodong001 : WBPaper00028425 : 244 | 2009-10-05 12:08:56.777921-07 ) I think this came from the population of old ggi_ to int_ tables, and I want to point out that those sentid don't have any obo_ data mapped for the sentences. If you really want it we can try to populate it, but since it was already done, I don't think you need it. Also, it says "xiaodong001 : " in front of the identifier, you don't need that for the future sets, right ? I would not need 'xiaodong001' identifier in the future. You don't need to populate those sentid. -X great -- J

Q : In populating the new textpresso sentences, do you want to populate the Paper field ? Upside is that you don't have to fill it in as you curate. Downside is that if someone queries on that paper, they'll see the entry and think it's a real entry / not realize it's uncurated from your pipeline. Don't populate the paper. I will enter the paper when I curate the sentences. -X great -- J

how the script that populates textpresso results to term info (obo_ ) and OA curation (int_ ) works

OBSOLETE: script currently at /home/postgres/work/pgpopulation/genegeneinteraction/20101130-xiaodong/populate_textpresso_ggi_to_OA.pl

script currently deletes all entries from 2010-12-03 forward as we're still testing this. TODO remove these deletions -- J

Populate textpresso data in tazendra OA: done on 20110110 -X

  1. cd to directory on tazendra: /home/acedb/xiaodong/textpresso_ggi
  2. mkdir directory_name (eg 20110106)
  3. cd directory_name (eg 20110106)
  4. get Arun's result file (35225-35725.txt in the directory)
  5. run script: ./populate_textpresso_ggi_to_OA.pl 20110106/35225-35725.txt WBPerson4793 > 20110106/35225-35725.pg (with first argument file_name as input file, and second argument WBPersonID, then output file)
  6. after running, '20110106/35225-35725.pg' should be in '20110106' directory.

mapping of tags to colours is : gene_celegans -> red, regulation -> blue, association -> dark green. These mappings happen before populating the tables (because the obo tables generically display what's on them, the data has to be parsed beforehand to show the way it should), so the colours are fixed once they're read in.

  • int_curator table is checked to get the highest pgid.
  • existing data in obo_name_int_sentid is read to compare against new sentence IDs when reading a new batch of data (so that if the same sentence ID is read again, it's skipped)
  • file is read line by line, getting the sentence ID and the sentence data.
  • sentence ID has any ' stripped (shouldn't be any)
  • sentence data has ' escaped to exist in postgres
  • if either sentence ID or sentence data is missing, the sentence is skipped and there's an error message.
  • if the sentence ID already existed in obo_name_int_sentid (see above), the sentence is skipped _without_ an error warning
  • color mapping changes <$tag> to <span style="color: $color;"> and </$tag> to </span>
  • sentence ID and sentence data are added to end of queue for obo_name_int_sentid and obo_data_int_sentid (last in, last out, so they're in order)
  • pgid counter increases by one
  • int_sentid and int_curator (WBPerson4793) with new pgid are added to the beginning of queue (last in, first out, so that they're in reverse order, so that querying for Arun curator [which also does in reverse order] gets the sentences to the OA in order from a double reverse)
  • queue is executed into postgres
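The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl of populate_textpresso_ggi_to_OA.pl: the input is assumed to arrive already split into (sentence ID, sentence data) pairs, the function names are hypothetical, "darkgreen" is assumed as the CSS name for dark green, quote escaping is assumed to be the usual doubling of ' for postgres, and the final execution of the queues against postgres is left out.

```python
# Tag-to-colour mapping described above ("darkgreen" is an assumed CSS name).
TAG_COLORS = {
    "gene_celegans": "red",
    "regulation": "blue",
    "association": "darkgreen",
}


def colorize(sentence):
    """Replace textpresso tags with coloured spans."""
    for tag, color in TAG_COLORS.items():
        sentence = sentence.replace(f"<{tag}>", f'<span style="color: {color};">')
        sentence = sentence.replace(f"</{tag}>", "</span>")
    return sentence


def build_queues(pairs, existing_ids, highest_pgid, curator="WBPerson4793"):
    """Build the obo_ and int_ queues for one batch of sentences."""
    obo_queue, int_queue, errors = [], [], []
    seen = set(existing_ids)
    pgid = highest_pgid
    for sentence_id, data in pairs:
        sentence_id = sentence_id.replace("'", "")  # strip any quotes from the ID
        if not sentence_id or not data:
            errors.append(f"missing ID or data: {sentence_id!r}")
            continue
        if sentence_id in seen:
            continue  # already loaded: skipped silently
        seen.add(sentence_id)
        data = colorize(data).replace("'", "''")  # escape quotes for postgres
        obo_queue.append((sentence_id, data))  # kept in file order
        pgid += 1
        # Prepend, so the int_ rows go in reversed; querying by curator
        # newest-first then returns the sentences in file order.
        int_queue.insert(0, (pgid, sentence_id, curator))
    return obo_queue, int_queue, errors
```

Since the colours are written into the stored data by colorize, changing a colour later means rewriting the stored sentences, which is why the mapping is described as fixed once read in.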

mapping of OA fields to postgres tables

  • Interaction ID -> int_name
  • Non_directional -> int_nondirectional
  • Interaction Type -> int_type
  • Effector Gene -> int_geneone
  • Effector Variation -> int_variationone
  • Effector Transgene Name -> int_transgeneone
  • Effector Transgene Gene -> int_transgeneonegene
  • Effector Other Type -> int_otheronetype
  • Effector Other -> int_otherone
  • Effected Gene -> int_genetwo
  • Effected Variation -> int_variationtwo
  • Effected Transgene Name -> int_transgenetwo
  • Effected Transgene Gene -> int_transgenetwogene
  • Effected Other Type -> int_othertwotype
  • Effected Other -> int_othertwo
  • Curator -> int_curator
  • Paper -> int_paper
  • Person -> int_person
  • RNAi -> int_rnai
  • Phenotype -> int_phenotype
  • Remark -> int_remark
  • Gene Extra -> int_geneextra
  • Variation Extra -> int_variationextra
  • Transgene Extra -> int_transgeneextra
  • Other Evi -> int_otherevi
  • Treatment -> int_treatment

TODO sandbox to tazendra

/home/postgres/work/pgpopulation/interaction/20101005_ggi_to_int/move_ggi_to_int.pl > logfile DONE 2011 01 06

(uncomment stuff to uncomment, then run) DONE 2011 01 06 /home/postgres/work/pgpopulation/interaction/20101110_ace_to_int/parse_ace_interaction_phenote.pl > logfile

(comment out the insertions to postgres and check with xiaodong that the IDs it's assigning are okay when doing this live) DONE 2011 01 06 /home/postgres/work/pgpopulation/interaction/20101116_assignIDs/assignIDs.pl > logfile

(uncomment stuff to uncomment, then run) DONE 2011 01 06 /home/postgres/work/pgpopulation/interaction/20101117_phenote_to_OA/interactionPhenoteToOA.pl > interactionPhenoteToOA.pg

add a 0500000 value to int_index so that it starts at 0500001. DONE 2011 01 06 INSERT INTO int_index VALUES ('0500000', '500000', 'WBPerson1760');

update interaction_ticket.cgi DONE 2011 01 06

create obo tables for sentid info. DONE 2011 01 06 /home/postgres/work/pgpopulation/obo_oa_ontologies/create_obo_int_sentid.pl

populate textpresso ggi data for a given batch. script needs to be moved, but at /home/postgres/work/pgpopulation/genegeneinteraction/20101130-xiaodong/populate_textpresso_ggi_to_OA.pl Waiting on new dataset from Arun, need to modify script for new arun format.

copy cronjob to assign interaction IDs /home/acedb/xiaodong/assigning_interaction_ids/assign_interaction_ids.pl and set to crontab 0 4 * * * /home/acedb/xiaodong/assigning_interaction_ids/assign_interaction_ids.pl DONE 2011 01 06
