create wikidata-to-mygene wrapper
Issue #81
resolved
We need to write a wrapper that executes a Wikidata SPARQL query and outputs a JSON document suitable for import into mygene.info. The proposed JSON object would look something like this (for Entrez Gene ID 1017):
{"1017": {
    "wikipedia": {
        "url_stub": "Cyclin-dependent kinase 2"
    },
    "wikidata": "Q14911732"
}}
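As a sanity check, the proposed record (values taken from the CDK2 example above) can be built and round-tripped through Python's json module to confirm the structure is valid JSON; this is only an illustrative sketch, not part of the wrapper itself.

```python
import json

# Illustrative only: one record in the proposed mygene.info import format,
# keyed by Entrez Gene ID, using the CDK2 example from the issue description.
record = {
    "1017": {
        "wikipedia": {
            "url_stub": "Cyclin-dependent kinase 2"
        },
        "wikidata": "Q14911732"
    }
}

# Serialize and parse back to verify the document is well-formed JSON.
doc = json.dumps(record)
print(doc)
```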
The wrapper will eventually be committed to https://bitbucket.org/sulab/mygene.hub under mygene.hub / src / dataload / sources /, but the code is best written and maintained by a team member here.
Comments (5)
-
Reporter: might as well create this wrapper for all Entrez Gene IDs in Wikidata...
-
This is my quick solution. Currently, it generates a 5.7 MB file containing a JSON document as defined above for all human and mouse genes in Wikidata.
import PBB_Core
import urllib.parse
import json

__author__ = 'Sebastian Burgstaller'
__license__ = 'AGPLv3'

prefix = '''
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
'''

entrez_query = '''
SELECT ?entrez_id ?cid ?article ?label WHERE {
    ?cid wdt:P351 ?entrez_id .
    OPTIONAL {
        ?cid rdfs:label ?label filter (lang(?label) = "en") .
        ?article schema:about ?cid .
        ?article schema:inLanguage "en" .
        FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
    }
}
'''

results = PBB_Core.WDItemEngine.execute_sparql_query(
    prefix=prefix, query=entrez_query)['results']['bindings']

entrez_map = {}
for x in results:
    entrez_id = x['entrez_id']['value']
    tmp = {
        entrez_id: {
            'wikipedia': {
                'url_stub': '',
            },
            'wikidata': x['cid']['value'].split('/')[-1]
        }
    }
    if 'article' in x:
        # decode the percent-encoded Wikipedia article title
        tmp[entrez_id]['wikipedia']['url_stub'] = urllib.parse.unquote(
            x['article']['value'].split('/')[-1])
    else:
        del tmp[entrez_id]['wikipedia']

    # keep the first mapping seen for each Entrez ID; skip duplicates
    if entrez_id not in entrez_map:
        entrez_map.update(tmp)

with open('mygene.info', 'w') as f:
    f.write(json.dumps(entrez_map))
-
- changed status to resolved
I think this is resolved now; just reopen if the code needs to be adjusted.
-
Version with reduced dependencies (plain requests instead of PBB_Core):
import requests
import urllib.parse
import json

__author__ = 'Sebastian Burgstaller'
__license__ = 'AGPLv3'

prefix = '''
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
'''

query = '''
SELECT ?entrez_id ?cid ?article ?label WHERE {
    ?cid wdt:P351 ?entrez_id .
    OPTIONAL {
        ?cid rdfs:label ?label filter (lang(?label) = "en") .
        ?article schema:about ?cid .
        ?article schema:inLanguage "en" .
        FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
    }
}
'''

params = {
    'query': prefix + query,
    'format': 'json'
}
headers = {
    'Accept': 'application/sparql-results+json'
}
url = 'https://query.wikidata.org/sparql'

results = requests.get(url, params=params,
                       headers=headers).json()['results']['bindings']

entrez_map = {}
for x in results:
    entrez_id = x['entrez_id']['value']
    tmp = {
        entrez_id: {
            'wikipedia': {
                'url_stub': '',
            },
            'wikidata': x['cid']['value'].split('/')[-1]
        }
    }
    if 'article' in x:
        # decode the percent-encoded Wikipedia article title
        tmp[entrez_id]['wikipedia']['url_stub'] = urllib.parse.unquote(
            x['article']['value'].split('/')[-1])
    else:
        del tmp[entrez_id]['wikipedia']

    # keep the first mapping seen for each Entrez ID; skip duplicates
    if entrez_id not in entrez_map:
        entrez_map.update(tmp)

with open('mygene.info', 'w') as f:
    f.write(json.dumps(entrez_map))
should mouse genes be included in some form here, or is the ortholog property on the human gene item sufficient?
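If the answer is that the ortholog links on the human items are sufficient, one possible approach (a sketch, not from the thread) is to restrict the SPARQL query to human gene items using the "found in taxon" property (P703) with Homo sapiens (Q15978631):

```python
# Sketch only: a variant of the query above restricted to human genes,
# leaving mouse coverage to the ortholog property on the human items.
# P703 = "found in taxon"; Q15978631 = Homo sapiens.
human_query = '''
SELECT ?entrez_id ?cid ?article WHERE {
    ?cid wdt:P351 ?entrez_id ;
         wdt:P703 wd:Q15978631 .
    OPTIONAL {
        ?article schema:about ?cid .
        ?article schema:inLanguage "en" .
        FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
    }
}
'''
```

The extra triple pattern would drop mouse gene items from the result set while leaving the rest of the wrapper unchanged.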