create wikidata-to-mygene wrapper

Issue #81 resolved
Andrew Su created an issue

We need a wrapper that executes a Wikidata SPARQL query and outputs a JSON document suitable for import into mygene.info. The proposed JSON object would look something like this (for Entrez Gene ID 1017):

{"1017": {
    "wikipedia": {
      "url_stub": "Cyclin-dependent kinase 2"
    },
    "wikidata": "Q14911732"
  }
}

This will eventually be committed to https://bitbucket.org/sulab/mygene.hub under mygene.hub / src / dataload / sources /, but the code is best written and maintained by a team member here.
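
As a rough illustration of the proposed shape, the sketch below checks that a document maps Entrez Gene IDs to records with a `wikidata` QID and an optional `wikipedia.url_stub`. The helper name `validate_document` and the exact rules (for example, treating `wikipedia` as optional) are assumptions made for illustration, not a mygene.info requirement.

```python
import re

QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")  # Wikidata item IDs look like "Q14911732"


def validate_document(doc):
    """Return a list of problems found in a {entrez_id: record} mapping.

    Hypothetical helper: the rules below are assumptions, not a mygene.info spec.
    """
    problems = []
    for entrez_id, record in doc.items():
        if not entrez_id.isdigit():
            problems.append(f"{entrez_id}: Entrez Gene ID is not numeric")
        qid = record.get("wikidata", "")
        if not QID_PATTERN.match(qid):
            problems.append(f"{entrez_id}: missing or malformed wikidata QID")
        wikipedia = record.get("wikipedia")
        if wikipedia is not None and not wikipedia.get("url_stub"):
            problems.append(f"{entrez_id}: wikipedia entry has no url_stub")
    return problems


# The example object from this issue, written as a Python dict.
example = {
    "1017": {
        "wikipedia": {"url_stub": "Cyclin-dependent kinase 2"},
        "wikidata": "Q14911732",
    }
}
print(validate_document(example))
```

An empty list means no problems were found under these assumed rules.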

Comments (5)

  1. Sebastian Burgstaller

    Should mouse genes be included in some form here, or is the ortholog property on the human gene item sufficient?

  2. Sebastian Burgstaller

    This is my quick solution. It currently generates a 5.7 MB file containing JSON, structured as defined above, for all human and mouse genes in Wikidata.

    import PBB_Core
    import urllib.parse
    import json
    
    __author__ = 'Sebastian Burgstaller'
    __license__ = 'AGPLv3'
    
    prefix = '''
        PREFIX schema: <http://schema.org/>
        PREFIX wd: <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    
    '''
    
    entrez_query = '''
        SELECT ?entrez_id ?cid ?article ?label WHERE {
            ?cid wdt:P351 ?entrez_id .
            OPTIONAL {
                ?cid rdfs:label ?label filter (lang(?label) = "en") .
                ?article schema:about ?cid .
                ?article schema:inLanguage "en" .
                FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
            }
        }
    '''
    
    results = PBB_Core.WDItemEngine.execute_sparql_query(prefix=prefix, query=entrez_query)['results']['bindings']
    
    # Map each Entrez Gene ID to its Wikidata QID and, if present, its English Wikipedia article.
    entrez_map = {}
    for x in results:
        entrez_id = x['entrez_id']['value']
        if entrez_id in entrez_map:
            # Keep the first binding seen for each Entrez Gene ID.
            continue
    
        # The QID is the last path segment of the entity URI.
        record = {'wikidata': x['cid']['value'].split('/')[-1]}
    
        if 'article' in x:
            # The URL stub is the percent-decoded last path segment of the article URL.
            record['wikipedia'] = {
                'url_stub': urllib.parse.unquote(x['article']['value'].split('/')[-1])
            }
    
        entrez_map[entrez_id] = record
    
    # Write the whole mapping as a single JSON document.
    with open('mygene.info', 'w') as f:
        json.dump(entrez_map, f)
    
  3. Sebastian Burgstaller

    A version with reduced dependencies (requests instead of PBB_Core):

    import requests
    import urllib.parse
    import json
    
    __author__ = 'Sebastian Burgstaller'
    __license__ = 'AGPLv3'
    
    prefix = '''
        PREFIX schema: <http://schema.org/>
        PREFIX wd: <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    
    '''
    
    query = '''
        SELECT ?entrez_id ?cid ?article ?label WHERE {
            ?cid wdt:P351 ?entrez_id .
            OPTIONAL {
                ?cid rdfs:label ?label filter (lang(?label) = "en") .
                ?article schema:about ?cid .
                ?article schema:inLanguage "en" .
                FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
            }
        }
    '''
    
    params = {
        'query': prefix + query,
        'format': 'json'
    }
    
    headers = {
        'Accept': 'application/sparql-results+json'
    }
    
    url = 'https://query.wikidata.org/sparql'
    
    results = requests.get(url, params=params, headers=headers).json()['results']['bindings']
    
    # Map each Entrez Gene ID to its Wikidata QID and, if present, its English Wikipedia article.
    entrez_map = {}
    for x in results:
        entrez_id = x['entrez_id']['value']
        if entrez_id in entrez_map:
            # Keep the first binding seen for each Entrez Gene ID.
            continue
    
        # The QID is the last path segment of the entity URI.
        record = {'wikidata': x['cid']['value'].split('/')[-1]}
    
        if 'article' in x:
            # The URL stub is the percent-decoded last path segment of the article URL.
            record['wikipedia'] = {
                'url_stub': urllib.parse.unquote(x['article']['value'].split('/')[-1])
            }
    
        entrez_map[entrez_id] = record
    
    # Write the whole mapping as a single JSON document.
    with open('mygene.info', 'w') as f:
        json.dump(entrez_map, f)
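
As a quick sanity check on the generated mapping, a sketch like the following could report how many genes were found and how many of them carry an English Wikipedia article. The helper name `summarize` is hypothetical, and it assumes the record shape written by the script above (a `wikipedia` key is present only when an article was found).

```python
import json


def summarize(entrez_map):
    """Count records overall and those that carry a Wikipedia url_stub."""
    with_article = sum(1 for record in entrez_map.values() if 'wikipedia' in record)
    return {
        'genes': len(entrez_map),
        'with_wikipedia': with_article,
        'without_wikipedia': len(entrez_map) - with_article,
    }


# Hand-written sample in the same shape as the script's output.
# The second record is a made-up placeholder, not real Wikidata content.
sample = {
    '1017': {'wikipedia': {'url_stub': 'Cyclin-dependent_kinase_2'}, 'wikidata': 'Q14911732'},
    '999999999': {'wikidata': 'Q1'},
}
print(json.dumps(summarize(sample)))
```

To run it on the real output, load the file first with `json.load` and pass the resulting dict to `summarize`.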
    