query wikidata for Gene Wiki Portal stats

Issue #91 resolved
Andrew Su created an issue

On the Gene Wiki Portal, we have a table that lists the "Top Gene Wiki articles", but it hasn't been updated since 2011. It would be great if we could replace this static table with a Wikidata/Lua query...

Perhaps an introductory intern project...

Comments (3)

  1. Sebastian Burgstaller

    The challenge here is getting the page-view counts. As far as I know, these are only available via stats.grok.se. For our paper, I wrote the following script to get an updated count of monthly page visits; it should be possible to extend it so that it writes to the Gene Wiki Portal page and updates the tables.

    import PBB_Core
    import requests
    import urllib.parse
    
    prefix = '''
    PREFIX schema: <http://schema.org/>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    '''
    
    query ='''
    SELECT ?entrez_id ?cid ?article ?label WHERE {
        ?cid wdt:P351 ?entrez_id .
        ?cid wdt:P703 wd:Q5 .
        OPTIONAL {
            ?cid rdfs:label ?label filter (lang(?label) = "en") .
        }
        ?article schema:about ?cid .
        ?article schema:inLanguage "en" .
        FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/") .
        FILTER (SUBSTR(str(?article), 1, 38) != "https://en.wikipedia.org/wiki/Template")
    }
    '''
    
    sparql_results = PBB_Core.WDItemEngine.execute_sparql_query(query=query, prefix=prefix)
    
    total_views = 0
    
    # Sum the November 2015 daily views of every article returned by the query.
    for count, i in enumerate(sparql_results['results']['bindings']):
        # The article title is the last path segment of the Wikipedia URL.
        article = i['article']['value'].split('/')[-1]
        article = urllib.parse.unquote(article)
        print(article)
    
        r = requests.get(url='http://stats.grok.se/json/en/201511/' + article)
    
        # 'daily_views' maps each day of the month to that day's view count.
        article_views = sum(int(day) for day in r.json()['daily_views'].values())
        total_views += article_views
    
        print(count, 'article views: ', article_views, 'total views: ', total_views, 'mean views: ', total_views/(count + 1))
    
  2. Sebastian Burgstaller

    I just realized that stats.grok.se has been basically out of service since mid-January 2016. As a much more powerful alternative, the Pageview API (https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI) was introduced. It makes the stats available via a nice REST API and even allows filtering by the kind of agent the page impression came from. This will allow much more detailed stats on the Gene Wiki project.
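
    A minimal sketch of how the monthly count could be fetched from the Pageview API instead of stats.grok.se. The helper names (pageview_url, sum_views, monthly_views) and the User-Agent string are made up for illustration, and the defaults (en.wikipedia, all-access, agent type "user") are assumptions rather than what our script necessarily uses.

```python
import calendar
import json
import urllib.parse
import urllib.request

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageview_url(article, year, month, project="en.wikipedia",
                 access="all-access", agent="user"):
    """Build the per-article URL covering one full calendar month (daily granularity)."""
    last_day = calendar.monthrange(year, month)[1]
    # Titles use underscores instead of spaces and must be percent-encoded.
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    start = f"{year}{month:02d}01"
    end = f"{year}{month:02d}{last_day:02d}"
    return f"{API_ROOT}/{project}/{access}/{agent}/{title}/daily/{start}/{end}"


def sum_views(payload):
    """Add up the 'views' field of every item in a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


def monthly_views(article, year, month):
    """Fetch and sum one month of daily views (performs a network request)."""
    request = urllib.request.Request(
        pageview_url(article, year, month),
        # Hypothetical User-Agent; replace with a real contact string.
        headers={"User-Agent": "gene-wiki-stats-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return sum_views(json.load(response))
```

    For example, monthly_views("Reelin", 2016, 2) would request the daily counts from 20160201 to 20160229 and return their sum.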

  3. Sebastian Burgstaller

    It looks like the Scribunto Lua module in Wikipedia currently does not allow querying anything outside Wikidata, which suggests that this cannot be implemented with Lua right now.

    As an alternative, I created gene_wiki_statistics.py (https://bitbucket.org/sulab/wikidatabots/src/34b666b0f03180363f28d52b4d34d4a491132858/reporting/gene_wiki_statistics.py?at=master&fileviewer=file-view-default). It gets all Gene Wiki pages and sums up their daily page views over one month. It then ranks the pages by number of views and by page size, and updates the table on the Gene Wiki Portal page. The script can be run once a month to keep the table up to date.
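
    The ranking step can be sketched roughly as follows. This is not the code from gene_wiki_statistics.py; the function name top_articles_wikitable and the column headings are made up for illustration, and the input is assumed to be a dict mapping article titles to monthly view counts.

```python
def top_articles_wikitable(views_by_article, top_n=10):
    """Render the top_n most-viewed articles as a wikitext table."""
    # Sort by view count, highest first, and keep only the top entries.
    ranked = sorted(views_by_article.items(),
                    key=lambda kv: kv[1], reverse=True)[:top_n]

    lines = ['{| class="wikitable sortable"',
             '! Rank !! Article !! Monthly views']
    for rank, (title, views) in enumerate(ranked, start=1):
        lines.append('|-')  # row separator in wikitext tables
        lines.append(f'| {rank} || [[{title}]] || {views}')
    lines.append('|}')
    return '\n'.join(lines)
```

    The returned string could then be written into the portal page by whatever bot framework does the editing; that part is left out here.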
