The Gene Wiki Project


The Gene Wiki project on Wikipedia is an initiative to create a comprehensive review article for every notable human gene. There are currently over 10,400 human genes in the Gene Wiki, and more are added at a steady rate.

We have developed a number of tools to analyze, expand, and maintain the Gene Wiki project. Initial development was largely in Java, but much of the core code is being ported to Python to make it easier to use and to script.


The following projects all fall under the Gene Wiki umbrella:

  • pygenewiki (this project): code to update and expand Gene Wiki pages and resources. Includes ProteinBoxBot, the GeneWiki API, and GeneWiki Generator (a BioGPS plugin).

  • mediawiki-sync: a Java daemon that copies changes from one MediaWiki installation to another, created to support the Gene Wiki mirror at GeneWiki+.

  • genewiki-miner: code related to information extraction and parsing for many of the papers and analyses we've done on the Gene Wiki.

  • genewiki-commons: common code used across the Java projects (required as a Maven dependency).

  • genewiki-generator: the previous version of this project, written in Java. Provides ProteinBoxBot and GeneWiki Generator (the BioGPS plugin).

This Project

Pygenewiki is a port of the original Java code at genewiki-generator to Python. It provides the new version of ProteinBoxBot (which updates the infoboxes on gene pages) and the backend to the BioGPS plugin (which enables the creation of new articles and infoboxes for any gene).

The API to the Gene Wiki is provided in genewiki.py. It provides a number of methods to perform everything from finding all articles in the Gene Wiki to generating new article and infobox wikitext, as well as launching the ProteinBoxBot.

proteinbox.py

The core data structure is contained in proteinbox.py. It encapsulates the information about a gene that is stored in its infobox, and can be created from a given Wikipedia page using the wikitext parser (parsers/wikitext.py) or from mygene.info through the parsers/mygeneinfo.py parser.
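
To illustrate the idea of an infobox-backed data structure, here is a minimal sketch. It is not the code in proteinbox.py: the class name ProteinBoxSketch, its methods, and the template name Example_infobox are all hypothetical, and the parsing is a simple regular expression rather than the real wikitext parser.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ProteinBoxSketch:
    """Hypothetical stand-in for the structure in proteinbox.py."""

    # Infobox parameters, e.g. {"Symbol": "TP53"} (field names are illustrative).
    fields: dict = field(default_factory=dict)

    @classmethod
    def from_wikitext(cls, wikitext):
        # Collect "| key = value" lines; nested templates are not handled.
        pairs = re.findall(
            r"^[ \t]*\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$",
            wikitext,
            flags=re.MULTILINE,
        )
        return cls(fields=dict(pairs))

    def to_wikitext(self, template="Example_infobox"):
        # "Example_infobox" is a placeholder, not the template the bot uses.
        lines = ["{{" + template]
        lines += [f" | {key} = {value}" for key, value in self.fields.items()]
        lines.append("}}")
        return "\n".join(lines)


page = "{{Example_infobox\n | Symbol = TP53\n | Name = tumor protein p53\n}}"
box = ProteinBoxSketch.from_wikitext(page)
print(box.fields["Symbol"])
print(box.to_wikitext())
```

Keeping parsing and rendering on the same object means a box read from a page can be compared with one built from another source before anything is written back.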

proteinboxbot.py

The operational loop for ProteinBoxBot is contained in proteinboxbot.py. It can be run from the command line independently of genewiki.py, and it includes logging and saved-state capabilities.
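
The combination of an update loop, logging, and saved state can be sketched as follows. This is an illustration under assumptions, not the logic in proteinboxbot.py: the function run_bot, the update_page callback, and the JSON state file are all hypothetical names and choices.

```python
import json
import logging
from pathlib import Path


def run_bot(titles, update_page, state_path):
    """Update each title once, recording progress so a rerun can resume.

    update_page is a caller-supplied function (hypothetical) that performs
    the actual edit for one page title.
    """
    log = logging.getLogger("proteinboxbot_sketch")
    state_file = Path(state_path)

    # Load the set of titles finished in earlier runs, if any.
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()

    for title in titles:
        if title in done:
            continue  # already handled in a previous run
        try:
            update_page(title)
        except Exception:
            # Log the failure and move on; the title stays unrecorded
            # so that the next run retries it.
            log.exception("update failed for %s", title)
            continue
        done.add(title)
        # Persist after every success so an interruption loses no progress.
        state_file.write_text(json.dumps(sorted(done)))
        log.info("updated %s", title)

    return done
```

Writing the state file after each successful page, rather than once at the end, costs extra disk writes but lets an interrupted run pick up where it stopped.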