Semantic BioGPS

This web tool lets users annotate parts of external bioinformatics web pages. Each annotation is essentially an XPath expression, which is used to crawl all pages for the content that the annotation selects.

Contents

  • File organization
  • API documentation
  • Set up a development environment

File organization

This project is divided into three parts. The front end is a single-page application written largely in JavaScript. It communicates with a JSON API. Finally, an extraction worker extracts the annotated content from the web.

Front end

The source for the front end is stored in the client directory. In development, this directory is visible to outside users so that the individual JavaScript files can be viewed directly for easy debugging. The css subdirectory contains Stylus files; Stylus is a CSS language similar to LESS. The templates subdirectory contains Handlebars templates. These files are automatically recompiled on change while the development server runs, with the help of grunt, a JavaScript build tool (see grunt.js for configuration).

In production, all files in the client directory are compiled into the builtClient directory, which is visible to the Internet instead. The public directory is visible from the outside in both development and production; it holds the static assets shared between the two.

Back end

The source for the API is in the api directory. The main file is server.js, although the server is started by running the top-level app.js.

Extractor

The extractor code is contained in the extractor directory. For more information, see the file docs/webcrawler.md.

Other directories

The task directory contains some grunt tasks used for compiling assets. The jobs directory contains a couple of jobs for bootstrapping the database with genes and BioGPS plugins.

API documentation

The tool exposes a REST API. The top-level resource is a URL template, which represents a plugin in BioGPS.

GET /api/urltemplates/

Returns an array of all URL templates.

Sample JSON object:

This will return an array of URL template JSON objects, which can be seen in the next section.

GET /api/urltemplates/[id]

Returns a single URL template by the BioGPS plugin ID.

Sample JSON object:

{
    id   : 66,
    name : "KEGG (human)",
    url  : "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}",
    _id  : "4feb405b3f14ee284d0000f1"
}
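
The url property above contains a {{EntrezGene}} placeholder. As a minimal sketch of how such a template could be expanded for one gene, the helper below replaces each {{Name}} placeholder with a supplied value. The function name expandUrlTemplate is hypothetical, and how the server and worker actually perform this expansion is an assumption; only the placeholder syntax comes from the sample object.

```javascript
// Hypothetical helper: fills {{Name}} placeholders in a URL template.
// The placeholder syntax is taken from the sample object above; the
// project's own expansion logic may differ (assumption).
function expandUrlTemplate(template, values) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      // Leave unknown placeholders untouched so the gap stays visible.
      return match;
    }
    return encodeURIComponent(String(values[key]));
  });
}

// URL template object copied from the sample above.
const kegg = {
  id: 66,
  name: "KEGG (human)",
  url: "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
};

console.log(expandUrlTemplate(kegg.url, { EntrezGene: 1017 }));
// prints http://www.genome.jp/dbget-bin/www_bget?hsa:1017
```

Unknown placeholders are deliberately left in place rather than replaced with an empty string, so a missing value shows up in the resulting URL instead of silently producing a wrong page.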

GET /api/urltemplates/[id]/verify?genes=1015,1017

Gets the annotation content from all annotations in one URL template for some specific genes.

Query string parameters:

?genes: A comma-separated list of the gene IDs to extract content for.

Sample JSON objects:

An array of annotation JSON objects, each with an additional property: contents.

[
    {
        geneAttribute : "pathways",
        objectType    : "",
        description   : "",
        _id           : "502fed234efe8e030a000001",
        _created      : "2012-08-18T19:29:39.525Z",
        xpath         : "/html/body/div[1]/table[1]/tbody/tr/td/table[...",
        contents      : [
            {
                geneid: "1017",
                content: "hsa04110  Cell cyclehsa04114  Oocyte..."
            },
            {
                geneid: "1015",
                content: ""
            }
        ]
    }
]
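
A verify response like the one above can be used to spot annotations whose XPath matches nothing for some genes. The following is a minimal client-side sketch, assuming the response has already been parsed into an array shaped like the sample. The helper name findEmptyExtractions is hypothetical and not part of the project.

```javascript
// Parsed response shaped like the sample above (some fields omitted).
const verifyResponse = [
  {
    geneAttribute: "pathways",
    xpath: "/html/body/div[1]/table[1]/tbody/tr/td/table[...",
    contents: [
      { geneid: "1017", content: "hsa04110  Cell cyclehsa04114  Oocyte..." },
      { geneid: "1015", content: "" }
    ]
  }
];

// Hypothetical helper: for each annotation, list the gene IDs whose
// extracted content is empty or whitespace only.
function findEmptyExtractions(annotations) {
  return annotations.map((annotation) => ({
    geneAttribute: annotation.geneAttribute,
    emptyGenes: annotation.contents
      .filter((entry) => entry.content.trim() === "")
      .map((entry) => entry.geneid)
  }));
}

console.log(JSON.stringify(findEmptyExtractions(verifyResponse)));
// prints [{"geneAttribute":"pathways","emptyGenes":["1015"]}]
```

An empty content string does not necessarily mean the XPath is wrong; the page for that gene may simply lack the annotated element.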

GET /api/urltemplates/[id]/download

Gets a text file with all currently extracted genes.

GET /api/urltemplates/[id]/annotations

Returns an array of all annotations for a specific URL template.

Sample JSON object:

This will return an array of annotation JSON objects, which can be seen in the next section.

POST /api/urltemplates/[id]/annotations

Creates a new annotation. Responds with the full, newly created annotation, as seen in the next section.

GET /api/urltemplates/[id]/annotations/[id]

Gets a single annotation.

Query string parameters:

?creator: If set to true, will fill in the _creator property in the JSON object.
?urltemplate: If set to true, will fill in the _urltemplate property in the JSON object.

Sample JSON object:

{
    _creator      : "501fc93001d3bc114f000067",
    _urltemplate  : "5019a2d3108db2fb3b000108",
    description   : "Unstructured text about the function of a gene",
    geneAttribute : "function_text",
    objectType    : "text",
    url           : "http://omim.org/entry/116953",
    xpath         : "//*[@id='results']/table[1]//*[contains(t...",
    _id           : "5020d01bd06cb6ee510001b1",
    _created      : "2012-08-07T08:21:47.603Z",
    _history      : [
                        "501b0bbfba17df683c00002b",
                        "501fefc33357e10e50000016"
                    ]
}
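
As a sketch of how a client might build the request URL for this endpoint, including the optional query string parameters, the helper below uses the standard URL class. The function name buildAnnotationUrl and the base address http://localhost:8000 are assumptions made for illustration; only the path layout and the parameter names come from the documentation above.

```javascript
// Hypothetical helper: builds the URL for fetching a single annotation.
// options.creator and options.urltemplate map to the ?creator and
// ?urltemplate query string parameters described above.
function buildAnnotationUrl(baseUrl, templateId, annotationId, options = {}) {
  const url = new URL(
    "/api/urltemplates/" + encodeURIComponent(templateId) +
      "/annotations/" + encodeURIComponent(annotationId),
    baseUrl
  );
  if (options.creator) {
    url.searchParams.set("creator", "true");
  }
  if (options.urltemplate) {
    url.searchParams.set("urltemplate", "true");
  }
  return url.toString();
}

// Assumed base address of a local development server.
console.log(
  buildAnnotationUrl("http://localhost:8000", 66, "5020d01bd06cb6ee510001b1", {
    creator: true
  })
);
// prints http://localhost:8000/api/urltemplates/66/annotations/5020d01bd06cb6ee510001b1?creator=true
```

The parameters are only added when requested, so the default URL stays identical to the plain endpoint path.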

GET /api/urltemplates/[id]/annotations/[id]/history

Returns an array of old versions of this annotation.

Query string parameters:

?limit: The max number of history items to return. Defaults to 200.

PUT /api/urltemplates/[id]/annotations/[id]

Updates a given annotation and responds with the updated annotation. Note that the _id property of the annotation is updated as well.
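
Because the _id changes on update, a client that keeps annotations keyed by _id has to re-key the entry after a PUT. The following is a minimal sketch of that bookkeeping, assuming a hypothetical client-side Map cache; the helper name applyUpdate and the new _id value are made up for illustration and are not part of the project.

```javascript
// Hypothetical client-side cache of annotations, keyed by _id.
const cache = new Map();
cache.set("5020d01bd06cb6ee510001b1", {
  _id: "5020d01bd06cb6ee510001b1",
  geneAttribute: "function_text"
});

// Hypothetical helper: replaces the cached entry stored under the old
// _id with the annotation returned by the PUT request, which carries
// a new _id.
function applyUpdate(annotationCache, previousId, updatedAnnotation) {
  annotationCache.delete(previousId);
  annotationCache.set(updatedAnnotation._id, updatedAnnotation);
  return annotationCache;
}

// "updated" stands in for the parsed PUT response; its _id is invented.
const updated = { _id: "000000000000000000000001", geneAttribute: "function_text" };
applyUpdate(cache, "5020d01bd06cb6ee510001b1", updated);

console.log(cache.has("5020d01bd06cb6ee510001b1")); // prints false
console.log(cache.has("000000000000000000000001")); // prints true
```

Any URL a client has stored for the old annotation should likewise be rebuilt from the _id in the response.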

DELETE /api/urltemplates/[id]/annotations/[id]

Deletes an annotation. The annotation remains in the database but is flagged as deleted and can no longer be reached through the API.

GET /api/data/[geneid]

Gets extracted data from all plugins for a specific gene.

Set up a development environment

To set up your own development environment, first make sure you have Node and MongoDB installed. For installation instructions, see the Node and MongoDB homepages.

Clone the project and install its dependencies:

$ hg clone https://bitbucket.org/sulab/semantic-biogps
$ cd semantic-biogps
$ npm install -d

Either create a virtualenv or install the Python dependencies directly:

$ pip install -r requirements.txt

To bootstrap the database with URL templates and genes, run

$ node jobs/all.js

To build the assets and start the server it is easiest to use the Makefile. Run

$ make

to lint the source files, start the server in development mode and start watching for changes to the assets.

Run in production

To run the server in production you first have to build the concatenated and minified production assets:

$ make build

If this is the first time you start the application in production, you have to bootstrap the database with URL templates and genes:

$ NODE_ENV=production node jobs/all.js

Start the server with

$ NODE_ENV=production node app.js

Worker

The project contains a worker that is used in the extraction of content from the various bioinformatics web pages.

To run it in development:

$ node worker.js

In production:

$ NODE_ENV=production node worker.js

Tests

The project's test coverage is limited at the moment. The tests mostly cover some of the helper methods and other fairly trivial cases, but the infrastructure for extending them is in place. To run the server tests:

$ make test

To run the browser tests, start the server in development mode:

$ make server

and navigate to localhost:8000/specrunner.html