Downloading EnteroBase genotyping schemes through the API

Many API users want to fetch the entire catalog of allele profiles and sequences for a given genotyping scheme. Some schemes, such as wgMLST, are about 1 GB in size and are very slow to download by paging through the API in the way you would for other data (e.g. strain metadata). We therefore try to provide daily 'dumps' of the entire database so that users can quickly capture its current state.

For users who wish to synchronize with EnteroBase, we recommend the following workflow:

  1. Download the daily dump of all data (.tar.gz) once, as a starting point.
  2. Append new information by polling EnteroBase at regular intervals through the main REST API.

Step 1. What are the schemes?

A simple request to the schemes endpoint returns a description of each scheme in EnteroBase, including a link to the static download of its ST profiles. To fetch only the download link, add the 'only_fields' parameter to the query string, e.g. 'only_fields=download_sts_link'.

#!html

http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?limit=1000
#!json

{
  "Schemes": [
    {
      "created": "2015-08-26T15:04:34.033635+00:00",
      "download_sts_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/profiles.list.gz",
      "label": "Achtman 7 Gene MLST",
      "lastmodified": "2015-12-07T17:50:17.186416+00:00",
      "scheme_barcode": "SAL_AA0001AA_SC",
      "scheme_name": "MLST_Achtman",
      "version": 1
    },
.....
  ],
  "links": {
    "paging": {
      "next": "http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?orderby=barcode&limit=4&sortorder=asc&offset=4"
    },
    "records": 3,
    "total_records": 13
  }
}
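
As a rough illustration of the request above, the sketch below builds a schemes URL with 'only_fields' and pulls the download links out of a decoded response. The helper names `schemes_url` and `extract_sts_links` are ours and are not part of the EnteroBase API. The response layout is assumed to match the sample above.

```python
from urllib.parse import urlencode

SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'


def schemes_url(database, limit=1000, only_fields=None):
    """Build a schemes request URL (hypothetical helper, not part of the API)."""
    params = {'limit': limit}
    if only_fields:
        params['only_fields'] = only_fields
    return '%s/api/v2.0/%s/schemes?%s' % (SERVER_ADDRESS, database, urlencode(params))


def extract_sts_links(payload):
    """Return the download_sts_link of every scheme record that has one."""
    links = []
    for record in payload.get('Schemes', []):
        link = record.get('download_sts_link')
        if link:
            links.append(link)
    return links


# A trimmed copy of the sample response shown above.
sample = {
    'Schemes': [
        {
            'download_sts_link': 'http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/profiles.list.gz',
            'scheme_name': 'MLST_Achtman',
        }
    ],
    'links': {'paging': {}, 'records': 1},
}

print(schemes_url('senterica', only_fields='download_sts_link'))
print(extract_sts_links(sample))
```

Records without a 'download_sts_link' are skipped rather than treated as errors, since the sample does not say whether every scheme provides one.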

Note that all downloads are redirected to one location (http://enterobase.warwick.ac.uk/schemes), where each combination of species (database) and scheme has its own subdirectory.

| Subdirectory | Scheme description |
|---|---|
| Salmonella.UoW | Salmonella Achtman 7 Gene |
| SALwgMLST.wgMLSTv1 | Salmonella Whole genome MLST (~21K) |
| SALwgMLST.cgMLSTv1 | Salmonella cgMLST version 2 |
| Escherichia.UoW | E. coli Achtman 7 Gene |
| ESCwgMLST.wgMLSTv1 | E. coli Whole genome MLST (~21K) |
| ESCwgMLST.cgMLSTv1 | E. coli cgMLST version 2 |
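
Going by the example links on this page, a static download URL appears to be the base location, the subdirectory and a file name joined together. The helper below is a minimal sketch under that assumption. `scheme_file_url` is a hypothetical name, and only the file names that appear in the sample responses (such as `profiles.list.gz` for `Salmonella.UoW`) are confirmed here. In practice it is safer to use the links returned by the API than to construct them by hand.

```python
SCHEMES_ROOT = 'http://enterobase.warwick.ac.uk/schemes'


def scheme_file_url(subdirectory, filename):
    """Join the schemes root, a subdirectory and a file name (assumed layout)."""
    return '/'.join([SCHEMES_ROOT, subdirectory, filename])


# The only combination confirmed by the sample schemes response.
print(scheme_file_url('Salmonella.UoW', 'profiles.list.gz'))
```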

Step 2. Downloading the ST profile tarball

If you follow the 'download_sts_link', even in a web browser, you will download a tar.gz file of the ST profiles.

The following Python snippet illustrates Step 1 and then downloads the tarball. Remember to open the output file in binary mode ('wb').

#!python
# Ported to Python 3 (urllib2 is not available there).
import base64
import json
import os
import urllib.request
from urllib.error import HTTPError

# Placeholder: replace with your own EnteroBase API token.
API_TOKEN = 'your-api-token'
SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'
DATABASE = 'senterica'
scheme = 'MLST_Achtman'

def __create_request(request_str):
    """Build a request that sends the API token via HTTP Basic authentication."""
    request = urllib.request.Request(request_str)
    credentials = ('%s:%s' % (API_TOKEN, '')).encode('utf-8')
    base64string = base64.b64encode(credentials).decode('ascii')
    request.add_header("Authorization", "Basic %s" % base64string)
    return request

address = SERVER_ADDRESS + '/api/v2.0/%s/schemes?scheme_name=%s&limit=%d&only_fields=download_sts_link' % (DATABASE, scheme, 4000)

# Create the output directory unless it already exists.
os.makedirs(scheme, exist_ok=True)
try:
    response = urllib.request.urlopen(__create_request(address))
    data = json.load(response)
    for scheme_record in data['Schemes']:
        profile_link = scheme_record.get('download_sts_link', None)
        if profile_link:
            response = urllib.request.urlopen(profile_link)
            # Binary mode: the download is compressed data, not text.
            with open(os.path.join(scheme, 'MLST-profiles.gz'), 'wb') as output_profile:
                output_profile.write(response.read())
except HTTPError as response_error:
    print('%d %s. <%s>\n Reason: %s' % (response_error.code,
                                        response_error.reason,
                                        response_error.geturl(),
                                        response_error.read()))
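
Once the file has been written, it can be worth checking it before you parse it. The sketch below assumes the download is a plain gzip-compressed text file (the sample link ends in `.list.gz`). The column layout is not documented on this page, so the code only returns the first few lines. `preview_gzip` is a hypothetical helper, and the demonstration file contains placeholder text rather than real profile data.

```python
import gzip
import os
import tempfile
from itertools import islice


def preview_gzip(path, max_lines=5):
    """Return the first few lines of a gzip-compressed text file."""
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as handle:
        return [line.rstrip('\n') for line in islice(handle, max_lines)]


# Demonstration with a small local file, so no network access is needed.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo-profiles.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as handle:
    handle.write('line one\nline two\nline three\n')

print(preview_gzip(demo_path, max_lines=2))
```

If the file turns out to be a tar archive instead, `gzip.open` will still decompress it, but the preview will show tar headers mixed with the data. In that case the standard `tarfile` module is the better tool.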

Step 3. Fetching the Alleles

Steps 1 and 2 give you the allele profiles (each ST and its vector of allele numbers). The allele sequences themselves are fetched through the Loci endpoint, and the same principle applies to downloading the allele sequence tarball.

If you are interested in both the allele sequences and the allele numbers, we recommend a workflow such as:

  1. Query 'Schemes' for all schemes
  2. For each scheme
    1. Download the ST profile tarball
    2. Query 'Loci' for all Loci in the scheme
      1. Download the Allele sequence tarball
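
The nested workflow above can be sketched as a pair of loops. To keep the example runnable offline, the network calls are passed in as functions: `fetch_json` and `download` are hypothetical callables that you would implement with your own HTTP client, and the URL patterns simply follow the examples on this page. Paging is ignored here for brevity.

```python
def sync_all(database, fetch_json, download,
             server='http://enterobase.warwick.ac.uk'):
    """Walk schemes and loci, calling download() on every static link found.

    fetch_json(url) must return the decoded JSON response as a dict.
    download(url) must save the file behind the URL.
    Returns the list of links that were passed to download().
    """
    base = '%s/api/v2.0/%s' % (server, database)
    downloaded = []
    # 1. Query 'Schemes' for all schemes.
    schemes = fetch_json(base + '/schemes?limit=1000')
    for scheme in schemes.get('Schemes', []):
        # 2.1 Download the ST profile tarball.
        sts_link = scheme.get('download_sts_link')
        if sts_link:
            download(sts_link)
            downloaded.append(sts_link)
        name = scheme.get('scheme_name')
        if not name:
            continue
        # 2.2 Query 'Loci' for all loci in the scheme.
        loci = fetch_json('%s/%s/loci?limit=50&scheme=%s' % (base, name, name))
        for locus in loci.get('loci', []):
            # 2.2.1 Download the allele sequence tarball.
            alleles_link = locus.get('download_alleles_link')
            if alleles_link:
                download(alleles_link)
                downloaded.append(alleles_link)
    return downloaded
```

Passing the two callables in keeps authentication, retries and file naming out of the loop, so they can be changed without touching the workflow itself.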

#!html

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/loci?limit=50&scheme=MLST_Achtman
This request returns results like the following:

#!json

{
  "links": {
    "paging": {},
    "records": 2,
    "total_records": 7
  },
  "loci": [
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/aroC.fasta.gz",
      "locus": "aroC",
      "locus_barcode": "SAL_AA0001AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/dnaN.fasta.gz",
      "locus": "dnaN",
      "locus_barcode": "SAL_AA0002AA_LO",
      "scheme": "UoW"
    },
.....
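
To work with a Loci response like the one above, a small helper can map each locus name to its download link. This is a minimal sketch that assumes the field names shown in the sample ('loci', 'locus', 'download_alleles_link'). `allele_links` is a hypothetical name, not part of the API.

```python
def allele_links(payload):
    """Map each locus name to its download_alleles_link."""
    return {
        record['locus']: record['download_alleles_link']
        for record in payload.get('loci', [])
        if record.get('locus') and record.get('download_alleles_link')
    }


# The two records from the sample response shown above.
sample = {
    'loci': [
        {
            'download_alleles_link': 'http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/aroC.fasta.gz',
            'locus': 'aroC',
        },
        {
            'download_alleles_link': 'http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/dnaN.fasta.gz',
            'locus': 'dnaN',
        },
    ]
}

for locus, link in sorted(allele_links(sample).items()):
    print(locus, link)
```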

Step 4. Keeping in Sync

Once you have the static files, you may wish to keep polling EnteroBase to stay up to date. This can be done with a simple request to the alleles endpoint, specifying the scheme, the locus and the number of days since your last update (the "reldate" parameter, short for "relative date"). The response is a list of allele sequences.

For example, suppose that we are interested in new allele sequences for aroC in the 7 gene MLST scheme for Salmonella in the last 20 days:

#!html

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/alleles?reldate=20&locus=aroC&limit=50
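
In a polling script, the 'reldate' value would typically be derived from the date of your last successful update. The sketch below shows one way to do that and to build the request URL shown above. `days_since` and `alleles_url` are hypothetical helper names, and whether 'reldate' counts the boundary day inclusively is not stated on this page.

```python
from datetime import date
from urllib.parse import urlencode


def days_since(last_sync, today=None):
    """Whole days between the last sync date and today (never negative)."""
    today = today or date.today()
    return max((today - last_sync).days, 0)


def alleles_url(database, scheme, locus, reldate, limit=50,
                server='http://enterobase.warwick.ac.uk'):
    """Build an alleles request URL following the example on this page."""
    query = urlencode({'reldate': reldate, 'locus': locus, 'limit': limit})
    return '%s/api/v2.0/%s/%s/alleles?%s' % (server, database, scheme, query)


# Example dates chosen so that the gap is 20 days.
reldate = days_since(date(2017, 1, 18), today=date(2017, 2, 7))
print(alleles_url('senterica', 'MLST_Achtman', 'aroC', reldate))
```

Because the inclusiveness of the boundary is an open question, you may prefer to add a day of overlap to the computed value and discard any records you already hold.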

Alternatively, to fetch the STs added in the last 20 days, use a request such as:

#!html

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?scheme=MLST_Achtman&show_alleles=false&limit=5&reldate=20
which would give you results like this:

#!json

{
  "STs": [
    {
      "ST_id": "3767",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7319AA_ST",
      "create_time": "2017-02-04 05:22:02.847270",
      "info": null,
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3953AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7319AA_ST"
    },
    {
      "ST_id": "3768",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7322AA_ST",
      "create_time": "2017-02-04 07:31:04.645023",
      "info": {
        "lineage": "",
        "st_complex": "61",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3967AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7322AA_ST"
    },
    {
      "ST_id": "3769",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7346AA_ST",
      "create_time": "2017-02-07 01:22:57.583287",
      "info": {
        "lineage": "",
        "st_complex": "401",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4517AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7346AA_ST"
    },
    {
      "ST_id": "3770",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7347AA_ST",
      "create_time": "2017-02-07 07:50:52.782618",
      "info": {
        "lineage": "",
        "st_complex": "65",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4540AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7347AA_ST"
    },
    {
      "ST_id": "3771",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7348AA_ST",
      "create_time": "2017-02-07 10:23:56.025904",
      "info": {
        "lineage": "",
        "st_complex": "205",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4606AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7348AA_ST"
    }
  ],
  "links": {
    "paging": {
      "next": "http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?limit=5&offset=5&show_alleles=false&scheme=MLST_Achtman&reldate=20"
    },
    "records": 5,
    "total_records": 33
  }
}
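
The sample above reports 5 records out of 33, so a complete update needs to follow the 'next' link under 'links' and 'paging' until it is no longer present. The generator below is a minimal sketch of that loop. `iter_records` and the `fetch_json` callable are hypothetical, and the assumption that the last page has no 'next' entry is based on the empty 'paging' object in the Loci sample.

```python
def iter_records(first_url, fetch_json, key='STs', max_pages=1000):
    """Yield records from every page, following links.paging.next.

    fetch_json(url) must return the decoded JSON response as a dict.
    max_pages is a safety limit against an endless chain of pages.
    """
    url = first_url
    pages = 0
    while url and pages < max_pages:
        payload = fetch_json(url)
        for record in payload.get(key, []):
            yield record
        # A missing 'next' entry is taken to mean the last page was reached.
        url = payload.get('links', {}).get('paging', {}).get('next')
        pages += 1
```

The same loop should work for other endpoints by changing `key`, for example to 'Schemes' or 'loci', as long as they use the same 'links' layout.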
