Downloading EnteroBase genotyping schemes through the API
Many API users want to fetch the entire catalog of allele profiles and sequences for a given genotyping scheme. Some schemes, such as wgMLST, are about 1 GB and are very slow to download by paging through the API in the way other data (e.g. strain metadata) is fetched. We therefore try to provide daily 'dumps' of the entire database so that users can quickly capture its current state.
For users who wish to synchronize with EnteroBase, we recommend a workflow of:
- Initially download the daily dump of all data (.tar.gz).
- Append new information by polling EnteroBase at regular intervals through the main REST API.
Step 1. What are the schemes?
A simple request to the schemes endpoint will give you a description of each scheme in EnteroBase, including a link to the static download for ST profiles. To fetch only the download link, use the 'only_fields' parameter: '?only_fields=download_sts_link'.
#!html http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?limit=1000
#!json
{
  "Schemes": [
    {
      "created": "2015-08-26T15:04:34.033635+00:00",
      "download_sts_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/profiles.list.gz",
      "label": "Achtman 7 Gene MLST",
      "lastmodified": "2015-12-07T17:50:17.186416+00:00",
      "scheme_barcode": "SAL_AA0001AA_SC",
      "scheme_name": "MLST_Achtman",
      "version": 1
    },
    .....
  ],
  "links": {
    "paging": {
      "next": "http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?orderby=barcode&limit=4&sortorder=asc&offset=4"
    },
    "records": 3,
    "total_records": 13
  }
}
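As a minimal sketch, the download links can be pulled out of a parsed schemes response as follows. The function name `sts_links` is hypothetical, the field names are taken from the example response, and the sample data is an abbreviated copy of it. It assumes full records were requested, since 'scheme_name' would be absent if 'only_fields' were restricted to the download link.

```python
def sts_links(schemes_response):
    """Map each scheme_name to its download_sts_link, skipping schemes without one."""
    links = {}
    for record in schemes_response.get("Schemes", []):
        link = record.get("download_sts_link")
        if link:
            links[record.get("scheme_name")] = link
    return links


# Abbreviated copy of the example response (illustration only).
sample = {
    "Schemes": [
        {
            "download_sts_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/profiles.list.gz",
            "label": "Achtman 7 Gene MLST",
            "scheme_name": "MLST_Achtman",
        }
    ]
}

print(sts_links(sample))
```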
Note that all downloads are redirected to one location (http://enterobase.warwick.ac.uk/schemes), with a scheme and a species (database) specifying a subdirectory.
Subdirectory | Scheme description |
---|---|
Salmonella.UoW | Salmonella Achtman 7 Gene |
SALwgMLST.wgMLSTv1 | Salmonella Whole genome MLST (~21K) |
SALwgMLST.cgMLSTv1 | Salmonella cgMLST version 2 |
Escherichia.UoW | E. coli Achtman 7 Gene |
ESCwgMLST.wgMLSTv1 | E. coli Whole genome MLST (~21K) |
ESCwgMLST.cgMLSTv1 | E. coli cgMLST version 2 |
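To check which subdirectory a given download link points at, the link can be split into its parts. This is a sketch only: the helper name `split_download_link` is hypothetical, and the path layout `schemes/<subdirectory>/<filename>` is inferred from the example links on this page rather than from a documented guarantee.

```python
from urllib.parse import urlparse


def split_download_link(link):
    """Return (subdirectory, filename) for a link under /schemes/."""
    parts = urlparse(link).path.strip("/").split("/")
    # Assumed path layout: schemes/<subdirectory>/<filename>
    if len(parts) != 3 or parts[0] != "schemes":
        raise ValueError("unexpected download link: %s" % link)
    return parts[1], parts[2]


subdir, filename = split_download_link(
    "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/profiles.list.gz"
)
print(subdir, filename)
```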
Step 2. Downloading the ST profile tar ball
If you follow the 'download_sts_link', even in your browser, you will be able to download a tar.gz file of the ST profiles.
The following Python snippet illustrates Step 1 and the download of the tar ball. Remember to open your output file in binary mode ('wb').
#!python
import base64
import json
import os
import urllib.request
from urllib.error import HTTPError

# Placeholder: replace with your own EnteroBase API token.
API_TOKEN = 'YOUR_API_TOKEN'
SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'
DATABASE = 'senterica'
scheme = 'MLST_Achtman'


def __create_request(request_str):
    # The API token is sent as the user name in HTTP Basic authentication.
    request = urllib.request.Request(request_str)
    credentials = ('%s:%s' % (API_TOKEN, '')).encode('utf-8')
    base64string = base64.b64encode(credentials).decode('ascii')
    request.add_header("Authorization", "Basic %s" % base64string)
    return request


address = SERVER_ADDRESS + '/api/v2.0/%s/schemes?scheme_name=%s&limit=%d&only_fields=download_sts_link' % (DATABASE, scheme, 4000)

# Create the output directory; do not fail if it already exists.
os.makedirs(scheme, exist_ok=True)
try:
    response = urllib.request.urlopen(__create_request(address))
    data = json.load(response)
    for scheme_record in data['Schemes']:
        profile_link = scheme_record.get('download_sts_link', None)
        if profile_link:
            response = urllib.request.urlopen(profile_link)
            # Binary mode ('wb'), because the download is a compressed file.
            with open(os.path.join(scheme, 'MLST-profiles.gz'), 'wb') as output_profile:
                output_profile.write(response.read())
except HTTPError as Response_error:
    print('%d %s. <%s>\n Reason: %s' % (Response_error.code, Response_error.msg, Response_error.geturl(), Response_error.read()))
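Since some dumps are around 1 GB, reading the whole response into memory with a single `read()` may not be desirable. One possible alternative, sketched here with a hypothetical helper named `save_stream`, is to copy the response to disk in fixed-size chunks using the standard library's `shutil.copyfileobj`. The in-memory streams below only stand in for the HTTP response and the output file.

```python
import io
import shutil


def save_stream(source, destination, chunk_size=1024 * 1024):
    """Copy a binary file-like object to another in fixed-size chunks."""
    shutil.copyfileobj(source, destination, chunk_size)


# Offline demonstration: in-memory streams stand in for the HTTP response
# and for the output file opened with 'wb'.
fake_response = io.BytesIO(b"\x1f\x8b" + b"\x00" * 10)
fake_file = io.BytesIO()
save_stream(fake_response, fake_file)
print(len(fake_file.getvalue()))
```

In the snippet above, `save_stream(response, output_profile)` would take the place of `output_profile.write(response.read())`.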
Step 3. Fetching the Alleles
Steps 1 and 2 give you the allele profiles (each ST and its vector of allele numbers). The allele sequences are fetched through the Loci endpoint, and the same principle applies to downloading the allele sequence tarballs.
If you are interested in both the allele sequences and the allele numbers, we recommend a workflow such as:
- Query 'Schemes' for all schemes
- For each scheme:
    - Download the ST profile tarball
    - Query 'Loci' for all loci in the scheme
    - Download the allele sequence tarball
#!html http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/loci?limit=50&scheme=MLST_Achtman
#!json
{
  "links": {
    "paging": {},
    "records": 2,
    "total_records": 7
  },
  "loci": [
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/aroC.fasta.gz",
      "locus": "aroC",
      "locus_barcode": "SAL_AA0001AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/dnaN.fasta.gz",
      "locus": "dnaN",
      "locus_barcode": "SAL_AA0002AA_LO",
      "scheme": "UoW"
    },
    .....
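The 'Query Loci, then download' part of the workflow could start from something like the sketch below, which collects one download link per locus from a parsed Loci response. The function name `allele_links` is hypothetical; the field names and sample records come from the example response.

```python
def allele_links(loci_response):
    """Map each locus name to its download_alleles_link."""
    return {
        record["locus"]: record["download_alleles_link"]
        for record in loci_response.get("loci", [])
        if record.get("download_alleles_link")
    }


# Abbreviated copy of the example response (illustration only).
sample_loci = {
    "loci": [
        {
            "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/aroC.fasta.gz",
            "locus": "aroC",
        },
        {
            "download_alleles_link": "http://enterobase.warwick.ac.uk/schemes/Salmonella.UoW/dnaN.fasta.gz",
            "locus": "dnaN",
        },
    ]
}

for locus, link in sorted(allele_links(sample_loci).items()):
    print(locus, link)
```

Each link can then be downloaded in the same way as the ST profile file in Step 2.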
Step 4. Keeping in Sync
Once you have the static files, you may wish to keep polling EnteroBase to stay up to date. This can be done with a simple request to the alleles endpoint, specifying the scheme, the locus and the number of days since your last update (the "reldate" parameter, for "relative date"). The response is a list of allele sequences.
For example, suppose that we are interested in new allele sequences for aroC in the 7 gene MLST scheme for Salmonella in the last 20 days:
#!html http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/alleles?reldate=20&locus=aroC&limit=50
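A request like this can be built from the date of your last synchronization. The sketch below is illustrative only: the helper names `days_since` and `alleles_url` are hypothetical, and clamping "reldate" to a minimum of 1 is an assumption of this example, not documented API behaviour. Adding a day of overlap may be a safer choice if the exact boundary handling of "reldate" matters to you.

```python
from datetime import date
from urllib.parse import urlencode

SERVER_ADDRESS = "http://enterobase.warwick.ac.uk"


def days_since(last_sync, today):
    """Whole days between the last sync and today, never less than 1 (assumption)."""
    return max((today - last_sync).days, 1)


def alleles_url(database, scheme, locus, reldate, limit=50):
    """Build an alleles query of the same shape as the example request."""
    query = urlencode({"reldate": reldate, "locus": locus, "limit": limit})
    return "%s/api/v2.0/%s/%s/alleles?%s" % (SERVER_ADDRESS, database, scheme, query)


reldate = days_since(date(2017, 1, 18), date(2017, 2, 7))
print(alleles_url("senterica", "MLST_Achtman", "aroC", reldate))
```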
Alternatively, fetching the new STs in the last 20 days would be a request such as:
#!html http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?scheme=MLST_Achtman&show_alleles=false&limit=5&reldate=20
#!json
{
  "STs": [
    {
      "ST_id": "3767",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7319AA_ST",
      "create_time": "2017-02-04 05:22:02.847270",
      "info": null,
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3953AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7319AA_ST"
    },
    {
      "ST_id": "3768",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7322AA_ST",
      "create_time": "2017-02-04 07:31:04.645023",
      "info": {
        "lineage": "",
        "st_complex": "61",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3967AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7322AA_ST"
    },
    {
      "ST_id": "3769",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7346AA_ST",
      "create_time": "2017-02-07 01:22:57.583287",
      "info": {
        "lineage": "",
        "st_complex": "401",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4517AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7346AA_ST"
    },
    {
      "ST_id": "3770",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7347AA_ST",
      "create_time": "2017-02-07 07:50:52.782618",
      "info": {
        "lineage": "",
        "st_complex": "65",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4540AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7347AA_ST"
    },
    {
      "ST_id": "3771",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7348AA_ST",
      "create_time": "2017-02-07 10:23:56.025904",
      "info": {
        "lineage": "",
        "st_complex": "205",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4606AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7348AA_ST"
    }
  ],
  "links": {
    "paging": {
      "next": "http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?limit=5&offset=5&show_alleles=false&scheme=MLST_Achtman&reldate=20"
    },
    "records": 5,
    "total_records": 33
  }
}
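Responses such as the STs example are paged: here 5 of 33 records are returned, with a "next" link under "links" → "paging". A minimal sketch of collecting every record by following those links is given below. The names `iter_records` and `fetch_json` are hypothetical; `fetch_json` stands for a function you supply that performs the authenticated request and returns the parsed JSON. The sketch assumes that the last page has no "next" entry, as in the Loci example, where "paging" is empty.

```python
def iter_records(first_url, fetch_json, key):
    """Yield every record under `key`, following links.paging.next until it is absent."""
    url = first_url
    while url:
        page = fetch_json(url)
        for record in page.get(key, []):
            yield record
        url = page.get("links", {}).get("paging", {}).get("next")


# Offline demonstration: two fake pages keyed by URL stand in for the API.
pages = {
    "page1": {"STs": [{"ST_id": "3767"}], "links": {"paging": {"next": "page2"}}},
    "page2": {"STs": [{"ST_id": "3768"}], "links": {"paging": {}}},
}
st_ids = [record["ST_id"] for record in iter_records("page1", pages.__getitem__, "STs")]
print(st_ids)
```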