MetaParser

Overview

MetaParser implements:

  • the automated downloading of all GenInfo Identifiers (GI numbers) in NCBI with the genus designation Salmonella, Escherichia / Shigella, Yersinia or Moraxella, together with the corresponding metadata (via the Entrez utilities)
  • the parsing of that metadata into a consistent EnteroBase format

To accomplish these tasks, MetaParser provides a standalone RESTful service that responds to JSON requests to download NCBI metadata associated with short read archives or assemblies. MetaParser also reformats the metadata into standard EnteroBase formats.

Given an API command, MetaParser:

  • Downloads metadata associated with genomic SRAs or with genomic assemblies.
  • Automatically parses downloaded metadata using the Natural Language Toolkit library in Python, and assigns source information to three EnteroBase categories:
  1. Source Niche
  2. Source Type
  3. Source Detail
  • Automatically parses geographic information using the Google Geocoding API, and assigns it to five EnteroBase categories:
  1. Continent
  2. Country
  3. First-level administrative division (Province/State)
  4. Second-level administrative division (County/Municipality)
  5. City
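
As a minimal sketch, the eight categories listed above could be held in two small record types. The class and field names here are hypothetical and are not taken from MetaParser itself; they only mirror the category names in the lists.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SourceInfo:
    """The three EnteroBase source categories (field names are hypothetical)."""
    source_niche: Optional[str] = None
    source_type: Optional[str] = None
    source_detail: Optional[str] = None


@dataclass
class GeoInfo:
    """The five EnteroBase geographic categories (field names are hypothetical)."""
    continent: Optional[str] = None
    country: Optional[str] = None
    admin1: Optional[str] = None  # first-level division (province/state)
    admin2: Optional[str] = None  # second-level division (county/municipality)
    city: Optional[str] = None


# Illustrative values only; this is not actual MetaParser output.
geo = GeoInfo(continent="Europe", country="United Kingdom", city="London")
record = {"source": asdict(SourceInfo()), "geo": asdict(geo)}
```

Leaving every field optional reflects that a free-text NCBI entry may not resolve to all levels; unresolved levels simply stay `None`.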

API

MetaParser URI

In the examples below, the MetaParser host in each URI depends on the configuration, that is, on which system MetaParser is running on.

Download metadata with SRA run accession code

Metadata for one or more SRA run accession codes may be downloaded with the meta query method. The example below downloads the metadata for the SRA run accession codes SRR1664288 and SRR1664287. An HTTP GET request is made to the URL:

http://<MetaParser Host>/ET/meta/metadata?run=SRR1664288,SRR1664287

Download metadata with sample accession code

Metadata can also be downloaded for a sample accession code using the meta query method. For example, to download the metadata for sample accession code SRS753484, make a GET request to the URL:

http://<MetaParser Host>/ET/meta/metadata?sample=SRS753484
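
The run and sample queries can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not MetaParser client code: `BASE_URL`, `metadata_url` and `fetch_json` are hypothetical names, the host is a placeholder, and the response body is assumed to be JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; replace with the host and prefix of your deployment.
BASE_URL = "http://metaparser.example.org/ET"


def metadata_url(base_url, run=None, sample=None):
    """Build a meta query URL for run accessions or for a sample accession."""
    if (run is None) == (sample is None):
        raise ValueError("give exactly one of run= or sample=")
    if run is not None:
        # Several run accessions are joined with commas, as in the run example.
        params = {"run": ",".join(run)}
    else:
        params = {"sample": sample}
    # safe="," keeps the commas literal instead of encoding them as %2C.
    return f"{base_url}/meta/metadata?{urlencode(params, safe=',')}"


def fetch_json(url, timeout=30):
    """GET the URL and decode the body, assuming the service returns JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(metadata_url(BASE_URL, sample="SRS753484"))` would issue the sample request shown above against the placeholder host.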

Downloading sets of metadata

Sets of metadata can be downloaded using the dump query method. The query can also be restricted to the most recently released metadata, which is useful, for example, for keeping a local archive of some subset of the data up to date. For example, to download all metadata released over the last 2 days for the specified taxon (i.e. organism "Salmonella"), use the URL below in a GET request. (The "reldate" parameter specifies how many days into the past the query should reach.)

http://<MetaParser Host>/ET/meta/dump?organism=Salmonella&reldate=2
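
When refreshing a local archive, the "reldate" value can be derived from the date of the last successful sync. The sketch below is one way to do that; `recent_dump_url` is a hypothetical helper, `base_url` stands for the host plus whatever path prefix your deployment uses, and how the service treats day boundaries is not specified here, so an overlap of a day may be prudent.

```python
from datetime import date
from urllib.parse import urlencode


def recent_dump_url(base_url, organism, last_sync, today=None):
    """Build a dump URL covering the days since the last local sync."""
    today = today or date.today()
    # reldate counts days into the past; use at least 1 so that a
    # same-day rerun still asks for the most recent day of data.
    days = max((today - last_sync).days, 1)
    params = {"organism": organism, "reldate": days}
    return f"{base_url}/meta/dump?{urlencode(params)}"
```

Passing `today` explicitly keeps the function deterministic, which makes it easy to test without depending on the system clock.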

The above example also illustrates the "organism" parameter, which is used again in the next example. That example fetches the data in chunks, using the "start" and "num" parameters, in order to walk through the results of a query. To download the first 10 sets of metadata for the specified taxon (useful for pagination), make a GET request to the URL:

http://<MetaParser Host>/ET/meta/dump?organism=Salmonella&start=0&num=10
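
Walking through a whole result set with "start" and "num" might look like the sketch below. It rests on assumptions that this page does not confirm: that "start" is a record offset, that each response decodes to a list of records, and that a short or empty page marks the end. `dump_page_url` and `iter_dump` are hypothetical names, and `fetch` is any caller-supplied function that takes a URL and returns the decoded list.

```python
from urllib.parse import urlencode


def dump_page_url(base_url, organism, start, num):
    """Build the dump URL for one chunk of results."""
    params = {"organism": organism, "start": start, "num": num}
    return f"{base_url}/meta/dump?{urlencode(params)}"


def iter_dump(fetch, base_url, organism, num=10):
    """Yield records chunk by chunk until a chunk comes back short or empty.

    Assumes `start` is a record offset and that `fetch(url)` returns a list.
    """
    start = 0
    while True:
        page = fetch(dump_page_url(base_url, organism, start, num))
        yield from page
        if len(page) < num:
            break
        start += num
```

Because `iter_dump` is a generator, the caller can stop early without requesting the remaining chunks.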

Downloading metadata associated with assemblies

The metadata associated with assemblies may also be downloaded, using the assembly query method, which accepts some of the same parameters. Note that the taxon is passed in the "term" parameter here rather than "organism". For example, to download metadata associated with the first 10 genomic assemblies for the specified taxon (i.e. "Salmonella"), make a GET request to:

http://<MetaParser Host>/ET/meta/assembly?term=Salmonella&start=0&num=10
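
Since the dump and assembly methods name the taxon parameter differently in the examples on this page, a small lookup table keeps client code from mixing them up. This is only a sketch: `TAXON_PARAM` and `listing_url` are hypothetical names, and `base_url` is a placeholder for the host and prefix of your deployment.

```python
from urllib.parse import urlencode

# Taxon parameter name per query method, as used in the examples on this page.
TAXON_PARAM = {"dump": "organism", "assembly": "term"}


def listing_url(base_url, method, taxon, start=0, num=10):
    """Build a chunked listing URL for the dump or assembly query method."""
    if method not in TAXON_PARAM:
        raise ValueError(f"unknown query method: {method!r}")
    params = {TAXON_PARAM[method]: taxon, "start": start, "num": num}
    return f"{base_url}/meta/{method}?{urlencode(params)}"
```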

Return EnteroBase designation for an NCBI source identifier

The EnteroBase designation for an NCBI source identifier may be obtained using the host_format query method. For example, to obtain the EnteroBase designation for the NCBI source identifier "tuna", make a GET request to:

http://<MetaParser Host>/ET/meta/api/host_format?raw=tuna

Return EnteroBase designation for a Google geographic category

The EnteroBase designation for a Google geographic category may be obtained using the geo_format query method. For example, to obtain the EnteroBase designation for "London", make a GET request to the URL:

http://<MetaParser Host>/ET/meta/api/geo_format?raw=London
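
Raw source and place names often contain spaces or punctuation, so the "raw" value should be URL-encoded before it is sent. The sketch below builds both kinds of format query; `format_url` is a hypothetical helper, and whether the service resolves multi-word values is an assumption that this page does not confirm.

```python
from urllib.parse import urlencode


def format_url(base_url, kind, raw):
    """Build a host_format or geo_format URL for one raw value."""
    if kind not in ("host", "geo"):
        raise ValueError("kind must be 'host' or 'geo'")
    # urlencode escapes spaces and reserved characters in the raw value.
    return f"{base_url}/meta/api/{kind}_format?{urlencode({'raw': raw})}"
```

With a placeholder base URL, `format_url(base, "host", "tuna")` and `format_url(base, "geo", "London")` reproduce the two requests shown above.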

Other available methods

Automated assignments can be edited by curators using the api/host_curation and api/batch_curation endpoints.
