OrientDbTechnicalApplicability

Introduction

This page covers a trial we'll conduct to determine the technical applicability of a candidate database to our system. Whilst the tests described below may have pass/fail conditions, a failing test does not represent a failure of the trial - we're running it precisely to find where the database fails.

The tests were initially conducted against OrientDb, but focus switched to CouchDb, as recorded in MoveToCouchDb.

High level

Producing the database to support our target system presents several challenges:

  • import high volumes of data quickly
  • store data in a structure that allows hierarchical schemas, so that minimal effort is required to export data in new data types
  • provide sufficient relations to support SQL-like filtering
  • use views/indexes for search
  • adopt the database without any other server-side technology: all necessary operations provided by the REST interface
  • avoid performance degradation once the system holds high volumes of data

Areas to be tested

Import data

  • Ideally the database will load (via REST) a track (in suitable JSON format) of 90000 positions in a second (follow-on: tests have shown this isn't achievable, I think).
  • This test should identify the time taken to load files of 1000, 5000 and 10000 positions.
  • Further, the test should also be run with some kind of indexing applied to document-level metadata - the indexing required for subsequent query/interrogation.
  • To verify this I think there's merit in generating 20 random files of the correct sizes that can be passed in (see the sketch after this list).
  • There's the chance that in the future we may wish to store larger datafiles more frequently (a 3MB file every two minutes, all year round - that's roughly 2GB per day).
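
As a concrete starting point, here is a minimal sketch of the import test, assuming a local CouchDb instance with a hypothetical "tracks" database; the document layout (SensorType, SensorName, an embedded positions array) is illustrative only, and the script needs a JavaScript runtime with fetch (e.g. Node 18+).

```javascript
var BASE = "http://localhost:5984/tracks"; // hypothetical database

// Build a random datafile of `count` positions.
function randomTrack(count, sensorType) {
  var positions = [];
  for (var i = 0; i < count; i++) {
    positions.push({
      time: new Date(Date.now() + i * 1000).toISOString(),
      lat: 50 + Math.random(),
      lon: -1 + Math.random()
    });
  }
  return { SensorType: sensorType, SensorName: "SENSOR_1", positions: positions };
}

// POST a single n-position track and report how long the round trip took.
async function timedImport(n) {
  var doc = randomTrack(n, "TRACK_2D");
  var start = Date.now();
  var res = await fetch(BASE, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc)
  });
  console.log(n + " positions: " + (Date.now() - start) + "ms (status " + res.status + ")");
}

// Run the 1000 / 5000 / 10000 position tests in sequence.
(async function () {
  var sizes = [1000, 5000, 10000];
  for (var i = 0; i < sizes.length; i++) {
    await timedImport(sizes[i]);
  }
})();
```

Repeating this over the 20 random files, with and without the metadata indexing in place, should give the comparative figures we're after.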

The details of some tests conducted and their results are here.

Hierarchical schemas

  • We have three 2d tracks stored and three 3d tracks. The datafiles will have a "SensorType" parameter, which is "TRACK_2D" or "TRACK_3D". We will query all tracks that are 2d (this should include both sets).
  • The test should also include a filtered export, such as where SENSOR_NAME="SENSOR_1".
  • Such tests should be performed via both REST and Java (note: in CouchDb the Java interfaces ultimately call REST anyway; this is probably still worth examining, to identify any time cost incurred by the Java interface).
  • We wish to have some kind of POST-time validation to ensure the integrity of the data in the database. A CouchDb validation function could do this. However, I don't think validation functions can refer to any information outside the document - so it's not possible for the document to carry just a schema-name and for the function to retrieve the schema for that name and perform a comparison/validation. A workaround may be for the document to contain a schema-name plus a definition of that schema; the validation function could then check that the rest of the document meets that schema. I guess the schema definition could contain an indication of whether x/y coords are required, plus the attributes/arrays expected in the properties object. JSON Schema doesn't appear to be formally ratified, though I have found a javascript JSON schema checker - maybe this could go into the validation function. On further reading it appears that multiple validation functions can be supplied, so we could have one for each schema-type/version: each function would check whether the incoming doc is of its version and, if so, validate it. I've put some thoughts into the Schema Validation Strategy. A sketch of this approach follows the list below.
  • I've been warned about the performance cost of validation functions. It's worth us learning about the comparative performance of validated vs non-validated imports.
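
As a sketch of what the above might look like, here are a view and a validation function of the kind that would sit (as strings) inside a design document such as _design/tracks. The field names, the "a 3d track also satisfies a 2d query" reading of the hierarchy, and the schema layout are all assumptions rather than a settled design.

```javascript
// Would be uploaded under views.by_sensor_type.map in the design document.
// One possible hierarchical reading: a 3d track also satisfies a 2d query,
// so TRACK_3D documents are emitted under the TRACK_2D key as well.
var bySensorTypeMap = function (doc) {
  if (doc.SensorType) {
    emit(doc.SensorType, doc.SensorName);
    if (doc.SensorType === "TRACK_3D") {
      emit("TRACK_2D", doc.SensorName);
    }
  }
};

// Would be uploaded as validate_doc_update. The document carries a schema
// name plus a definition of that schema; the function rejects any document
// whose body doesn't match its own embedded schema.
var validateDocUpdate = function (newDoc, oldDoc, userCtx) {
  if (!newDoc.schema || !newDoc.schema.name) {
    throw({ forbidden: "document must declare a schema" });
  }
  var required = newDoc.schema.requiredArrays || [];
  for (var i = 0; i < required.length; i++) {
    if (!(newDoc[required[i]] instanceof Array)) {
      throw({ forbidden: "missing required array: " + required[i] });
    }
  }
};
```

Querying /tracks/_design/tracks/_view/by_sensor_type?key="TRACK_2D" would then return both sets; the filtered export (SENSOR_NAME="SENSOR_1") could be a second view keyed on sensor name.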

Transformation strategies

The data model we've ended up with stores whole datafiles, each containing an embedded array of observations. We wish to use REST to handle these datafiles, but we also wish to use REST to extract the observations. So, a datafile may represent a track: we can use REST to get the whole object as JSON, but I also wish to use REST to get the series of positions as GeoJSON. That's our transformation problem.

In OrientDb there was a range of strategies for overcoming this. In CouchDb it looks like we simply use a Map/Reduce transform to create a View; the transform is run on the first GET. So we just define a transform for each output dataset and CouchDb will do the rest.
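
For example, a minimal sketch of such a transform (assuming the embedded positions array used above) could emit each observation keyed by [datafile id, time], so that a single GET returns a datafile's positions in time order.

```javascript
// Map function for a "positions" view: one row per observation, keyed by
// [datafile id, time]. A key range such as
//   ?startkey=["<doc-id>"]&endkey=["<doc-id>", {}]
// then returns just that datafile's positions, in time order.
var positionsMap = function (doc) {
  if (doc.positions) {
    for (var i = 0; i < doc.positions.length; i++) {
      var p = doc.positions[i];
      emit([doc._id, p.time], { lat: p.lat, lon: p.lon });
    }
  }
};
```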

There is a further transformation challenge. In addition to being able to export a whole datafile in a range of formats, we wish to be able to export a resampled datafile. Here's the scenario: we wish to show a GPS track in a web-browser. The track has 1000 positions. This would swamp the DOM in a web-browser, so we only need to show around 50 positions. At a minimum we could return every 20th position, to give a coarse outline of the track. A more advanced solution is to apply a Simplify algorithm so that the positions we do return are the most significant ones.

It looks like we have two transforms here - both as show() functions. One converts the document to GeoJSON. The other converts and simplifies the document to produce a trimmed GeoJSON.
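
A minimal sketch of those two show() functions, again assuming the positions layout above, and standing in a crude "every Nth point" decimation for a proper Simplify algorithm:

```javascript
// shows.geojson: render the whole datafile as a GeoJSON LineString.
var showGeoJson = function (doc, req) {
  var coords = [];
  for (var i = 0; i < doc.positions.length; i++) {
    coords.push([doc.positions[i].lon, doc.positions[i].lat]);
  }
  return {
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      type: "Feature",
      properties: { name: doc.SensorName },
      geometry: { type: "LineString", coordinates: coords }
    })
  };
};

// shows.geojson_trimmed: as above, but keep roughly `target` positions
// (default 50) by taking every Nth point.
var showTrimmedGeoJson = function (doc, req) {
  var target = parseInt(req.query.target || "50", 10);
  var step = Math.max(1, Math.floor(doc.positions.length / target));
  var coords = [];
  for (var i = 0; i < doc.positions.length; i += step) {
    coords.push([doc.positions[i].lon, doc.positions[i].lat]);
  }
  return {
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      type: "Feature",
      properties: { name: doc.SensorName, simplified: true },
      geometry: { type: "LineString", coordinates: coords }
    })
  };
};
```

Swapping the decimation for a Douglas-Peucker-style simplification inside the second function would give the "most significant positions" behaviour.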

Just to prove the breadth of CouchDb I think it's also worth experimenting with exporting a document as CSV. So we write the array names as a header row, then loop through the values, separating them by commas. I guess we should also set the correct content type for the CSV export.
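
A sketch of that CSV variant, with the content type set on the response (column names are illustrative):

```javascript
// shows.csv: header row of field names, then one comma-separated row per
// observation, returned with a text/csv content type.
var showCsv = function (doc, req) {
  var rows = ["time,lat,lon"];
  for (var i = 0; i < doc.positions.length; i++) {
    var p = doc.positions[i];
    rows.push([p.time, p.lat, p.lon].join(","));
  }
  return {
    headers: { "Content-Type": "text/csv" },
    body: rows.join("\n") + "\n"
  };
};
```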

Use of views/indexes

Document-stores appear to be, essentially, flat-file databases. But my dataset does have some relations: a sensor can be of a particular sensor type. To enable searching/filtering across these multiple concerns I believe CouchDb wants us to create an index. This index would have one entry per document, but would also contain the significant related keys from the lookup tables (sensors, platforms, trials, matching data types). Ideally, we would also put in some calculated data, such as the start/end time for that dataset, plus optionally the TL/BR bounds of the dataset to allow spatio-temporal searches.
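
One possible shape for that index, sketched as a view map function; the field names are assumptions, and the start/end times and bounds are calculated from the embedded positions:

```javascript
// "catalogue" view: one row per document, carrying the related keys plus
// calculated start/end times and top-left/bottom-right bounds.
var catalogueMap = function (doc) {
  if (!doc.positions || doc.positions.length === 0) return;
  var minLat = 90, maxLat = -90, minLon = 180, maxLon = -180;
  for (var i = 0; i < doc.positions.length; i++) {
    var p = doc.positions[i];
    if (p.lat < minLat) minLat = p.lat;
    if (p.lat > maxLat) maxLat = p.lat;
    if (p.lon < minLon) minLon = p.lon;
    if (p.lon > maxLon) maxLon = p.lon;
  }
  emit(doc._id, {
    platform: doc.PlatformName,
    platformType: doc.PlatformType,
    sensor: doc.SensorName,
    sensorType: doc.SensorType,
    trial: doc.Trial,
    start: doc.positions[0].time,
    end: doc.positions[doc.positions.length - 1].time,
    bounds: { tl: [maxLat, minLon], br: [minLat, maxLon] }
  });
};
```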

Once we've established whether indexes are of value to us, we should investigate the comparative speed of indexed vs non-indexed inserts and queries.

REST challenges

As with many applications we're separating the client from the server. But, significantly, I have an aspiration that the database (originally OrientDb, now CouchDb) will be the only server-side component in the architecture.

One of the clients I hope to use is a web-browser, using Ajax (probably GWT). Client-side browser applications experience much higher performance when passing JSON than other data-types. So, whilst we know CouchDb supports JSON, we need it to be data suitable for the client. A browser-based mapping application (OpenLayers) expects to receive position data in GeoJSON or GeoRSS. So, I wish CouchDb to be sufficiently versatile to export a series of positions in GeoJSON, or a series of time-stamped positions in GeoRSS.

REST API

So, whilst CouchDb has its own REST API, we wish to have an API that meets our UI requirements. Ideally they'll be the same API. If a completely different API is needed we may end up requiring a middle layer - hopefully not, though. Here are some thoughts on the API we're after:

  • Categories: we need to GET the platform types, platform names, sensor types and sensor names
  • Availability: we need to GET the periods for which data is held
  • Documents: we need to GET the metadata for documents, filtered by category. We also need to be able to POST new documents.
  • Data: we need to GET the data held within documents, in a range of formats - including GeoJSON, GeoRSS and CSV
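
As a sketch of how those four families might map onto plain CouchDb URLs without a middle layer - the design-doc, view and show names are hypothetical, matching the sketches earlier on this page:

```javascript
var BASE = "http://localhost:5984/tracks";

// Categories: distinct sensor types, assuming the by_sensor_type view is
// given a _count reduce so that ?group=true collapses to one row per key.
async function getSensorTypes() {
  var res = await fetch(BASE + "/_design/tracks/_view/by_sensor_type?group=true");
  var body = await res.json();
  return body.rows.map(function (r) { return r.key; });
}

// Documents/Availability: document-level metadata from the catalogue view
// (start/end times included), with no embedded positions.
async function getCatalogue() {
  var res = await fetch(BASE + "/_design/tracks/_view/catalogue");
  return (await res.json()).rows;
}

// Documents: POST a new datafile.
async function postDatafile(datafile) {
  return fetch(BASE, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(datafile)
  });
}

// Data: one datafile rendered by a show function in the requested format.
async function getGeoJson(docId) {
  var res = await fetch(BASE + "/_design/tracks/_show/geojson/" + docId);
  return res.json();
}
```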

High volume performance

Holding these high volumes should not noticeably slow down the database. Hey, it doesn't for Amazon ;-) A suitable test for this would be to insert 500Gb (or a similarly large volume), then check that we can still run the tests described in the rest of this trial.

SQL-like filtering

  • Users will need to search/browse to find datafiles of interest. They may wish to do this along one or more of Platform, PlatformType, Sensor, SensorType, Trial, start-time and end-time. As the user selects one or more items from one listing, the available items are filtered in the other lists. Yes, this is a UI function, but I wish the backend to provide suitable filtering/querying.
  • I propose this test is implemented by adding a couple of other metadata fields to the document(s). One field could be platform (platform_1, platform_2, ...). The other would be a 'continuous' variable, such as start-time and end-time, so we can try to filter documents within a set period (or which overlap that set period). (Since we're in non-normalised NoSQL I'm happy for the start/end time values to be pre-defined document attributes, rather than calculated on-the-fly.) We can then fire in queries to successively filter the body of documents - simulating "live-search".
  • Note: these searches are only expected to return the id/metadata for matching documents. The user is browsing/searching to decide which documents to download/view, so we don't need the arrays of points to be returned.
  • Query speed through a large volume of documents is of value. Whilst the database may contain a high number of data observations, these queries will only return document-level metadata (title, platform, start-time, etc). So we'll do a comparative test of 1000 documents with short datasets (5 x,y,time values), and a test of 1000 documents with long datasets (1000 x,y,time values). Hopefully the latter won't be much slower.
  • NOTE: I've been doing some reading-up on CouchDb. It appears that CouchDb can't do this type of search natively (it's called Faceted Search) - see here. Faceted search is provided by the Lucene-Couch plugin, or via Elastic Search. A partial, view-based approximation is sketched below.
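
This isn't full faceted search, but as a sketch of how far plain views can go: a catalogue-style view keyed by [platform, start-time] lets startkey/endkey ranges narrow the listing one facet at a time (field names as in the earlier sketches).

```javascript
// Map function keyed by [platform, start-time]; a query such as
//   ?startkey=["platform_1","2011-01-01"]&endkey=["platform_1","2011-02-01"]
// returns only platform_1 documents starting in that window, and only
// their document-level metadata.
var byPlatformAndTimeMap = function (doc) {
  if (doc.positions && doc.positions.length > 0) {
    emit([doc.PlatformName, doc.positions[0].time], {
      sensor: doc.SensorName,
      sensorType: doc.SensorType,
      end: doc.positions[doc.positions.length - 1].time
    });
  }
};
```

Combining several facets in a single query (platform and sensor and time overlap) is where this breaks down, which is the gap the Lucene-Couch plugin or Elastic Search would fill.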
