gnd / HighLevelReqts

Overview

At the heart of this system will be a database that stores large volumes of trials data. This data will be almost exclusively time-stamped, and much of it will have a spatial element. The main data type stored will be vehicle tracks, though a range of vehicle types will be catered for - and each type typically has different attributes.

Around the database will be an ecosystem that transforms data ready for insertion into the database, and transforms it again on the way out.

For data in the database, it is hoped that a spatial view can be presented to network users (probably an OpenLayers view of vehicle tracks).

Beyond these initial thoughts I've also invested some effort in Use Case Scenarios, Schema Thoughts, tentative requirements and the Underlying problem.

Concepts

  • Vehicle: something that moves
  • Vehicle track: series of time-stamped locations for that vehicle
  • Sensor: something that can make recordings (light level meter, water flow recorder)
  • Sensor recording: series of time-stamped recordings from a sensor
  • Trial: specific time period under which something is trialled (an experiment), in which vehicles and sensors participate
  • Analysis Tool: piece of software that analyses vehicle tracks or sensor recordings, typically in a proprietary data format
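The concepts above can be sketched as Java types. This is purely illustrative - the names and fields are my assumptions, not a committed schema:

```java
import java.time.Instant;
import java.util.List;

// Illustrative domain sketch of the concepts above; names and fields
// are assumptions, not a fixed design.
public class Concepts {

    // Vehicle: something that moves (type distinguishes car, aircraft, ...)
    record Vehicle(String name, String type) {}

    // A single time-stamped location; altitude may be absent for 2D tracks.
    record TrackPoint(Instant time, double lat, double lon, Double altitude) {}

    // Vehicle track: series of time-stamped locations for one vehicle.
    record VehicleTrack(Vehicle vehicle, List<TrackPoint> points) {}

    // Sensor: something that can make recordings.
    record Sensor(String name, String kind) {}

    // Sensor recording: series of time-stamped measurements from a sensor.
    record Measurement(Instant time, String attribute, double value) {}
    record SensorRecording(Sensor sensor, List<Measurement> measurements) {}

    // Trial: named time period in which vehicles and sensors participate.
    record Trial(String name, Instant start, Instant end,
                 List<Vehicle> vehicles, List<Sensor> sensors) {}
}
```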

As an illustration of my deep, deep understanding of NoSQL and document-oriented databases, here's a schema for my data (produced to record my perceived data relationships): schema diag

Note: the data-store may not be structured like the above diagram. It's just a record of names/relationships.

Volumes

It is expected that the system will be required to read in around a million observations per month, extracted from 100 data files.

Capabilities

The system should be able to:

  • take data in a range of data formats
  • encode the data into a common schema
  • store this data into a spatial database
  • display vehicle tracks from the database in a web-browser
  • extract data from the database
  • transform extracted data back to its native format

Data flows

high level flows

Note: the data-nugget Pack/Unpack processing in the above diagram is not a requirement. It's a suggestion for how to overcome the conundrum of storing data in a range of schemas. Other solutions (such as an attached table per data-type that contains additional fields) are welcome.

Data flow description

Input data

While it would appear that vehicle tracks will all be in the same format, a car track is typically just 2D, whereas an aircraft track will include altitude, and may also contain pitch, roll and yaw attributes.

For sensor data recordings there could be a very wide range of data formats/attributes - though they will all have a time-stamp. For example, an engine recording system will have revs/min, throttle opening and fuel usage.

Spatial Database

It is expected that the database will store at least these concepts:

  • Observations: all of the data observations (all have time, some have location)
  • Recordings: details of original source data-files (type, name, reference)
  • Trials: the named time periods when trials were undertaken
  • Vehicles: expanded detail regarding vehicles (name, type)

The database will be designed such that it is able to provide spatial track views to OpenLayers. It should also provide quick extraction of data by recording-type, trial name, or vehicle name.
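The "quick extraction" paths above (by recording-type, trial name, or vehicle name) would really be database queries, but the shape of them can be sketched in-memory. The record and method names here are illustrative assumptions:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch of the extraction paths the database should serve;
// record and field names are assumptions, not the real schema.
public class Extraction {

    record Obs(String trial, String vehicle, String recordingType) {}

    // Extract observations matching a filter (in the real system, a query).
    static List<Obs> select(List<Obs> all, Predicate<Obs> filter) {
        return all.stream().filter(filter).collect(Collectors.toList());
    }

    // The three extraction paths named in the text.
    static Predicate<Obs> byTrial(String name)   { return o -> o.trial().equals(name); }
    static Predicate<Obs> byVehicle(String name) { return o -> o.vehicle().equals(name); }
    static Predicate<Obs> byType(String type)    { return o -> o.recordingType().equals(type); }
}
```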

Transform (Pack)

This process is a pluggable framework that includes a 'reader' specification for each data type; I expect around one new (or modified) data type per month. The transformer will extract the common attributes necessary for the spatial database, then encode the remaining fields into a typed data-nugget (probably XML).
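The pack framework might look something like the sketch below - one reader plugin per data type, each yielding the common attributes plus an XML nugget of the leftovers. Interface and method names are my assumptions:

```java
import java.time.Instant;
import java.util.Map;

// Sketch of a pluggable 'pack' transform. Interface and class names are
// illustrative assumptions, not a fixed design.
public class Pack {

    // Common attributes every observation supplies for the spatial store.
    record CommonFields(Instant time, Double lat, Double lon) {}

    // One parsed observation: common fields plus the type-specific remainder.
    record Observation(CommonFields common, String nugget) {}

    // One 'reader' plugin per input data type.
    interface Reader {
        String dataType();                   // e.g. "aircraft-track-v1"
        Observation parseLine(String line);  // one record of the source file
    }

    // Encode the non-common fields as a small typed XML nugget.
    static String toNugget(String dataType, Map<String, String> extras) {
        StringBuilder sb = new StringBuilder("<nugget type=\"" + dataType + "\">");
        extras.forEach((k, v) -> sb.append("<").append(k).append(">")
                                   .append(v)
                                   .append("</").append(k).append(">"));
        return sb.append("</nugget>").toString();
    }
}
```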

Transform (Unpack)

This pluggable framework includes a 'writer' specification for each analysis tool format. It takes the database fields (plus decoded data nuggets where necessary) and produces an output file in the specified format.
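The unpack side mirrors this - one writer plugin per analysis-tool format, fed rows of core fields plus decoded nugget values. Again the names are illustrative; the CSV helper covers the schema-agnostic (Excel-style) case described under Data output:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the matching 'unpack' transform: one writer per analysis-tool
// format. Names are illustrative assumptions.
public class Unpack {

    // A row as pulled from the database: core fields plus decoded nugget values.
    record Row(Map<String, String> fields) {}

    // One 'writer' plugin per analysis-tool format.
    interface Writer {
        String format();              // e.g. "csv-for-toolX"
        String write(List<Row> rows); // whole output file as text
    }

    // A trivial CSV writer: one column per requested attribute,
    // blank where a row lacks that attribute.
    static String toCsv(List<String> columns, List<Row> rows) {
        String header = String.join(",", columns);
        String body = rows.stream()
                .map(r -> columns.stream()
                        .map(c -> r.fields().getOrDefault(c, ""))
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("\n"));
        return header + "\n" + body;
    }
}
```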

Note: the pack/unpack into XML has a shortcoming: the relational database is not able to filter according to these parameters; it is only able to filter data based on the core attributes.

Data output

Many analysis tools only require time and location: core database fields. Other specific tools, however, require additional columns. A schema-agnostic application like Excel will just receive a column for each attribute in the selected data, whereas Google Earth may receive colour coding according to the value of a specific attribute.

Graphical view

Instead of having to learn/install analysis tools, some users just need a very quick look at the data. To meet this requirement it should be possible to quickly open a browser-based plot of vehicle tracks. Potentially it would also be valuable to quickly view a dataset in tabular form, or to view a plot of one or more variables against time.

Global system constraints

  • I have a preference for using Java, though this is not a formal requirement for this venture.
  • I've installed and played with Talend and Pentaho Kettle. I haven't used them in anger, but as a Java developer I've a subtle preference for the Java-based Kettle. I've also a slight emotional preference for Kettle - Talend seemed overly keen to push me into a commercial version. One year on, however, the Talend product has seemed easier and quicker to use. It sits on top of the Eclipse Framework, in which I spend most of my working day - this familiarity goes a long way.

Local system constraints

I'm developing this system with a particular installation in mind. This installation has the following constraints:

  • Cannot rely on Internet connection, or Internet services
  • Windows XP clients (IE8)
  • Windows 2003 Server
  • The system won't have a dedicated db-administrator. I suspect there will just be me. I've some Access/SqlServer/Postgres skills
  • Beyond myself there is a supporting data operator who has client-side .Net development skills.
