gnd / TentativeRequirements

User interaction models

I see 3 modes of usage for the data warehouse, depending on the skill level of the user: see user interaction models.

Permission levels/authentication

I don't initially have a need for permissions or authentication, but there is strong merit in the framework supporting them. I envisage 4 permission levels (sketched in code after this list):

  • Viewer: able to browse and export data
  • Author: able to submit new datasets
  • Analyst: able to produce new datasets based upon existing ones
  • Maintainer: able to edit metadata (lookups) and delete datasets
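
A minimal sketch of how those levels might be modelled, assuming a simple ordered hierarchy (Role and can_delete_dataset are illustrative names, not part of any existing code):

    from enum import IntEnum

    class Role(IntEnum):
        """Hypothetical permission levels, ordered from least to most privileged."""
        VIEWER = 1      # browse and export data
        AUTHOR = 2      # submit new datasets
        ANALYST = 3     # produce new datasets based upon existing ones
        MAINTAINER = 4  # edit metadata (lookups) and delete datasets

    def can_delete_dataset(role: Role) -> bool:
        """Only the most privileged level may delete datasets."""
        return role >= Role.MAINTAINER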

Relaxed spatial search

Yes, there's a strong geospatial element to this data. But, no, I don't immediately think I need spatial searching of observations. Hypocrisy, I know. Yes, it may be worth demoting the Geo part of the project name.

The reason for this is that it may be sufficient just to retain a record of the bounds of each document/datafile. This relegates the system from searching through all records for items that fall within a given bounds (a typical GIS query) to searching for all documents whose bounds overlap the query bounds. We can then elect to do a more detailed search on that subset of the data.

So we've dropped from querying millions/billions of data points down to thousands/tens of thousands of documents.
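
A minimal sketch of this coarse filter, assuming each document's bounds are recorded as an axis-aligned box (the Bounds class and candidate_documents are illustrative names, not existing code):

    from dataclasses import dataclass

    @dataclass
    class Bounds:
        """Axis-aligned bounding box recorded once per document/datafile."""
        min_x: float
        min_y: float
        max_x: float
        max_y: float

        def overlaps(self, other: "Bounds") -> bool:
            """True if the two boxes intersect (or touch)."""
            return (self.min_x <= other.max_x and self.max_x >= other.min_x and
                    self.min_y <= other.max_y and self.max_y >= other.min_y)

    def candidate_documents(documents, query: Bounds):
        """Coarse filter: keep only documents whose recorded bounds overlap the query.

        `documents` is an iterable of (doc_id, Bounds) pairs; a more detailed
        per-record search can then run on this much smaller subset.
        """
        return [doc_id for doc_id, bounds in documents if bounds.overlaps(query)]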

Linked analysis data

In addition to the warehouse containing raw observations, analysts may wish to store calculated datasets, where each record may relate to one or more observations at that timestamp. E.g. the new data may be the value that the sensor should have recorded at that time, or the new value could be the id of which wheel had the most grip.
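
A minimal sketch of how such a calculated dataset might be linked back to the raw observations; the class and field names are invented for illustration:

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class DerivedRecord:
        """One calculated value, tied to the raw observations at the same timestamp."""
        timestamp: datetime
        value: float                 # e.g. what the sensor *should* have recorded, or a wheel id
        source_observation_ids: list = field(default_factory=list)

    @dataclass
    class DerivedDataset:
        """A calculated dataset produced by an analyst, stored alongside the raw ones."""
        dataset_id: str
        description: str
        records: list = field(default_factory=list)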

Analyse contents of data-nuggets

Initially I'd expected the non-common attributes of each data type to be encoded in a data nugget and not utilised by the data-warehouse itself: they would only get unpacked as they are passed to a dedicated analysis package outside the data-warehouse.

But a design that stored the non-common attributes natively in the database would be welcomed, making them available for analysis by database tools.

A potential solution to this is to have a table of the core common attributes, with a supporting table created for each data format that holds its non-common attributes. A view can then be generated that combines both tables, providing a virtual reconstruction of the original dataset that is available for SQL manipulation.
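
A minimal sketch of that core-table / supporting-table / view arrangement, using SQLite purely for illustration; the table, column and view names are all invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Core table: attributes common to every data format.
    conn.execute("""
        CREATE TABLE observation (
            obs_id     INTEGER PRIMARY KEY,
            dataset_id TEXT NOT NULL,
            timestamp  TEXT NOT NULL,
            lat        REAL,
            lon        REAL
        )""")

    # Supporting table, one per data format, holding that format's non-common attributes.
    conn.execute("""
        CREATE TABLE wheel_sensor_extra (
            obs_id   INTEGER PRIMARY KEY REFERENCES observation(obs_id),
            wheel_id INTEGER,
            grip     REAL
        )""")

    # A view combines the two, exposing the original dataset to SQL manipulation.
    conn.execute("""
        CREATE VIEW wheel_sensor AS
        SELECT o.obs_id, o.dataset_id, o.timestamp, o.lat, o.lon,
               e.wheel_id, e.grip
        FROM observation o
        JOIN wheel_sensor_extra e USING (obs_id)""")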

Version history for some data sets

On occasion data must be 'groomed' prior to analysis. For example, a sensor may produce occasional spurious readings. These spurious readings are of great interest to some modes of analysis, but for other analyses a fresh version of the dataset should be produced in which the spurious readings are replaced by an interpolated (or otherwise derived) value. Analysis tools exist to calculate the new value, but there would be merit in the data warehouse acknowledging that one dataset is a new version of a prior one.

An illustration of what the user would see is at: https://gomockingbird.com/mockingbird/#nq36az7/zfbYrj

I expect we'd store the following:

  • the date
  • a comment describing the change
  • a link/reference to the previous document (if necessary)

Our search/browse functionality would normally have to support retrieving just the newest version, as in the sketch below.
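
A minimal sketch of those version records and the 'newest only' default, assuming a simple integer version counter (DatasetVersion and its fields are invented for illustration):

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class DatasetVersion:
        """One version of a dataset: the fields listed above, plus an id and counter."""
        dataset_id: str
        version: int
        created: date                      # the date
        comment: str                       # a comment describing the change
        previous_id: Optional[str] = None  # link/reference to the previous document, if any

    def newest_versions(versions):
        """Default search/browse behaviour: return only the latest version of each dataset."""
        latest = {}
        for v in versions:
            if v.dataset_id not in latest or v.version > latest[v.dataset_id].version:
                latest[v.dataset_id] = v
        return list(latest.values())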

There's also a chance that we may need to support a 'cull' of prior versions of datasets. Once they're 'n' months old we may ditch prior versions and just keep the final one. This is probably a client-side operation rather than a database one.

Dataset provenance

There are some situations where two datasets are used to produce a third. For instance, an accurate but intermittent dataset (A) may have the empty periods filled in using snippets of a less accurate dataset (B) to produce a composite dataset (C). We'd show the user that C came from A and B. When the user is looking at the info page for dataset B, we'd show them that it was used in the production of C.

On occasion, for some operations, we wouldn't create C; we'd just create a new version of A. But that won't always be the case. Have a look at the sample flows below:

Data maturing

I guess we'd store the following:

  • the date a dataset was used
  • a comment describing in what way it was used
  • the id and version number of the contributing dataset(s)
  • the id of the new dataset

It's understood that this data may not necessarily be stored all in one place.
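
A minimal sketch of such a provenance record, under the assumption that each use of a contributing dataset produces one entry (ProvenanceEntry and its field names are illustrative only):

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ProvenanceEntry:
        """One record of a dataset being used in the production of another."""
        used_on: date              # the date the dataset was used
        comment: str               # in what way it was used
        contributor_id: str        # id of the contributing dataset
        contributor_version: int   # ...and its version number
        result_id: str             # id of the new dataset

    # Producing C from A and B would yield one entry per contributor, e.g.:
    entries = [
        ProvenanceEntry(date(2012, 1, 1), "gaps filled using B", "A", 3, "C"),
        ProvenanceEntry(date(2012, 1, 1), "donated snippets to fill gaps in A", "B", 1, "C"),
    ]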

So, when we change a document, we include the names of the contributors in the new version that gets saved. But, beyond this, we wish to be able to see which documents were based on this one (not just which documents this one was produced from). To support this in the above model, we'd have to make an edit to all the supporting docs when a new one is produced. This is inefficient and wasteful: we don't need to push the whole doc over the pipe, plus it will show that doc as being modified when in truth its data has not been modified.

Reverse index

The answer to this conundrum is to maintain a reverse index of provenance. So, we'll have a view, indexed by contributor, that stores a list of edits and recipient documents. When we show a doc in detail, we'll find all the new docs that were produced from this one and display them in a list. Cool.
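
A minimal sketch of that reverse index, building on the hypothetical ProvenanceEntry records above; this is an in-memory illustration rather than an actual database view:

    from collections import defaultdict

    def reverse_index(entries):
        """Index provenance by contributor: for each source dataset, the docs derived from it.

        Nothing in the contributing documents themselves needs to be edited.
        """
        index = defaultdict(list)
        for e in entries:
            index[e.contributor_id].append((e.result_id, e.comment, e.used_on))
        return index

    # When showing dataset B in detail, list everything produced from it:
    #   reverse_index(entries)["B"]  ->  [("C", "donated snippets to fill gaps in A", ...)]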

Multi format files

On occasion a sensor will provide multiple streams of data in the same file. The importer process for this format will have to extract a number of different data formats from the file and put them into the database in their respective forms.
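
A minimal sketch of such an importer, assuming each embedded format has its own parser function (import_multistream_file, parsers and store are purely illustrative names):

    def import_multistream_file(path, parsers, store):
        """Split one multi-format file into its constituent streams and store each natively.

        `parsers` maps a format name to a function that extracts that format's
        records from the raw file contents; `store` writes one stream to the warehouse.
        """
        with open(path, "rb") as f:
            raw = f.read()
        for format_name, parse in parsers.items():
            records = parse(raw)            # pull out just this format's records
            store(format_name, records)     # store them in their respective form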
