Clone wiki

gnd / UnderlyingProblem

We have a requirement for a data warehouse. Here's why.

We employ around 20 analysts. Their job is to examine in detail the performance of vehicles and sensors, recorded in trials.

They receive data from trials, and analyse the data in home-made legacy analysis/replay tools. The different analysts work in different subject areas, so they require different tools. When they receive the data they use home-made routines to transform the data. Their files are spread across our data server. Files are getting lost. There is also some duplication of effort when several analysts want to use the same data recording.

We want to streamline this data flow. We want a member of clerical staff to handle the ETL, then store the data centrally in a data warehouse. Analysts will go to a web-site and request data in a format of their choice, they'll download that data and use it in their tools. We also want analysts to be able to have a web-based quick-look at where a vehicle travelled in a trial, plus possibly view a plot of a sensor attribute.

So, there appear to be two challenges:

  1. the structure of the data warehouse - we want it to store lots of different shapes of data, but ultimately have a single locations table for the web-based view
  2. de-skilling the ETL process. I've a feeling the staff-member could be taught simple mapping transformations in something like Talend or Kettle, and I could create 'jobs' for more complex operations.

On the periphery, analysts may start to want to get the warehouse doing some work, doing some stats or analysis itself instead of using their legacy tools. They may also want to do analysis that spans large sections of the data -something that wasn't possible before.