
EMPOWERING Analytical Modules / ETL - Standard Workflow


General description of the structure and the different types of subtasks

The module workflow is always driven by a general Python script called tasks.py, located in the root module folder. This script defines the module function and its subtask workflow, and the subtasks are executed synchronously.
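As an illustration, the tasks.py entry point can be pictured as a plain function that runs its subtasks one after another. This is a minimal sketch, not the platform's actual code: the real script also handles Celery registration and configuration, and every function name here is hypothetical.

```python
# Hypothetical sketch of a module's tasks.py: the module function runs its
# subtasks synchronously, in a fixed order, passing a shared context along.

def hive_subtask(context):
    context["log"].append("hive")      # e.g. run a HiveQL aggregation
    return context

def r_subtask(context):
    context["log"].append("r")         # e.g. call an R script on HDFS data
    return context

def python_subtask(context):
    context["log"].append("python")    # e.g. load results into a database

    return context

def module_task(params):
    """Entry point of the module, executed as a single task."""
    context = {"params": params, "log": []}
    for subtask in (hive_subtask, r_subtask, python_subtask):
        context = subtask(context)
    return context
```

The fixed tuple of subtasks makes the synchronous ordering explicit: no subtask starts before the previous one has returned.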

The following schema shows the general structure.

General structure of an ETL / module

As explained above, all the modules are Python functions. To launch the desired modules, these functions have to be added to the Celery task queue. Celery then executes the different subtasks depending on the queue order and on system availability.

The 'module developers documentation' contains a tutorial on how to create, configure and run modules in the platform. The input and output formats for all the developed modules are shown in the 'detailed input / output section'.

In general, all the subtasks are HIVE queries, R scripts or Python scripts. The next paragraphs briefly describe each of them.

HIVE queries

HIVE is used as a data warehouse to access and query, in a language similar to SQL (HiveQL), the raw or pretreated data stored in HBase or HDFS. The HIVE results are always saved as text files in HDFS. To reduce the input dimensionality of subsequent subtasks as much as possible, some basic calculations on the initial data are usually performed in these queries (e.g. aggregations, products, counts, averages).
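As a sketch of such a subtask, a daily-aggregation query could be built as a string in tasks.py and later submitted to HIVE. The table, column and path names below are invented for illustration; they are not the platform's real schema.

```python
# Hypothetical HiveQL subtask: aggregate raw measures into daily consumptions
# per contract, writing the result to a text directory in HDFS. All table,
# column and path names are illustrative only.

def build_daily_consumption_query(utility_id, output_dir):
    return (
        "INSERT OVERWRITE DIRECTORY '{out}' "
        "SELECT contract_id, to_date(ts) AS day, SUM(consumption) "
        "FROM raw_measures "
        "WHERE utility = '{utility}' "
        "GROUP BY contract_id, to_date(ts)"
    ).format(out=output_dir, utility=utility_id)

query = build_daily_consumption_query("u001", "/tmp/ot101/daily")
```

Performing the SUM and GROUP BY already in HIVE is what keeps the input of the later R or Python subtasks small.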

R scripts

This software is broadly used by data scientists to make sense of data with statistics or machine learning algorithms. In this platform, Rhipe and RHadoop provide a way to use Hadoop from R, which allows developers to use all the existing R functions in a big data environment. The input data for these scripts usually comes from text or sequence files stored in HDFS, and the output can also be stored in text or sequence files. This type of subtask is always an individual R script which is called from the general Python script.
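The call from tasks.py can be pictured as a plain subprocess invocation of Rscript. This is a hedged sketch: the helper, the script name and the arguments are all hypothetical, not the platform's actual API.

```python
import subprocess

# Hypothetical helper: build the command line used to launch an R subtask
# from tasks.py. The script name and arguments below are illustrative only.
def build_r_command(script_path, args):
    return ["Rscript", script_path] + [str(a) for a in args]

# In tasks.py, the subtask would then be executed synchronously, e.g.:
#   subprocess.check_call(build_r_command("monthly_average.R", ["2014-01"]))
```

Keeping the R logic in its own script and the orchestration in Python matches the pattern described above: one individual R script per subtask.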

Python scripts

Python is the main language of the platform, so all sorts of Python algorithms can be implemented. Some data mining libraries are installed in the system, such as NumPy, Pandas or SciPy, and the MRjob library is installed to provide a way to use Hadoop from Python. This type of subtask is also very useful for reading from and writing to MongoDB or HBase.
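To illustrate the kind of computation such a subtask performs, here is a map/reduce-style aggregation in plain Python. On the platform this logic would be expressed as an MRjob job running on Hadoop; the field names and sample records below are invented.

```python
from collections import defaultdict

# Illustrative map/reduce-style aggregation: sum the consumption per
# contract. On the platform this would be an MRjob job over HDFS data;
# the records here are invented sample data.
def total_consumption(records):
    mapped = ((r["contract"], r["consumption"]) for r in records)  # map step
    totals = defaultdict(float)
    for contract, value in mapped:                                 # reduce step
        totals[contract] += value
    return dict(totals)

sample = [
    {"contract": "c1", "consumption": 1.5},
    {"contract": "c1", "consumption": 2.0},
    {"contract": "c2", "consumption": 0.5},
]
totals = total_consumption(sample)
```

The map step emits (key, value) pairs and the reduce step sums per key, which is exactly the shape a Hadoop job would take.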

Workflow example of an analytical module

The OT101 module is a comparative module between similar users. Essentially, the subtasks are:

  1. HIVE query subtask: Read and calculate the daily consumptions for all the contracts of the utility in the desired calculation period.
  2. HIVE query subtask: Count the number of daily measures, sum the total consumption and load the needed criteria for each contract and month.
  3. R script subtask: Calculate the monthly average consumption for each month and group criteria.
  4. R script subtask: Generate the final results by joining each contract's consumption with its associated average consumption.
  5. Python script subtask: Load the results into the short term database.
  6. Python script subtask: Load the results into the long term database.
  7. Python script subtask: Remove all temporary data (auxiliary HIVE tables, HDFS files, old results in the short term database).
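The steps above can be sketched in tasks.py as a fixed sequence of subtask calls with the cleanup step last. Every name here is hypothetical, and running cleanup in a finally block is a design choice of this sketch (it also cleans up after a failed subtask), not necessarily the module's actual behaviour.

```python
# Hypothetical OT101 workflow runner: execute the analytical subtasks in
# order, then always run the cleanup subtask. All names are illustrative.
def ot101_task(params, subtasks, cleanup):
    executed = []
    try:
        for subtask in subtasks:
            subtask(params)
            executed.append(subtask.__name__)
    finally:
        cleanup(params)          # drop HIVE tables, HDFS files, old results
        executed.append(cleanup.__name__)
    return executed
```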

Workflow example of an ETL module

The measures ETL is the module that loads the measures from the short term database into the long term one. Essentially, the subtasks are:

  1. Python script subtask: Execute all the measure and result delete orders. This process removes from the long term database all the measures and results that the utility no longer wants.
  2. An iterative process until there are no unloaded measures left in the short term database:
    • Python script subtask: Generate a buffer of measures from the short term database (normally 1 million measures).
    • Python script subtask: Load the measures from the buffer into HBase.
  3. Python script subtask: Delete the already loaded measures from the short term database.
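The iterative process in step 2 can be sketched as a simple loop: keep pulling fixed-size buffers of measures and loading them until none remain. The buffer size and all function names are assumptions for illustration, not the ETL's actual API.

```python
# Hypothetical sketch of the buffered loading loop of the measures ETL.
# fetch_buffer and load_to_hbase stand in for the real database accessors.
BUFFER_SIZE = 1000000  # "normally 1 million measures" per buffer

def load_all_measures(fetch_buffer, load_to_hbase):
    loaded = 0
    while True:
        buffer = fetch_buffer(BUFFER_SIZE)   # read from short term database
        if not buffer:
            break                            # no unloaded measures remain
        load_to_hbase(buffer)                # write the batch to HBase
        loaded += len(buffer)
    return loaded
```

Loading in bounded buffers keeps memory usage constant regardless of how many measures are pending in the short term database.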
