kafkadb / Home

Introduction

KafkaDB is a Kettle-based framework for migrating data between databases with different structures (i.e. different versions of the same application). It has been specifically designed to make transforms highly reusable, which makes it ideal for module/plugin-based applications such as OpenERP.

The framework uses Kettle and simulates module inheritance similar to what OpenERP, Tryton and other frameworks offer, making Kettle transformations reusable and easy to apply depending on the structure of the target database. It has also been designed with the community in mind: its modularity allows developers to easily share the transforms they create.

The basic idea is that users create one transformation per table and module. For example, if module X adds field A to a table and module Y adds field B to the same table, two different transforms must be created, with the advantage that neither needs to know the other exists. KafkaDB will then put them together dynamically depending on the structure of the target database. In the following schema you can see how KafkaDB appends your transforms so that they execute as if they had been designed in the same Kettle job.
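For instance, with two hypothetical modules that both extend the same table (the module and file names below are made up for illustration), the transformation repository would simply contain two independent files:

module_x/account_invoice.ktr
module_y/account_invoice.ktr

KafkaDB then combines whichever of them apply to the structure of the target database.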

Functional description

To make migration easier (by avoiding a large dependency graph on transform execution), KafkaDB converts all constraints to deferred before data loading and restores them once the process has finished.
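As a rough illustration of what that step involves, the sketch below marks every foreign key constraint of a PostgreSQL target database as deferrable. It assumes psycopg2 and is not KafkaDB's actual implementation; the function and variable names are illustrative only.

import psycopg2

def set_constraints_deferrable(dsn):
    """Mark every foreign key constraint of the target database as DEFERRABLE."""
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        # contype = 'f' selects foreign key constraints in pg_constraint
        cur.execute("SELECT conrelid::regclass, conname "
                    "FROM pg_constraint "
                    "WHERE contype = 'f' AND NOT condeferrable")
        for table, constraint in cur.fetchall():
            cur.execute('ALTER TABLE %s ALTER CONSTRAINT "%s" '
                        'DEFERRABLE INITIALLY DEFERRED' % (table, constraint))
    conn.close()

# Inside the loading transaction, "SET CONSTRAINTS ALL DEFERRED" then postpones
# the checks until COMMIT, so rows can be inserted in any order.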

In the end, this allows transformation reuse and one can easily migrate data between tens of databases whose structures depend on the modules installed in each case. One database will contain module X, another one module Y, another may contain both or another may not contain any of them. It doesn't matter. KafkaDB helps you in this complex job and allows you to share your work.

The migration works in three steps, which are explained in depth in the sections below:

  • Generation of the configuration file
  • Data extraction, transformation and dump
  • Data loading

Configuration File

KafkaDB uses a configuration file in ConfigParser (INI) format. It is created by kafkadb.py, which fills it with all the tables of both the source and target databases and assigns them the Kettle transformations found in the repository.

Here's an example file:

[party_contact_mechanism]                                                      
source = res_partner_address                                                   
depends =                                                                      
target = None                                                                  
migrate = True                                                                 
delete = True                                                                  
mapping =                                                                      
transformation = party_contact_mechanism.ktr                                   
                                                                               
[party_party]                                                                  
source = res_partner                                                           
depends =                                                                      
target = None                                                                  
migrate = True                                                                 
delete = True                                                                  
mapping =                                                                      
transformation = party_party.ktr                                               
                                                                               
[party_party-party_category]                                                   
source = res_partner_category_rel                                              
depends =                                                                      
target = party_category_rel                                                    
migrate = True                                                                 
delete = True                                                                  
mapping =                                                                      
transformation = party_party-party_category.ktr                                                                              

Table Configuration

The following are the attributes the application accepts for each table entry:

  • source [optional]: table name in the source database.
  • target [optional]: table name in the target database.
  • depends [optional]: list of tables that must be migrated before this one. Normally used when this transformation needs IDs that are mapped by its dependencies.
  • migrate [required]: set to True if the table must be migrated. Used to disable transformations that are in progress or invalid.
  • insert [required]: if True, this transformation is inserted before the existing ones for the same table, so inheritance acts as insert instead of append.
  • delete [optional]: set to True if the target table data must be deleted before loading data.
  • mapping [required]: table that stores the mapping of this table's IDs, with source and target fields. Normally the transformations copy the ID from source to target to avoid mappings, but if the target already contains data it is necessary to generate a mapping from source IDs to target IDs.
  • transformation [required]: list of Kettle transformations to execute, if any data transformation must be done.
  • end_script [optional]: script (relative to the transformation directory) that will be executed after the transformations.
  • parent [optional]: field name in the target table that acts as parent (in a parent-child structure). It is used to compute the parent_left and parent_right table fields.
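A hypothetical entry combining several of these attributes could look like the following (the table, field and file names are made up for illustration and are not taken from a real migration):

[product_category]
source = product_category
target = product_category
depends = party_party
migrate = True
insert = False
delete = True
mapping = product_category_mapping
transformation = product_category.ktr
parent = parent
end_script = product_category_post.sql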

Searching for Kettle transformations

In order to let kafkadb.py find a transformation you must follow some simple rules:

  • The transformation filename must be the name of the target table.
  • The transformation file must be located under a directory with the same name as the module it applies to. For example, as the payment type information is added by the account_payment_extension module on the stock_picking table, the path for that file is: account_payment_extension/stock_picking.ktr.

kafkadb.py will search for all transformations whose filename matches an existing table name in the target database and will discard (NOT IMPLEMENTED YET) modules which are not installed.
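A minimal sketch of this lookup rule is shown below. It is not the actual kafkadb.py code; the function and variable names are illustrative, and the installed-modules filter is included even though it is not implemented yet.

import os

def find_transformations(repo_dir, target_tables, installed_modules=None):
    """Map each target table to the .ktr files found as <module>/<table>.ktr."""
    found = {}
    for module in sorted(os.listdir(repo_dir)):
        module_dir = os.path.join(repo_dir, module)
        if not os.path.isdir(module_dir):
            continue
        # Not implemented yet in kafkadb.py: skip modules that are not installed
        if installed_modules is not None and module not in installed_modules:
            continue
        for filename in os.listdir(module_dir):
            table, ext = os.path.splitext(filename)
            if ext == '.ktr' and table in target_tables:
                found.setdefault(table, []).append(os.path.join(module_dir, filename))
    return found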

Data extraction, transformation and dump

For each entry in the configuration file, the Java application dynamically creates and executes a Kettle transformation that extracts data from the source database and prepares it for loading into the target database.

If any transforms were configured for the given entry in the configuration file, those user-defined transforms will be appended just after the source node in the dynamically created transform, as shown in the first schema.

User-defined transforms

In order to create a new transformation some rules must be followed:

  • The path of the transform must look like: module/target_tablename.ktr
  • Transform steps must be placed between two dummy steps: source and target.
  • If another transformation needs to be executed before the one being implemented, a dummy step with the name depends-tablename must be added.
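Since .ktr files are plain Kettle XML, the depends-tablename convention can also be read back mechanically. The sketch below is only an illustration of that idea, assuming the usual step/name layout of a .ktr file; it is not part of KafkaDB.

import xml.etree.ElementTree as ET

def transform_dependencies(ktr_path):
    """Return the table names referenced by depends-<tablename> dummy steps."""
    root = ET.parse(ktr_path).getroot()
    deps = []
    for step in root.iter('step'):
        name = step.findtext('name', default='')
        if name.startswith('depends-'):
            deps.append(name[len('depends-'):])
    return deps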

Examples

Here are some examples of user-defined transforms that KafkaDB will integrate into its dynamically created ones.

Simple Kettle transformation:

Simple Transformation

Kettle transformation with dependencies:

Example with dependencies

Execution

To execute data extraction, transformation and dump, type:

$ ant run 

and take a look at /tmp. The process should have created:

  • SQL script file to load all data into the target database: /tmp/out.sql
  • Directory containing all data files: /tmp/output
  • All Kettle transformations: /tmp/source-target.ktr

Kettle transformation executed to obtain data:

resulting.png

Data Loading

The last step is loading data into the database. As KafkaDB makes use of deferrable constraints, all data loading must be handled in the same transaction, and due to Kettle limitations this forces KafkaDB to generate CSV files instead of uploading directly to the database. This step is handled separately:

python kafkadb.py --set-deferred=true
psql -f /tmp/output.sql target_database
python kafkadb.py --set-undeferred=true

Utilities

All utilities are centralized in the kafkadb.py application, which also incorporates utilities for transformation development.

  • ./kafka --migrate-module=TARGET_MODULE: updates the TARGET_MODULE/TARGET_MODULE.cfg file in the transformations path
  • ./kafka --make-config: generates the *migration.cfg* file
  • ./kafka --migrate: executes the migration (data extraction, transformation and dump)

kafkadb help output:

Usage: kafkadb.py [options]

Options:
  -h, --help            show this help message and exit
  --get-model=MODEL     Returns Model Name given a table
  --get-model-field=FIELD
                        Returns Model given a Field and Table
  --gen-file=MIGRATE    Generate Migration File
  --set-deferred=DEFERRED
                        Makes target database constraints deferrable
  --set-undeferred=UNDEFERRED
                        Makes target database constraints not deferrable

NOTES

  • This project is in alpha state
  • --get-model functionality is only implemented for OpenERP databases
  • To improve performance, some steps (typically database queries) can be run with several processes: right-click the step in the transformation and select *Change number of copies to start*.
    • *Warning!* In the previous step, make sure *Data movement...* is set to *Distribute* (if it is set to _copy_, values will be duplicated!)
  • When there is a mapping, the *id* field is renamed at the beginning to another field name (for example, openerp_id) so that it does not collide. This must be taken into account throughout the transformation and in the transformations that extend it.
  • Use the *select values* step before the *target* step to rename fields, but do it in the _Meta-data_ tab, because if you do it in _Select & Alter_ inheritance will not work.
    • *IMPORTANT!* The _Type_ column must be filled in: when run from Spoon (Kettle) it does not fail, but when run from the script it does.

FAQ

  • To concatenate two steps where the second one has multiple threads, you must ensure that the data is *distributed* (instead of *copied*).
  • In the *select* step at the end of a transformation, you must configure the *Metadata* tab (not the *Select* tab), because it is the one compatible with inheritance.
