currently in development, not yet ready for prime time

avro_from_delimited

For use outside Hadoop, or even inside it. Allows very rapid development of transformations from delimited data to a more generic form of structured data: Avro.

In one sense, this library attempts to cover the widest array of generic use cases with the smallest possible amount of code. Used in tandem with an integration server such as Apache Camel (with or without Apache ServiceMix), it lets the developer easily turn the most generic data format into this generic structured data format.

Getting Started

Index of Use Cases

Generic Data?

Can all data become generic? Instead of selecting specific data for ETL into Hadoop or the like, can all data become big-able?

Is it possible, or even advantageous, to treat all data as variants of the same serialization format, handled with the same tools? Avro enables such a move, whether or not one considers it advantageous.

Bringing your delimited data quickly into this serialization format allows you to test this yourself. Although Avro was originally designed for Hadoop, it produces JSON records or pure byte[] arrays that can be stored in any NoSQL or SQL persistence engine.
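The idea can be illustrated without the library itself. What follows is a minimal sketch in Python using only the standard library. It is not the avro_from_delimited API: the schema, the field names, and the functions `rows_to_records` and `record_to_bytes` are all hypothetical. The schema is written in Avro's JSON schema notation, and UTF-8 JSON bytes stand in for Avro's binary encoding, which a real conversion would get from an Avro library.

```python
import csv
import io
import json

# Hypothetical schema, written in Avro's JSON schema notation.
SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "score", "type": "double"},
    ],
}

# Converters for the few Avro primitive types this sketch handles.
CASTS = {"int": int, "long": int, "float": float, "double": float, "string": str}


def rows_to_records(text, schema, delimiter=","):
    """Yield one dict per delimited row, with fields named and typed by the schema."""
    fields = schema["fields"]
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        if len(row) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {len(row)}: {row!r}")
        yield {f["name"]: CASTS[f["type"]](value) for f, value in zip(fields, row)}


def record_to_bytes(record):
    """Serialize a record as UTF-8 JSON bytes (a stand-in for Avro binary encoding)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Two pipe-delimited rows become two named, typed records and two byte payloads.
records = list(rows_to_records("1|Ada|9.5\n2|Grace|8.75\n", SCHEMA, delimiter="|"))
payloads = [record_to_bytes(r) for r in records]
```

Each payload is an opaque byte string, so it could be written to a key-value store, a BLOB column, or a file without the storage layer knowing anything about the original delimited layout.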

Go ahead and try:

$ git clone https://datafundamentals@bitbucket.org/datafundamentals/avro_from_delimited.git

This API was originally designed for interaction with machines, not humans. You may interact directly with this API through a browser UI instead of

The API can quickly be adapted to a more developer-friendly interface, should you care to fork the code or add another API as a wrapper.

Speed

Very large files may start out in development on this system and move to more capable systems, such as parallelized Oozie workflows, later in the game. One-hour conversions might be acceptable for multi-gigabyte, very wide files initially, but you might want to speed them up later.

When that time comes, you can fork this code and add such features, supplant it with your own, or throw it away entirely and start from scratch. But you won't have to, not until the rest of your operation is running smoothly.

This follows the familiar pattern of avoiding premature optimization. avro_from_delimited may work acceptably with a lot more files than you might initially guess.
