
avro-tools

A set of tools for working with Avro serialized files.

avro-records.py

Usage:

 avro-records.py [-h] [-n N] {count,keys,sample,cat_keys,cat} ...

positional arguments:

 {count,keys,sample,cat_keys,cat}
   count               Count the number of records in the file.
   keys                List the available keys in the file.
   cat_keys            Print the contents of the values for certain keys of
                       the record.
   cat_sample          Get a sample of the records of the file and print the
                       records on the standard output.
   cat                 Concatenates files and stores the result in an output
                       file. It assumes that all the input files have the same
                       schema.
   sample              Gets a random sample of the records of the input files
                       and writes them into another Avro file. It assumes that
                       all the input files have the same schema.
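
As a rough illustration of what sampling a fraction of the records can look like, here is a minimal sketch that keeps each record independently with a given probability. It is not the script's actual code: the function name `sample_records`, the `seed` parameter and the per-record probability strategy are all assumptions, and plain dicts stand in for Avro records so the example runs with the standard library alone.

```python
import random
from itertools import chain


def sample_records(record_streams, fraction, seed=None):
    """Yield each record with probability `fraction` (hypothetical helper).

    `record_streams` is a list of iterables of records. They are read as one
    stream, which only makes sense if every input shares the same schema.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for record in chain.from_iterable(record_streams):
        # rng.random() is in [0, 1), so 0.0 keeps nothing and 1.0 keeps everything.
        if rng.random() < fraction:
            yield record


if __name__ == "__main__":
    # Made-up records standing in for the contents of two Avro files.
    file1 = [{"id": i} for i in range(0, 50)]
    file2 = [{"id": i} for i in range(50, 100)]
    picked = list(sample_records([file1, file2], 0.25, seed=1))
    print(len(picked), "of 100 records kept")
```

With this approach the size of the sample is only approximately `fraction` times the number of input records, and the relative order of the records is preserved.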

Commands

  1. count - Counts the number of records in a file:

    $ python avro-records.py count file1.avro
    42
    
  2. keys - Gets the name of the record keys:

    $ python avro-records.py keys file1.avro
    id,uri,content_type,size
    
  3. cat_keys - Prints the value of some fields:

    $ python avro-records.py cat_keys id,content_type file1.avro file2.avro
    22,text/html
    24,text/html
    25,text/plain
    
  4. cat_sample - Prints a sample of the records in the file:

    $ python avro-records.py cat_sample 42 file.avro
    (output)
    
  5. cat - Concatenate several files into one:

    $ python avro-records.py cat output.avro input1.avro input2.avro
    
  6. sample - Creates a file with a sample of the records from the input files:

    $ python avro-records.py sample 0.25 output.avro input1.avro input2.avro
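
The read-only commands above (count, keys, cat_keys) can be pictured as simple passes over a stream of records. The sketch below is an assumption about their general shape, not the script's implementation: the helper names are hypothetical, records are modelled as plain dicts, and the `uri` and `size` values are invented for the example.

```python
def count_records(records):
    """Count the records in an iterable without loading them all in memory."""
    return sum(1 for _ in records)


def record_keys(records):
    """Return the field names of the first record, or [] if there are none."""
    for record in records:
        return list(record.keys())
    return []


def cat_keys(records, keys):
    """Yield one comma-separated line per record with the selected fields."""
    for record in records:
        yield ",".join(str(record[key]) for key in keys)


if __name__ == "__main__":
    # With the Avro library, iterating over
    # DataFileReader(open(path, "rb"), DatumReader()) yields dicts like these.
    records = [
        {"id": 22, "uri": "http://example.com/a", "content_type": "text/html", "size": 10},
        {"id": 24, "uri": "http://example.com/b", "content_type": "text/html", "size": 20},
        {"id": 25, "uri": "http://example.com/c", "content_type": "text/plain", "size": 30},
    ]
    print(count_records(records))
    print(",".join(record_keys(records)))
    for line in cat_keys(records, ["id", "content_type"]):
        print(line)
```

A real tool could also read the field names from the file's schema instead of the first record. This sketch takes the simpler route to stay self-contained.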
    

The codec for the output file can be specified with the --out-codec argument:

    $ python avro-records.py cat --out-codec=deflate output.avro input1.avro input2.avro

    $ python avro-records.py sample --out-codec=deflate 0.10 output.avro input1.avro
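
For context on what the deflate codec does: the Avro specification defines it as compressing each data block with raw DEFLATE (RFC 1951), without the zlib header or checksum. The sketch below shows that round trip using only the standard library. The function names are hypothetical and this is not the script's code. With the Avro Python library, the codec is normally chosen through the `codec` argument of `DataFileWriter`.

```python
import zlib


def deflate_block(data, level=6):
    """Compress bytes as a raw DEFLATE stream (negative wbits = no zlib header)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_block(data):
    """Decompress a raw DEFLATE stream produced by deflate_block."""
    return zlib.decompress(data, -15)


if __name__ == "__main__":
    # Repetitive made-up payload, similar to many records sharing field values.
    block = b"text/html," * 1000
    packed = deflate_block(block)
    print(len(block), "bytes before,", len(packed), "bytes after")
    assert inflate_block(packed) == block
```

Compression pays off most on files whose records repeat the same values, at the cost of some extra CPU time when reading and writing.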

Requirements

The script is written in Python and requires the Avro library and argparse (included in the standard library since Python 2.7).

To install the dependencies, optionally inside a virtualenv, use pip with the provided requirements.txt file:

pip install -r requirements.txt

License

This program is distributed under the Apache License 2.0. See LICENSE for more details.

Authors

Juan Manuel Caicedo Carvajal (http://cavorite.com)

Project website (latest release, issues, etc):

https://bitbucket.org/cavorite/avro-tools