
HDF5 File Structure Studies

This wiki describes some studies of the use of HDF5 to create event-data files for HEP. In these studies, we are evaluating a variety of choices for the organization of run, subrun, and event data in HDF5 files. Different choices are made in the various "experiments".

We are writing a draft "usage context" document (not yet complete) describing how these data are typically accessed.

We use standard HDF5 terminology for many items. Thus group means HDF5 group, file means HDF5 file, etc.

List of experiments

Note that some of the experiments here may not yet be written!

In all cases, a C++ data product is translated into an HDF5 group. Such a group will contain one or more HDF5 datasets. In all cases, a file contains output from a single "process". A file is never updated, and we don't copy data products forward from input files to output files.

Experiment 001

Description

This is the organization that most directly models the run/subrun/event structure of HEP data.

When run in MPI mode, each rank is responsible for processing a subset of all events. All the data for a given event are written by a single rank. The current implementation does not separate the generation of data into distinct ranks.

File structure

In this version, a single file contains a top-level group for each run. The run group contains:

  • groups that represent run-level data products
  • groups that represent subruns

A subrun group contains:

  • groups that represent subrun-level data products
  • groups that represent events

An event group contains:

  • groups that represent the output of specific reconstruction modules.

Each module group contains datasets that represent event-level data products. In a more complex example, some of the objects stored in a module group would themselves be groups containing several datasets, representing complex event data products.

A top-level group could be used to contain metadata for the process, including an attribute for the process name. Program configuration data could be stored in such a group as well. This level of detail is not covered in this experiment.

File level (i.e. Results) products could be stored in another top-level group. This detail is also not covered in this experiment.
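The layout described above might be sketched as follows. The specific names (Run_1, recoModuleA, hits, etc.) are placeholders for illustration, not prescribed by the experiment:

```
/Run_1/                         (group: one per run)
    runProductX/                (group: run-level data product)
    SubRun_1/                   (group: one per subrun)
        subrunProductY/         (group: subrun-level data product)
        Event_1/                (group: one per event)
            recoModuleA/        (group: output of one reconstruction module)
                hits            (dataset: event-level data product)
```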

Experiment 002

Description

Dividing events among the available ranks (for parallel operation), write extensible datasets to a single HDF5 file, storing extent information in an external DB.

File structure

  1. An HDF-format file containing:

    • A scalar dataset recording the process name.
    • An extensible dataset for each unique data product type / module label / instance name combination.
  2. An SQLite DB file containing tables:

    • DP_INFO
      • ID INTEGER PRIMARY KEY
      • Name TEXT
      • ModuleLabel TEXT
      • InstanceName TEXT
    • EV_INFO
      • ID INTEGER PRIMARY KEY
      • Run INTEGER
      • SubRun INTEGER
      • Event INTEGER
    • EXTENTS
      • ID
      • DP_ID NOT NULL REFERENCES DP_INFO (ID)
      • EV_ID NOT NULL REFERENCES EV_INFO (ID)
      • Begin NOT NULL
      • End NOT NULL
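A minimal sketch of this schema using Python's built-in sqlite3 module. Column types for EXTENTS (INTEGER) and the sample rows are assumptions for illustration; Begin and End are quoted because they are SQL keywords:

```python
import sqlite3

# In-memory database standing in for the SQLite DB file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DP_INFO (
    ID INTEGER PRIMARY KEY,
    Name TEXT,
    ModuleLabel TEXT,
    InstanceName TEXT
);
CREATE TABLE EV_INFO (
    ID INTEGER PRIMARY KEY,
    Run INTEGER,
    SubRun INTEGER,
    Event INTEGER
);
CREATE TABLE EXTENTS (
    ID INTEGER PRIMARY KEY,
    DP_ID INTEGER NOT NULL REFERENCES DP_INFO (ID),
    EV_ID INTEGER NOT NULL REFERENCES EV_INFO (ID),
    "Begin" INTEGER NOT NULL,
    "End" INTEGER NOT NULL
);
""")

# Invented sample rows: one data product, one event, one extent.
conn.execute("INSERT INTO DP_INFO VALUES (1, 'Hit', 'hitmaker', '')")
conn.execute("INSERT INTO EV_INFO VALUES (1, 10, 2, 345)")
conn.execute("INSERT INTO EXTENTS VALUES (1, 1, 1, 0, 128)")

# Find the slice of the extensible dataset holding this event's hits.
row = conn.execute("""
    SELECT x."Begin", x."End" FROM EXTENTS x
    JOIN DP_INFO d ON x.DP_ID = d.ID
    JOIN EV_INFO e ON x.EV_ID = e.ID
    WHERE d.Name = 'Hit' AND d.ModuleLabel = 'hitmaker'
      AND e.Run = 10 AND e.SubRun = 2 AND e.Event = 345
""").fetchone()
print(row)  # (0, 128)
```

The extent (Begin, End) selects the hyperslab of the extensible dataset belonging to one event, so reading an event requires one DB query followed by one partial dataset read.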

Evaluation, notes and questions

  • Q What are the rules on extending datasets in a parallel context?
    A Extending a dataset is a collective operation, i.e., every rank must extend every dataset, not just the ones it writes. This experiment, while working quite well in serial mode, is therefore non-functional under MPI / parallel writing.

  • Q Is there a way to encapsulate the metadata in the HDF5 file? In parallel in any way?
    A Not answered by this exercise, but it appears that "region references" may be stored as metadata associated with the dataset. There may be scaling issues associated with this solution, however.

  • Relatively easy to extend to subrun-level products, run-level products and "file-level" results by leaving Event, SubRun and/or Run unset in EV_INFO.

Experiment 003

Description

Writing datasets to separate files, each from a single rank. Extent information is written either as metadata containing region references associated with each dataset, or as a standalone dataset in the matching file.

File structure

  • One file per unique data product type / module label / instance name combination.

  • Record extent information as dense attributes of the data product by event, subrun and run, or as a dataset of { r, sr, e, ref } structures.
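The { r, sr, e, ref } records can be modeled in plain Python as follows. The field values are invented, and in the HDF5 file "ref" would be a region reference rather than a (begin, end) pair; this sketch only shows the lookup idea:

```python
# Each record locates one event's slice of a data-product dataset.
# Fields: run, subrun, event, and the referenced extent (begin, end).
extent_records = [
    (10, 2, 344, (0, 100)),
    (10, 2, 345, (100, 228)),
    (10, 3, 1,   (228, 300)),
]

# Build an index from (run, subrun, event) to the extent.
index = {(r, sr, e): ref for (r, sr, e, ref) in extent_records}

begin, end = index[(10, 2, 345)]
print(begin, end)  # 100 228
```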

Evaluation, notes and questions

  • As a further experiment (004, say), one could reduce the duplication of event data (i.e., normalize) by using MPI to coordinate the writing of run / subrun / event information to a master file, and by storing the extent information in each dataset using only an entry number. This would reduce the total amount of metadata across a fileset by approximately a factor of 3.

  • A likely bug was discovered in h5py, matching an existing issue in the h5py bug tracker: "dense" attributes, intended to be high-speed, are currently unusable, and non-dense attributes are extremely slow.

Experiment 004

Description

This experiment requires MPI.

A sequence is a series of producer modules that work together to do some task (e.g. calibration of hits, or tracking).

  • One set of N MPI generator ranks for each sequence being simulated. Each creates a configurable number of data products of several types. The number of sequences simulated is configurable. The number of datatypes in each sequence is configurable.

  • One MPI writer rank for each sequence being simulated. Each writer rank writes a single data file.

Each sequence should be approximately the same size and complexity, to help keep the work reasonably balanced.

The diagram below shows an example of how the MPI ranks are organized. The reconstruction ranks are generating data. Each reconstruction rank is assigned a task, and each rank performing the same task is creating the same kinds of data products. The output ranks are writing data to files. Each output rank is handling some specific set of data product types, a subset of those associated with a given task, and receiving data from all reconstruction ranks running that task. Each output rank is writing to an HDF file shared with no other rank. In the example diagram, task A modules are making product 1 and product 2; one output rank is writing product 1 and another is writing product 2. The diagram also includes task Z ranks, producing product 3 through product n, and an output rank writing each of these products.

ex004.png

File structure

Each file is written to by a single rank, and so MPI-IO is not needed. Each file contains one or more datasets representing data products, as well as datasets that contain the indexing information, using region references.

Evaluation, notes and questions

  • On a mid-2012 Mac OS X machine with 8 ranks in use:
    • With NSEQ=3, (3 writers, 5 generators shared 2-2-1), real/user/sys is reported as 7.84/28.35/18.46.
    • With NSEQ=4, (4 writers, 4 generators shared 1-1-1-1), real/user/sys is reported as 7.82/36.74/12.95.
    • With NSEQ=2, (2 writers, 6 generators shared 3-3), real/user/sys is reported as 7.70/20.49/21.76.
  • Extent information for each sequence is written to the file containing the data for that sequence.
  • In order to avoid problems with screen output races, the screen output from only one sequence is written to a file and compared for the purposes of the test.
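The generator splits quoted above (2-2-1, 1-1-1-1, 3-3) follow from dividing the non-writer ranks as evenly as possible among the sequences. A sketch of that assignment; the function name is ours, not taken from the code under study:

```python
def split_generators(nranks, nseq):
    """One writer rank per sequence; the remaining ranks are generators,
    divided as evenly as possible among the sequences."""
    ngen = nranks - nseq
    base, extra = divmod(ngen, nseq)
    # The first `extra` sequences each get one additional generator.
    return [base + 1 if i < extra else base for i in range(nseq)]

print(split_generators(8, 3))  # [2, 2, 1]
print(split_generators(8, 4))  # [1, 1, 1, 1]
print(split_generators(8, 2))  # [3, 3]
```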

Experiment 005

Description

This experiment requires MPI. The work assignment for ranks is the same as in experiment 004.

File structure

A single file is written by the whole program, using MPI-IO. A top-level group exists for each module; within these groups are datasets for the data products, and datasets representing the index information as region references.
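The single-file layout might be sketched as follows; the module and dataset names are placeholders:

```
output.h5                    (one file, written collectively via MPI-IO)
    /moduleA/product         (dataset: data product)
    /moduleA/extents         (dataset: index information, region references)
    /moduleB/product
    /moduleB/extents
```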

Experiment 006

Description

Per experiment 004, with the following refinement:

  • Extent data are collected by a separate writer; the file it writes also contains a link to every dataset.

As with experiment 004, no use of MPI-IO is needed.

File structure

  • N writers:
    • N-1 writers will write data files containing data from one sequence.
    • One writer will write a "master" file containing extent metadata for referencing data in all runs, subruns and events in addition to external links to data products in other files, grouped by sequence.
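A sketch of the resulting fileset; the file, group and dataset names are placeholders:

```
master.h5                    (written by the master writer)
    /seq_A/extents           (dataset: run/subrun/event extent metadata)
    /seq_A/product_1  -> external link to seq_A.h5:/product_1
    /seq_A/product_2  -> external link to seq_A.h5:/product_2
seq_A.h5                     (written by one non-master writer)
    /product_1               (dataset)
    /product_2               (dataset)
```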

Evaluation, notes and questions

  • Q Is it possible to send an existing region reference to the master file writer and have it correctly refer to the corresponding external link to the dataset in the master file?
    A No. Extent data must be sent as start / end points to the master writer.

  • A problem has been discovered in the writing of general lookup data for the data products: in h5py it appears to be impossible to obtain a reference to a dataset which is an external link.

Experiment 007

Description

Per experiment 006, with the following refinement:

  • The ability to write more than one sequence to a file. (Multiple reconstruction ranks sending to one output rank.)
  • The ability to spread a sequence across more than one file. (One reconstruction rank sending to more than one output rank.)

File structure

Per experiment 006 except that each non-master writer may write data from one or more sequences, and data for one sequence may be spread over more than one file.
