Consolidate data formats to simplify postprocessing

Issue #2543 open
Wolfgang Kastaun created an issue

Currently, writing postprocessing tools for ET is unnecessarily difficult because the required information needs to be collected from many locations, has to accommodate competing standards, and sometimes requires guesses based on heuristics. Below is a list of improvements from the postprocessing viewpoint; it is not complete and can be augmented over time.

  1. Table of contents for grid variable output. Each output folder should contain a machine-readable file that keeps track of all files containing grid data, with a list of variables and the available timesteps for each variable (a sketch of such a file follows after this list). Of course this should distinguish between 1D, 2D and 3D output. Currently, one has to open all files and parse their contents for metadata, which can be very slow with HDF5. The issue is especially problematic when using one file per group.
  2. The same for reduction output.
  3. A machine-readable file with all parameters and their values, including those not set in the parfile and left at their defaults. The values should be the actual resolved values; postprocessing code should not have to emulate the homemade programming language that parfiles have become. Each restart folder should contain one such file with a standard name and location.
  4. The reductions thorn should also output enough information to convert norm1/average into volume integrals, i.e., a scalar x such that volume integral = x * average.
  5. Unique extensions. There should be one and only one extension for each type of file, across all standard thorns that produce output. In particular, just adding ‘.h5’ is not enough. For example, 3D data currently has the extension xyz.h5 or just .h5, and multipole data can have the extension .h5 as well.
  6. Simfactory should also provide machine-readable metadata about restarts and simulation folders. It should be possible to easily obtain a tree-like structure of the various restarts, complete with iteration ranges.
  7. One standard format for timeseries. Currently reductions, 0D output, multipoles, and AHorizonFinder all have their own formats; multipoles even has two. Timeseries files should also contain metadata with the range of available iterations and times.
  8. Settle on one format for each type of data and deprecate the rest, unless several formats are really needed, e.g. for performance reasons. A prime example is 2D ASCII output, which is inferior to HDF5 in every way. If deprecation is not possible, tools that convert all competing formats into a canonical format after the simulation might help. Another duplication of effort is caused by the one-file-per-group/one-file-per-variable duality.
  9. There should be support for adding arbitrary metadata to simulations. For example, initial data thorns could add model properties such as BNS masses, spins, and separations. There should be an API through which code can add metadata, and a standard location where all metadata is collected in a machine- and human-readable format such as JSON. This should work on the simulation level, such that new metadata can be added during restarts, but existing metadata is immutable. Each thorn should have its own metadata namespace.
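
For concreteness, here is a minimal sketch of what a combined table of contents (item 1) and parameter dump (item 3) could look like, written and read back as JSON. All file names, keys, and values below are hypothetical; they only illustrate the kind of machine-readable index being asked for.

```python
# Hypothetical layout: which variables exist, in which files, at which
# iterations, plus a flat dump of all parameters including defaults.
import json

toc = {
    "format_name": "example-toc",      # identifiers assumed for illustration
    "format_version": 1,
    "grid_output": {
        "3D": {
            "hydrobase::rho": {
                "files": ["rho.xyz.h5"],
                "iterations": [0, 256, 512],
            },
        },
        "2D": {
            "hydrobase::rho": {
                "files": ["rho.xy.h5"],
                "iterations": [0, 64, 128, 192, 256],
            },
        },
    },
    "parameters": {                    # actual resolved values, defaults included
        "cactus::cctk_itlast": 1024,
        "carpet::max_refinement_levels": 7,
    },
}

with open("toc.json", "w") as f:
    json.dump(toc, f, indent=2)

# A postprocessing tool then needs to open exactly one file to learn what
# data is available, instead of scanning every HDF5 file:
with open("toc.json") as f:
    index = json.load(f)
print(sorted(index["grid_output"]["3D"]))   # -> ['hydrobase::rho']
```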

Comments (11)

  1. Gabriele Bozzola

    Thanks Wolfgang for this clear and detailed wishlist. I would like to add another item and share a thought.

    10. Grid arrays should be clearly distinguishable from grid functions. At the moment, Carpet outputs grid arrays in exactly the same way as grid functions. CarpetIOASCII also attaches some coordinates (which are completely meaningless). This confuses post-processing tools.

    (Bonus: thorns should be discouraged from implementing their own custom writing routines (e.g., VolumeIntegrals) unless strictly needed.)

    In the spirit of consolidating the data formats, I argue that Einstein Toolkit should provide a clear and well-documented way to analyze simulations. At the moment, Einstein Toolkit supports multiple different types of output and users write their own homebrew scripts depending on the way they decide to run simulations. For example, people might decide to output 2D grid data in the objectively inferior ASCII format because it is much easier to plot with gnuplot. If we provide a clear way to post-process data, users won’t need to understand the (not clearly documented) intricacies of how data is output to start doing science, and the way the output is stored will become an implementation detail under our control. This will allow us to settle on a specification for how the output should be produced.
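
    A rough sketch of what such a single entry point could look like follows; all class and method names are hypothetical (loosely inspired by existing tools such as PostCactus), and the readers themselves are only stubbed out. The point is that the storage format never appears in user code.

    ```python
    # Hypothetical facade: one entry point per simulation directory that hides
    # whether data lives in ASCII, HDF5, or anything else.
    class SimDir:
        """Entry point for postprocessing one simulation directory (sketch)."""

        def __init__(self, path):
            self.path = path

        def timeseries(self, reduction, variable):
            # A real implementation would dispatch to the appropriate reader
            # (TSV, HDF5, ...) based on what it finds under self.path.
            raise NotImplementedError("reader backends would go here")

        def grid_data(self, plane, variable):
            raise NotImplementedError("reader backends would go here")


    # Intended usage (commented out because the backends are stubs):
    sim = SimDir("/data/my_bns_run")
    # rho_max = sim.timeseries("maximum", "hydrobase::rho")
    # rho_xy  = sim.grid_data("xy", "hydrobase::rho")
    ```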

  2. Erik Schnetter

    I have added a routine to CarpetX that outputs metadata similar to what you request. See here for an example.

    It is difficult to create a single file that contains information about all iterations, because that file would need to be rewritten at every iteration. Instead, I’m creating a new file per iteration, and the post-processing tool should read all these files.

    Since this is just a proof of concept, there are two metadata files (this should change later). The first, cactus-metadata, should have all the data requested above. The second, carpetx-metadata, already existed in CarpetX. It describes the grid structure (that’s probably not interesting here), but incidentally also lists all parameters in the very explicit format requested.

    The metadata files are output in YAML. That’s a standard format that should be easy to read in Python, resulting in dictionaries and arrays.
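
    For illustration, reading all of these per-iteration files from Python could be as simple as the sketch below; the file naming pattern and the "iteration" key are assumptions, only the one-small-YAML-file-per-iteration layout is taken from the description above.

    ```python
    # Collect every per-iteration metadata file in an output directory into one
    # dictionary keyed by iteration number (file pattern and key are assumed).
    import glob
    import yaml   # PyYAML

    metadata = {}
    for path in sorted(glob.glob("output-0000/cactus-metadata.it*.yaml")):
        with open(path) as f:
            doc = yaml.safe_load(f)    # plain dicts, lists, and scalars
        metadata[doc["iteration"]] = doc

    print(sorted(metadata))            # all iterations found on disk
    ```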

    Wolfgang, Gabriele, could you comment?

  3. Gabriele Bozzola

    I can give you some first comments, but one would have to start thinking about the design of the postprocessing tool to make more serious ones.

    One quick comment is that the file doesn’t tell me everything I need to read a variable. Suppose I want to read max(hydrogpu::rho): which column is it in the TSV file? If I understand correctly how the TSV is structured (it contains many reductions), we would need to enforce certain constraints to make sure the column number can be determined without parsing the TSV file: for example, the order of variables and reductions in the YAML file must be the same as in the TSV file, and all the variables must appear in all the reductions (see the sketch below).
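
    To make the constraint concrete, here is a small sketch of the column arithmetic it would enable; the column layout, the leading columns, and the names are hypothetical.

    ```python
    # If the metadata lists reductions and variables in the same order as the
    # TSV columns, and every variable appears under every reduction, the column
    # index follows from arithmetic alone, without opening the TSV file.
    reductions = ["minimum", "maximum", "norm1", "norm2"]   # order as in metadata
    variables = ["hydrogpu::rho", "hydrogpu::press", "hydrogpu::eps"]
    n_leading = 2                                           # e.g. iteration, time

    def column_of(reduction, variable):
        return (n_leading
                + reductions.index(reduction) * len(variables)
                + variables.index(variable))

    print(column_of("maximum", "hydrogpu::rho"))            # column of max(rho)
    ```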

    A second comment: on some shared filesystems, opening files can be extremely expensive, so having fewer big files is much better than having many small ones. An example of this is how the Einstein Toolkit outputs multipoles now, which can produce either thousands of ASCII files or one HDF5 file (and the performance difference is really significant). If the data for each iteration is stored in a different file, I worry that this might lead to performance problems.

    Also, why did you pick YAML over the (faster but less powerful) JSON?

  4. Erik Schnetter

    The keys format_name and format_version tell you how to interpret the content of the file. I don’t think it is feasible to have a generic description in the metadata that would allow you to extract the information from all the file formats we support. (If that were possible, we wouldn’t have multiple file formats.) Thus the reader needs to understand and have special support for the specific file format, be it Silo or HDF5 or TSV or JSON.

    In this particular case, the TSV file has column headers that identify the content. The TSV files are small, and reading them is fast.
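
    In that case a reader can simply look columns up by header name instead of by position, along the lines of the following sketch (the header names and values are invented for illustration):

    ```python
    # Parse a header-carrying TSV and access a column by name; the tiny in-memory
    # file stands in for one of the per-iteration norm files described above.
    import csv
    import io

    tsv_text = (
        "iteration\ttime\tmaximum hydrogpu::rho\n"
        "256\t1.5\t1.23e-3\n"
    )

    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    print([row["maximum hydrogpu::rho"] for row in rows])   # look up by header
    ```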

    Regarding slow reading: In the example I provided, there are two files written per iteration, one TSV file with the norms, and one Silo/HDF5 file with 3D data. If there had been many nodes, then there might have been more files to allow nodes to write data independently (which speeds things up).

    If there is a thorn that produces many files, then the thorn should be updated, or it should switch to a standard output format. That issue is independent of setting up a table of contents. There are probably cases where having many ASCII files makes sense, e.g. for debugging or developing scripts.

    I chose YAML since that is already supported in CarpetX, and because it is human-readable. The file encoding really isn’t important and can be changed quickly.

  5. Wolfgang Kastaun reporter

    Maybe we should have two tickets, one for CarpetX and one for simplifying postprocessing with the Carpet-based infrastructure, assuming that will stay around for some time. It will probably take a while until I update postcactus for CarpetX.

  6. Wolfgang Kastaun reporter

    The proposed solution for CarpetX would probably simplify the logic that gathers information on the available data a lot. I’m not entirely sure about speed. The reason parsing the HDF5 files takes so long seems to be a design flaw that basically requires reading the whole file just to get the names of all datasets, combined with the unfortunate choice of keeping the table-of-contents information only in those names. I also did not understand what the solution would look like for runs using many nodes: would there be one file for each node or even for each MPI process, or is the information first collected on one node?

    But in principle, reading a few thousand YAML files should not take that long; we would have to try.

  7. Wolfgang Kastaun reporter

    Regarding the new format for reductions, do I understand correctly that there will be one file per iteration with all available reductions? That means one has to parse the data for all reductions just to get one of them. However, given the small data size, this might not be too wasteful.

    What about other data of the timeseries type, e.g. 0D output: are they treated differently, format-wise?

  8. Erik Schnetter

    The idea is to have one such metafile per output directory. I’ll have to think about the one-file-per-iteration setup; it does seem wasteful and inconvenient. There certainly won’t be one file per node; things would be aggregated.

    If you like the design for CarpetX, then we can repeat it for Carpet (or, rather, CactusBase/IOUtil) and use it for all simulations. I’m sure we’ll have to iterate on the design, and then flush out a few bugs where the design doesn’t make sense or is too limited.

    0D output etc. are already supported in the format. There is a key that specifies which directions of a variable are output: [0,1,2] is for 3D output, [1] is for output in the y direction, etc.
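
    As a reading-side illustration (the key name and surrounding structure are assumptions; only the [0,1,2] / [1] convention is from the description above), a tool could filter entries by output dimensionality like this:

    ```python
    # Filter metadata entries by their output directions: three directions means
    # full 3D output, a single direction means 1D output along that axis.
    entries = [
        {"variable": "hydrobase::rho", "directions": [0, 1, 2]},   # 3D output
        {"variable": "hydrobase::rho", "directions": [1]},         # 1D along y
    ]

    three_d = [e for e in entries if len(e["directions"]) == 3]
    print(three_d)
    ```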

    Yes, HDF5 is slow. That’s why I want to switch to ADIOS2. This format has essentially the same capabilities as HDF5 (blocks of data, arbitrary types, attributes, groups, etc.), but it properly separates metadata, and is parallel by design. It’s also safe to append to an ADIOS2 file (internally data are separated into “iterations” which cannot be modified once written).

  9. Erik Schnetter

    Wolfgang: I see I misread your question about 0D data and reductions. Yes, there is one file per iteration with all reductions. The file is small, and it’s probably more efficient to have all the reductions in a single file.

    0D output isn’t implemented yet. 1D output is currently (very inefficiently) one file per iteration per variable per direction. My plan is to have 1D and 2D output in binary as well, since keeping it in ASCII doesn’t scale. ASCII output is very convenient for debugging; I don’t have a good solution for this yet. Maybe keep the current inefficient format around, or have a simple tool that converts binary to ASCII.
