Enable HDF5 compression by default in Carpet

Issue #1282 closed
Ian Hinder created an issue

The current default for CarpetIOHDF5::compression_level is 0 (no compression). I have been using compression in most of my HDF5 files for years and have never run into any problems. CPUs are typically much faster than storage nowadays. I propose that the compression level should default to 9. This would affect output and checkpoint files and could lead to huge space savings. Apart from the checkpoint files being written and read more quickly and taking less disk space, the user should not notice anything, as the HDF5 library handles compression transparently.
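Until the default changes, users can opt in per simulation with a parameter-file setting like the following (a minimal parfile fragment; the parameter name is the real CarpetIOHDF5 one, the chosen level is just the value proposed here):

```
# Enable gzip/deflate compression for CarpetIOHDF5 output and checkpoints
CarpetIOHDF5::compression_level = 9   # 0 = off (current default), 1..9 = zlib levels
```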


Comments (9)

  1. Erik Schnetter

    Yes, it should be enabled by default.

    When we ran benchmarks, we found that a compression level of 1 produced almost the same file sizes as a level of 9.

    Other features (e.g. checksums?) could also be enabled by default.
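    HDF5's built-in compression filter is zlib's deflate, so the level-1-versus-9 tradeoff can be sketched with the Python standard library alone. The synthetic "smooth field" data below is an assumption; real ratios depend entirely on the data:

    ```python
    import struct
    import zlib

    # Synthetic stand-in for a smooth simulation field: slowly varying doubles.
    data = b"".join(struct.pack("<d", 1.0 + i * 1e-6) for i in range(100_000))

    for level in (1, 9):
        compressed = zlib.compress(data, level)
        ratio = len(compressed) / len(data)
        print(f"level {level}: {len(compressed):7d} bytes ({ratio:.1%} of original)")
    ```

    On data like this, level 9 mostly buys extra CPU time for a small additional size reduction, which is consistent with the benchmark observation above.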

  2. Frank Löffler

    I also agree that enabling compression is a good idea, and I go along with Erik that a compression level of 1 is almost as good as 9 and would be more conservative in terms of computation. I just wish better compression methods were available in stock HDF5, because one argument against using gzip compression is that compressing the uncompressed .h5 files with something else afterwards (bzip2 or xz) saves more space (but also makes the files unreadable with stock HDF5). Nevertheless, enabling compression by default is probably best.

    Checksums have been implemented for a while now, but I don't know how widely they are used; I regularly forget to enable them. I don't see a reason against enabling them by default too.
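    The point about stronger general-purpose compressors can also be sketched with the Python standard library. The synthetic data below is an assumption, the resulting ratios will differ for real HDF5 payloads, and (as noted above) bzip2/xz output is not readable by stock HDF5:

    ```python
    import bz2
    import lzma
    import struct
    import zlib

    # Synthetic stand-in for smooth simulation output.
    data = b"".join(struct.pack("<d", 1.0 + i * 1e-6) for i in range(100_000))

    results = {
        "gzip/deflate (what HDF5 ships)": zlib.compress(data, 9),
        "bzip2": bz2.compress(data, 9),
        "xz": lzma.compress(data, preset=9),
    }
    for name, compressed in results.items():
        print(f"{name}: {len(compressed)} bytes")
    ```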

  3. Roland Haas

    I had also used compression (with level 1, I believe) in the past for hydro runs with a good amount of 2D output. I found that (a) my run was slowed down, and (b) for the smallish datasets that are written to 2D HDF5 files, the file size actually increased.

    Ideally, before we make this change we should run at least one round of direct tests to compare speeds, e.g. qc0 and an NSNS inspiral (which I'll happily run) for a bit with and without compression to see what happens. Obviously, since this is a parameter, one can always change it back in one's runs, but I'd rather avoid adding a parameter that is set to non-default values by most users (though if it is faster for vacuum runs and slower for hydro runs, I would not be sure what to do).

  4. Roland Haas

    My best guess is that compression may be beneficial for 3D datasets but not for 2D ones (since compression implies HDF5 chunking). This is based on data from Philipp's large MHD run, where compressing 3D Bcons data is beneficial (at the 10% level) but compressing 2D data from the same files increases the 2D file size by about 11%. Since there is lots of turbulence in these simulations, compression is hard; on the other hand, the files also use "large" patch sizes, so compression can be effective and the overhead of chunking is not bad.

    So for checkpoints and 3D output compression seems beneficial, but not for 2D output. Assuming that 3D output is much bigger than 2D output even for a single 3D file, there is probably an overall benefit to enabling compression. We may want to introduce a set of out3d_compression_level etc. parameters, though, and use the current compression_level only for out_vars-type output (which is 3D) and checkpointing (which uses the same routines).
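    One reason small, hard-to-compress datasets can actually grow: deflate falls back to "stored" blocks for incompressible input, which still adds a few bytes of framing (and HDF5 adds its own chunk-index overhead on top, which this stdlib sketch does not model). Random bytes stand in here for turbulent data:

    ```python
    import os
    import zlib

    # Random bytes as a stand-in for turbulent, effectively incompressible data.
    patch_2d = os.urandom(64 * 64 * 8)        # small 2D patch of doubles
    block_3d = os.urandom(64 * 64 * 64 * 8)   # larger 3D block

    for name, raw in [("2D patch", patch_2d), ("3D block", block_3d)]:
        compressed = zlib.compress(raw, 1)
        print(f"{name}: {len(raw)} -> {len(compressed)} bytes")
    ```

    Both "compressed" outputs come out slightly larger than the input; the ~11% growth seen for real 2D files would come mostly from HDF5's chunking overhead rather than from deflate framing alone.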

  5. Frank Löffler

    Instead of distinguishing by 2D/3D, we could look at the overall size of the relevant data chunk. I do this successfully in another (non-Cactus) project: if the data is larger than a threshold chosen so that compression can, in general, be expected to pay off, we enable compression; otherwise we don't (and don't use chunks). This suggests adding a parameter for this size threshold.
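    A minimal sketch of that policy, assuming a hypothetical size-threshold parameter (the names below are invented for illustration; nothing like this exists in CarpetIOHDF5):

    ```python
    import zlib

    # Hypothetical threshold below which compression (and hence chunking) is skipped.
    MIN_COMPRESS_BYTES = 64 * 1024

    def maybe_compress(payload: bytes, level: int = 1):
        """Compress only when the dataset is large enough to plausibly benefit.

        Returns (was_compressed, data).
        """
        if len(payload) < MIN_COMPRESS_BYTES:
            return False, payload            # small dataset: store as-is, no chunks
        return True, zlib.compress(payload, level)

    def restore(was_compressed: bool, data: bytes) -> bytes:
        return zlib.decompress(data) if was_compressed else data
    ```

    With such a rule, a small 2D slice would be stored verbatim while a large 3D block (or checkpoint dataset) above the threshold would be compressed, without the user having to reason about output dimensionality.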
