HDF5 file integrity

Issue #278 resolved
Prof Garth Wells created an issue

HDF5 files might not be valid if a program is killed prematurely. It would be good to have a workaround for this. Need to:

  1. Check what approaches HDF5 provides natively for this issue
  2. Possibly allow output files, e.g. the 'heavy' HDF5 data in XDMF output, to be split across multiple files (controlled by the user)

Comments (15)

  1. Chris Richardson

    There doesn't seem to be much out there about option 1. I'm not sure there is much you can actually do if you get SIGKILL-ed with the file open.

  2. Mikael Mortensen

    Strangely enough, this came up for me two days ago. My machine crashed, and 1 GB and two days' worth of simulation results were lost. That is, the XDMF/HDF5 files were still there, but they were broken and, according to a few Google searches, unfixable.

    Any chance this would have been avoided if I had set "flush_output" to true?

  3. Jan Blechta

    An explicit HDF5File.flush() could also be a useful feature. Among other use cases, one could catch SIGKILL and call it.
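    A minimal sketch of that idea. Note that SIGKILL itself cannot be caught, so this uses SIGTERM; `flush_output` is a hypothetical stand-in for a call like `HDF5File.flush()` on the open file:

```python
import signal

# Hypothetical stand-in for flushing the open HDF5 file; a real handler
# would call something like h5file.flush() here.
flushed = {"done": False}

def flush_output():
    flushed["done"] = True

def on_sigterm(signum, frame):
    # SIGKILL cannot be caught, but SIGTERM (which many batch schedulers
    # send shortly before SIGKILL) can.
    flush_output()

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the scheduler's warning signal in-process:
signal.raise_signal(signal.SIGTERM)
print(flushed["done"])  # → True
```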

  4. Chris Richardson

    @blechta You might find it hard to catch SIGKILL. Other signals, e.g. SIGTERM, can be caught; I believe some batch scheduling systems can be configured to send SIGTERM just before SIGKILL.

    @mikael_mortensen I have found that setting parameters['flush_output'] = true helps maybe 50% of the time.

    I have made some edits to XDMFFile.cpp which should implement option 2 above; I'll try to push them today.

  5. Johan Hake

    Would it help to always close the file after a write? Then, when one writes to the file again, one reopens it for appending data.
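    A sketch of that close-after-every-write pattern, using a plain text file as a stand-in for an HDF5 file opened in append mode (the path and record format here are illustrative only):

```python
import os
import tempfile

# Illustrative output path; a real run would use the solver's output file.
path = os.path.join(tempfile.mkdtemp(), "results.log")

def write_step(data):
    # Open for append, write one record, and close immediately, so the
    # file on disk is in a consistent state between writes.
    with open(path, "a") as f:
        f.write(data + "\n")

write_step("step 0")
write_step("step 1")

with open(path) as f:
    print(f.read().splitlines())  # → ['step 0', 'step 1']
```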

  6. Lawrence Mitchell

    I believe PBS just does this out of the box, but it may be turned off. There is a queue-specified delay (site specific) between sending SIGTERM and SIGKILL. Your best bet is probably ARCHER support.

  7. Chris Richardson

    @johanhake that would work, but doesn't protect against interrupts whilst writing.

    Catching signals is a bit of a pain, and still not 100% effective against e.g. power outages.

    I have just pushed the xdmf-multiple-h5 branch.

    If anyone can test it out, that would be helpful...

  8. Chris Richardson

    Seems like the correct solution (?) is to use the H5FD_SPLIT file driver to save the file metadata separately from the main data. A backup of the metadata can be kept during any updates, which should make the file readable even if interrupted...

    However, XDMF seems not to support split raw/meta data format.

    See also the bottom of the HDF5 metadata page: metadata journalling will be supported from HDF5 v1.10 (we are currently on v1.8).

    Another option that might work is to copy the entire file to a backup before appending. Obviously this has performance implications.
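    A sketch of that copy-before-append idea: keep a snapshot of the last known-good file, so an interrupted append only loses the newest data. File names here are illustrative, not anything DOLFIN uses:

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "output.h5")
backup_file = data_file + ".bak"

# Write some initial "known good" data.
with open(data_file, "w") as f:
    f.write("initial data\n")

def safe_append(text):
    # Snapshot the file before touching it; if the append is interrupted,
    # the backup still holds a consistent copy.
    shutil.copy2(data_file, backup_file)
    with open(data_file, "a") as f:
        f.write(text)

safe_append("more data\n")
print(os.path.exists(backup_file))  # → True
```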

  9. Chris Richardson

    Has anyone else tried out the xdmf-multiple-h5 branch? It is working for me, so it could potentially be merged into next.

  10. Chris Richardson

    A workaround (multi-file option) is now in master, until such time as metadata journalling comes into HDF5.
