If a simulation runs out of disk space while writing an HDF5 file, the simulation will terminate. The HDF5 file being written may then be corrupt, and all data in it may be irretrievable. In that case, restarting from the last-written checkpoint file can leave a "gap" in the data covering the period between the start of the failed restart and the last checkpoint written.
Steps to reproduce:
• Start a simulation which checkpoints periodically and consists of several restarts
• Keep all checkpoint files
• Restart 0000 completes successfully and checkpoints at iteration i1
• Restart 0001 checkpoints once after some evolution, at iteration i2
• Restart 0001 terminates abnormally while writing an HDF5 output file at iteration i3
• The output file is corrupted and unrecoverable, so there is no data from iteration i1 to iteration i3
• Restart 0002 starts at iteration i2, as this is the last checkpoint available
• The simulation continues until the end, but the data from the corrupted HDF5 file between iterations i1 and i2 is lost
Possible solutions:
1. Write HDF5 files safely, e.g. by first copying the file to a new temporary file, performing the write, then atomically moving the temporary file over the original file (a sketch follows this list). The original file would then survive a crash during the write of the new file. This could be very expensive for 3D output files.
2. Start a new set of HDF5 files after each checkpoint. This seems to be the most efficient and simplest solution, but requires readers of HDF5 files to be modified to take it into account.
3. Check the consistency of all HDF5 files in the previous restart(s) on recovery, and recover from the latest checkpoint file for which all previous HDF5 files are valid. We could use code to check the HDF5 file, or some other flagging mechanism to indicate that HDF5 writes completed successfully; e.g. we could rename the HDF5 file to .tmp during writes and rename it back after a successful write (a second sketch follows this list). This is complex, requires Cactus or simfactory to look into previous restarts, applies only to HDF5 files, and breaks several abstraction barriers.
4. Wait for HDF5 journalling support. As far as I know, only metadata journalling is planned, which is probably not enough, and in any case the HDF5 developers are not actively working on the next version at the moment due to lack of funding.
5. Checkpoint only on termination of the simulation.
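To make option 1 concrete, here is a minimal sketch using the HDF5 C API and POSIX rename(); the function names (copy_file, append_output_safely) and the .tmp suffix are placeholders invented for this example, not anything Cactus or Carpet currently provides:

    /* Sketch of option 1: write to a copy, then atomically replace the
     * original.  Assumes POSIX rename() semantics; names are placeholders. */
    #include <hdf5.h>
    #include <stdio.h>

    static int copy_file(const char *src, const char *dst)
    {
        /* Plain byte copy; error handling abbreviated for brevity. */
        FILE *in  = fopen(src, "rb");
        FILE *out = in ? fopen(dst, "wb") : NULL;
        char buf[1 << 16];
        size_t n;
        int ok = (in && out);
        while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
            ok = (fwrite(buf, 1, n, out) == n);
        if (in) fclose(in);
        if (out && fclose(out) != 0) ok = 0;
        return ok ? 0 : -1;
    }

    int append_output_safely(const char *h5file)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", h5file);

        /* 1. Copy the existing file; a crash from here on leaves the
              original untouched.  (For the very first write there would be
              nothing to copy.) */
        if (copy_file(h5file, tmp) != 0)
            return -1;

        /* 2. Perform the possibly-failing write on the copy. */
        hid_t fd = H5Fopen(tmp, H5F_ACC_RDWR, H5P_DEFAULT);
        if (fd < 0)
            return -1;
        /* ... H5Dcreate/H5Dwrite calls for the new iteration go here ... */
        if (H5Fclose(fd) < 0)
            return -1;

        /* 3. Atomically replace the original with the updated copy; POSIX
              rename() leaves either the old or the new file, never a
              partial one. */
        return rename(tmp, h5file);
    }

Similarly, a sketch of the per-file check for option 3, combining the .tmp rename flag with HDF5's H5Fis_hdf5(). Note that H5Fis_hdf5() only verifies the file signature, so this is a heuristic rather than a full consistency check, and the function name is again hypothetical:

    #include <hdf5.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 1 if the output file looks usable on recovery, 0 otherwise. */
    int hdf5_file_looks_valid(const char *h5file)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", h5file);

        /* A leftover .tmp file means the last write never completed. */
        if (access(tmp, F_OK) == 0)
            return 0;

        /* Check the HDF5 signature and that the file can be opened
           read-only; this catches some, but not all, damage. */
        if (H5Fis_hdf5(h5file) <= 0)
            return 0;
        hid_t fd = H5Fopen(h5file, H5F_ACC_RDONLY, H5P_DEFAULT);
        if (fd < 0)
            return 0;
        H5Fclose(fd);
        return 1;
    }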
In reality, we do not keep all checkpoint files; I usually keep just the last one. I believe a Cactus simulation will only delete checkpoint files which it has itself written, so there will generally be one checkpoint file kept per restart: the last one written. This means you can always recover from the above situation by rerunning the restart during which the problem occurred. However, keeping one checkpoint file per restart is a problem in itself, and if we fix that as well, the potential for losing data after an interrupted write returns.