SimFactory could provide a "get" command which copies the lightweight data from a remote simulation to the local machine. For example, I use a script called "getsim" which rsyncs a simulation directory, excluding all known "large" files: checkpoints, 2D and 3D HDF5 output files, the Cactus executable, core dumps, etc. There could also be a "quick" mode which excludes even more. My script is attached as an example. This is slightly similar to archiving, but has a different purpose: it is meant to be run regularly on the local machine to keep track of a simulation, rather than to archive it permanently once it is complete.
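A sketch of such a "get" in Python (the attached script itself is not reproduced here): it builds an rsync command with a set of exclude patterns. The host, path, and pattern names below are illustrative assumptions, not SimFactory conventions.

```python
import subprocess  # only needed to actually run the command

# Example "large file" patterns to skip; the exact names depend on the
# simulation's output configuration, so treat these as illustrative only.
LARGE_PATTERNS = [
    "checkpoint*",                    # checkpoint files
    "*.xy.h5", "*.xz.h5", "*.yz.h5",  # 2D HDF5 output
    "*.xyz.h5",                       # 3D HDF5 output
    "cactus_*",                       # the Cactus executable
    "core", "core.*",                 # core dumps
]
QUICK_PATTERNS = ["*.h5"]             # "quick" mode: skip all HDF5 output

def getsim_command(remote, simdir, dest, quick=False):
    """Build the rsync invocation for mirroring a remote simulation."""
    patterns = LARGE_PATTERNS + (QUICK_PATTERNS if quick else [])
    cmd = ["rsync", "-avz"]
    for pat in patterns:
        cmd += ["--exclude", pat]
    cmd += ["%s:%s/" % (remote, simdir), dest]
    return cmd

# To actually run it, e.g.:
# subprocess.run(getsim_command("login.example.org", "simulations/bbh", "bbh"))
```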
We will need some way to ensure that the retrieved data is in a consistent state. Truncated ASCII files can be dealt with, though this is not ideal, but partially-written HDF5 files cannot. This can be a serious problem when several HDF5 files are being synced: after each sync there is a high probability that at least one of them was caught while incompletely written. Some options:
(1) SimFactory (on the remote machine) writes a control file (either into the simulation or somewhere else) which tells the simulation not to open any new files for writing. Once the simulation has closed all currently open files as part of normal operation, it records this in the control file, continues running, and writes new files only once the control file tells it to. This has the disadvantage of blocking all output from the active restart for the duration of the transfer; for slow data transfers, this could be a significant amount of time. It also does not leave the different files in a consistent state: one output file may contain the current iteration while another does not.
(2) Before writing a file, Cactus would move it aside (file.tmp) and move it back only once it was fully written. SimFactory would not sync *.tmp files. Any files which had been renamed to *.tmp during the first pass would then be synced in a later pass under their original names, once they exist; repeat until all files have been synced. This also does not maintain a consistent state across multiple files, but it does not require write access to the simulation directory, so it could also be used by collaborators who do not own the simulation.
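The multi-pass sync could be sketched as follows; list_remote_files and sync_files stand in for hypothetical rsync-based helpers:

```python
def sync_until_stable(list_remote_files, sync_files, max_passes=10):
    """Repeatedly sync files that are not mid-write, until a pass finds
    nothing new to transfer.  Files currently being written appear as
    '<name>.tmp' and are skipped; once renamed back, they show up under
    their original names and are picked up on a later pass."""
    synced = set()
    for _ in range(max_passes):
        stable = {f for f in list_remote_files() if not f.endswith(".tmp")}
        todo = stable - synced
        if not todo:
            return synced        # nothing new: the mirror is stable
        sync_files(sorted(todo))
        synced |= todo
    return synced                # gave up after max_passes; partial mirror
```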
(3) Similar to (1), but applied only at the end of an iteration. SimFactory would ask Cactus to pause the simulation at the end of the current iteration, when all files are presumably valid on disk. Cactus would record in a control file that the simulation had paused; SimFactory would then transfer the data and unpause the simulation when it was finished. This would guarantee that the synced data is in a consistent state. We would want some mechanism to ensure that simulations do not remain paused forever, perhaps by requiring SimFactory to update the control file periodically while it is still syncing.
All of the above apply only to the active restart. I think (3) is the simplest and most robust. It is also the most expensive in SUs, since the job keeps occupying its allocated nodes while the simulation is paused. The control file location could be customisable, and placed somewhere that all collaborators have write access.