SimFactory metadata deleted by periodic filesystem purges

Issue #322 new
Ian Hinder created an issue

Production filesystems are subject to periodic purges, typically with a purge time of weeks or months, in which data that has not been accessed recently is deleted. Some restarts of a very long-running simulation can therefore be deleted by the system. An automated archiving system could address this, but it would not protect the simulation metadata directory (currently called SIMFACTORY) or any restarts that have not yet run, which would also be purged. Losing these would make it impossible to submit future restarts, and it limits how far ahead chained restarts can be submitted to the purge time of the system.
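
To get a feel for how exposed a given simulation is, something like the following sketch could list the files that are close to being purged. It assumes the purge policy is based on last access time (atime); a particular HPC centre may use modification or change time instead. The function name `files_at_risk` and the parameters `purge_days` and `margin_days` are hypothetical and not part of SimFactory.

```python
import os
import time
from pathlib import Path


def files_at_risk(sim_dir, purge_days, margin_days=0.0, now=None):
    """Return files under sim_dir not accessed for (purge_days - margin_days) days.

    Assumption: the purge is driven by atime. This only reads file status,
    so it does not itself update the access times.
    """
    now = time.time() if now is None else now
    cutoff = now - (purge_days - margin_days) * 86400  # seconds per day
    at_risk = []
    for root, _dirs, files in os.walk(sim_dir):
        for name in files:
            path = Path(root) / name
            # lstat: inspect the entry itself, do not follow symbolic links
            if path.lstat().st_atime < cutoff:
                at_risk.append(path)
    return sorted(at_risk)
```

For a system with a 2-week purge time, `files_at_risk(sim_dir, purge_days=14, margin_days=2)` would report everything that has gone untouched for 12 days or more.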

One possible solution would be to store a backup, or "shadow", copy of all the simulation metadata in a non-volatile location, such as the user's home directory or a "work" directory which is not purged. The details would need to be worked out.
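
As a rough illustration of the shadow-copy idea, the sketch below mirrors the metadata directory into a non-purged location and copies back any files that have gone missing. It is not an existing SimFactory feature: the function names, the `shadow_root` parameter and the layout `shadow_root/<simulation name>/SIMFACTORY` are all assumptions, and it requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def shadow_metadata(sim_dir, shadow_root, metadata_name="SIMFACTORY"):
    """Copy the metadata directory of sim_dir into shadow_root (hypothetical layout)."""
    sim_dir = Path(sim_dir)
    src = sim_dir / metadata_name
    dst = Path(shadow_root) / sim_dir.name / metadata_name
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Overwrite files already in the shadow so that it tracks the live copy
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


def restore_metadata(sim_dir, shadow_root, metadata_name="SIMFACTORY"):
    """Copy back only those metadata files that are missing from sim_dir."""
    sim_dir = Path(sim_dir)
    src = Path(shadow_root) / sim_dir.name / metadata_name
    dst = sim_dir / metadata_name
    restored = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            restored.append(target)
    return restored
```

One way to use this would be to call `shadow_metadata` whenever a restart is submitted or finishes, and `restore_metadata` before submitting the next one. Restoring only missing files means a stale shadow never overwrites metadata that survived the purge.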

This is not a serious issue yet.

Comments (4)

  1. Ian Hinder reporter

    This could also be addressed by "touching" each of the required metadata files when each restart begins. Since the metadata files are not large, this should not be seen as an abuse of the system.

  2. Erik Schnetter

    I'm afraid that the instructions on these systems are quite clear: touching files is considered abuse. I would not suggest that people do this without permission from the HPC centres.

    However, touching files that will be needed for a currently running or submitted restart is a different issue. We'd need a mechanism to touch these files often enough if the job waits in the queue for a significant amount of time.

  3. Ian Hinder reporter

    Another (more complicated, possibly too complicated and confusing) option is to have the simulation metadata stored in a work filesystem and the actual data only in the scratch filesystem. They could be connected by symbolic links in the work filesystem so the user "sees" a unified simulation there. SimFactory itself would know how to handle these links when archiving, purging or getting the simulation.

  4. Ian Hinder reporter

    Re: comment:2, we would also need a mechanism to touch the checkpoint files, or arrange for these to be stored in a non-volatile location as well. They might be too big for that. Without the checkpoint files, the simulation can't be recovered anyway. We should ask the admins what to do about checkpoint files which are sufficiently old to be purged between jobs. I have heard of 2-week purge times, and it's not impossible that jobs could take longer than that to start.
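
The split layout suggested in comment 3 (metadata in a work filesystem, data in scratch, joined by symbolic links) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `work_root` and `scratch_root` parameters and the restart directory name used in the usage note are hypothetical, and SimFactory does not currently work this way.

```python
import os
from pathlib import Path


def create_split_simulation(name, work_root, scratch_root, metadata_name="SIMFACTORY"):
    """Create the metadata directory in work and the data directory in scratch."""
    work_sim = Path(work_root) / name
    scratch_sim = Path(scratch_root) / name
    (work_sim / metadata_name).mkdir(parents=True, exist_ok=True)
    scratch_sim.mkdir(parents=True, exist_ok=True)
    return work_sim, scratch_sim


def add_restart(work_sim, scratch_sim, restart_name):
    """Create the restart's data directory in scratch and link it from work."""
    data_dir = Path(scratch_sim) / restart_name
    data_dir.mkdir(parents=True, exist_ok=True)
    link = Path(work_sim) / restart_name
    if not link.is_symlink():
        os.symlink(data_dir.resolve(), link, target_is_directory=True)
    return link


def purged_restarts(work_sim):
    """Names of restart links whose scratch target no longer exists."""
    return sorted(
        p.name
        for p in Path(work_sim).iterdir()
        if p.is_symlink() and not p.exists()  # exists() follows the link
    )
```

With this layout the user works in `work_root/<name>` and sees both the metadata and, through links such as a hypothetical `output-0000`, the restart data. After a purge the metadata is intact and `purged_restarts` reports which restarts have lost their data. It does not help with the checkpoint files discussed in comment 4, which would still live in scratch.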
