Refactor/redesign archiving

Issue #64 new
anonymous created an issue

Implement archiving using archive machines with an iomethod key. Provide another key, like rsync-excludes, so that people can exclude files from being archived. Provide a lightweight archive-like method for copying a simulation from one machine to another.


Comments (4)

  1. Ian Hinder

    Archiving needs to take the following into account:

    • On some systems archiving cannot be performed only at the end of the simulation because the first restarts might be purged before that happens. On Kraken there is a 30-day purge policy and we have some simulations which have taken months.
    • Archiving can take a long time - longer than an interactive session on the login node can be expected to last. Some systems, e.g. Kraken, provide a dedicated archiving queue. We should use such a queue if it is available, or we could use "screen" on the login node if not.
    • There could be both a manual and an automatic archiving method.

    Here is a possible implementation:

    If a simulation is created with the --archive option, then each time a restart runs, simfactory checks whether there are any previous restarts which have not been archived and are not currently being archived. If there are any, it submits a job to the archive queue which archives each of them. Each restart would be tar/gzipped independently; this is necessary because the simulation might not have finished yet. It would be very convenient to be able to add the "archive" option to an existing simulation, so that subsequent restarts archive the whole simulation. You often don't know which simulations are going to end up being long-lived until after a few restarts.
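
A minimal sketch of the check and the per-restart tar/gzip step described above. The `output-NNNN` directory naming, the `ARCHIVING`/`ARCHIVED` marker files and both function names are assumptions made for illustration, not existing simfactory behaviour, and the sketch archives in-process rather than submitting a job to an archive queue.

```python
import re
import tarfile
from pathlib import Path

# Hypothetical conventions, not taken from simfactory itself.
RESTART_NAME = re.compile(r"output-\d{4}")
MARKERS = {"ARCHIVING", "ARCHIVED"}


def restarts_to_archive(sim_dir, active_restart=None):
    """Return restarts that are neither archived nor currently being archived."""
    pending = []
    for restart in sorted(Path(sim_dir).iterdir()):
        if not RESTART_NAME.fullmatch(restart.name) or not restart.is_dir():
            continue
        if restart.name == active_restart:
            continue  # still running, so not yet safe to archive
        if any((restart / marker).exists() for marker in MARKERS):
            continue
        pending.append(restart)
    return pending


def _skip_markers(tarinfo):
    """Keep the bookkeeping marker files out of the tarball."""
    return None if Path(tarinfo.name).name in MARKERS else tarinfo


def archive_restart(restart, archive_dir):
    """Tar/gzip a single restart independently of all other restarts."""
    restart = Path(restart)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    in_progress = restart / "ARCHIVING"
    in_progress.touch()
    tarball = archive_dir / (restart.name + ".tar.gz")
    try:
        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(restart, arcname=restart.name, filter=_skip_markers)
        (restart / "ARCHIVED").touch()
    finally:
        # Remove the in-progress marker whether or not archiving succeeded.
        in_progress.unlink()
    return tarball
```

Because the state lives in marker files inside each restart, the same check would also work when archiving is switched on for a simulation that already has several restarts.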

    For simulations which are not archived automatically, simfactory could provide an archive command to perform the archiving manually. There could be variants to do this either immediately or using the queueing system. It's probably best to again archive individual restarts, to keep the code as simple as possible and to only need to support a single archiving convention.

    There could also be a "restore" command which restored all the restarts of a simulation. This again might have to be run in an archive queue.

    There should be options to exclude specific files from archiving. By default this would be checkpoint files only, but we could provide templates for 3D output files as well, as these often are not needed.
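
The exclusion option could be sketched as a filter passed to Python's `tarfile`. The pattern lists below are illustrative guesses only (real checkpoint and 3D output file names depend on the thorns and parameters in use), and `make_exclude_filter` is a hypothetical helper name.

```python
import fnmatch
import posixpath

# Hypothetical patterns; adjust to the actual output file names.
CHECKPOINT_PATTERNS = ["checkpoint.chkpt.*"]
THREE_D_OUTPUT_PATTERNS = ["*.xyz.h5"]


def make_exclude_filter(patterns):
    """Build a tarfile filter that drops members whose basename matches a pattern."""
    def exclude(tarinfo):
        # Member names inside a tar archive always use "/" as separator.
        name = posixpath.basename(tarinfo.name)
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            return None  # returning None tells tarfile to skip this member
        return tarinfo
    return exclude
```

It would be used as `tar.add(restart_dir, filter=make_exclude_filter(CHECKPOINT_PATTERNS))`, with the 3D template appended to the list when requested.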

  2. Ian Hinder

    If we make very large tar files before archiving, then on Lustre filesystems a single storage target might become full. To avoid this problem, we can set the stripe count of the tar file so that it is spread across multiple storage targets (a stripe count of -1 stripes over all available targets):

    lfs setstripe -c -1 <filename>

    Since each machine probably has its own archiving system, we will want to be able to choose an archiving script for each machine in the mdb. We could have standard ones for, e.g., TSM. Such a script could detect whether it is running on Lustre and, if so, set the stripe count.
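
One way the Lustre detection might look, as a sketch only: it takes the contents of a Linux mount table (the `/proc/mounts` format) as a string and decides whether the `lfs setstripe` call above should run before the tar file is written. All function names are hypothetical, and the sketch only builds the command rather than executing it.

```python
import posixpath


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts columns: device, mount point, filesystem type, ...
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(prefix) > best_len:
            best_len, best_type = len(prefix), fstype
    return best_type


def stripe_command(tarball):
    """The command quoted above, as an argument list for subprocess."""
    return ["lfs", "setstripe", "-c", "-1", tarball]


def pre_archive_commands(tarball, mounts_text):
    """Commands to run before creating the tarball (empty when not on Lustre)."""
    if filesystem_type(tarball, mounts_text) == "lustre":
        return [stripe_command(tarball)]
    return []
```

A machine's archiving script could call `pre_archive_commands(path, open("/proc/mounts").read())` and pass each returned list to `subprocess.run` before invoking tar.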

    When more than one person in a group works on a project, each of them should be given access to restore the archive. There are TSM commands to do this. There should be a mechanism in simfactory for deciding who to give access to, and this should be done by default. This could be overridden on a per-simulation basis. Something like "--archive-access user1,user2,user3" and "archive-access = user1,user2,user3".
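
The override rule could be as simple as the following sketch, in which a per-simulation `--archive-access` value, when given, replaces the default `archive-access` setting. Both function names are hypothetical, and the TSM commands that would actually grant access are not shown.

```python
def parse_access_list(value):
    """Split 'user1,user2,user3' into a list, ignoring blanks and whitespace."""
    return [user.strip() for user in value.split(",") if user.strip()]


def resolve_archive_access(default_value, option_value=None):
    """Per-simulation option wins over the default; an empty option means nobody."""
    if option_value is not None:
        return parse_access_list(option_value)
    return parse_access_list(default_value or "")
```

For example, `resolve_archive_access("user1,user2", None)` yields the default list, while passing `"user3"` as the option restricts access to that one user.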

    Some ideas from Erik:

    • High level commands provided by simfactory:

      1. archive a simulation
      2. list archived simulations
      3. restore a simulation
      4. delete an archive

    • When compressing files, often gzip --fast is much faster than the default options, and the loss of compression is fairly small. If compression time becomes a bottleneck I would try this.
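
If the archiving were done from Python rather than by calling tar and gzip directly (an assumption; the issue does not say), the equivalent of `gzip --fast` is compression level 1 in the standard `tarfile` module. `make_tarball` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def make_tarball(source, tarball, fast=True):
    """Create a gzip-compressed tarball; level 1 corresponds to gzip --fast."""
    level = 1 if fast else 9
    source = Path(source)
    with tarfile.open(tarball, "w:gz", compresslevel=level) as tar:
        tar.add(source, arcname=source.name)
    return Path(tarball)
```

Timing both settings on a representative restart would show whether the faster level is worth the larger archive on a given machine.
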
  3. Roland Haas

    Hello all, this is a student project at NCSA for this semester. Is there any existing implementation (worth salvaging), or should we start from scratch? Most likely this iteration will be rather hands-on, since we really want this to work first for Blue Waters/NCSA, Stampede2/TACC and CampusCluster/UIUC, which hopefully covers a variety of use cases.
