python simfactory submit does not remove active link after job execution has finished

Issue #285 closed
anonymous created an issue

The python version of simfactory does not remove the active link

output-xxxx-active

in the simulation directory, after a job has finished.

As a consequence, the next submit fails with an error message saying that more than one active link has been found. Removing the link manually enables a proper submit of the next job. The problem occurs on a Lustre file system.

Comments (8)

  1. Erik Schnetter
    Removing the link is part of an explicit action that Simfactory needs to take after a restart has finished. This step is called "cleanup" and moves the simulation from the active to the inactive state. Only inactive simulations can have a new restart submitted. Among other things, this prevents accidentally submitting multiple restarts for the same simulation.

    If the link is not removed, this indicates that the restart has not been cleaned up. When you submit a new restart, Simfactory should automatically try to clean up existing restarts, and output a descriptive error message if this fails. For example, the previous restart may still be running.

    Manually removing this symbolic link should never be necessary. If you want, you can execute the cleanup command manually, but this should also not be necessary.

    Can you provide more details?

  2. anonymous reporter
    After my last run:

    [snip] ls -l simulations/cactus_test-luca-no-openmp

    drwxr-xr-x 5 alibeck users 4096 Feb 13 17:06 output-0000
    drwxr-xr-x 5 alibeck users 4096 Feb 15 11:04 output-0001
    lrwxrwxrwx 1 alibeck users 11 Feb 14 11:54 output-0001-active -> output-0001
    drwxr-xr-x 7 alibeck users 4096 Feb 13 17:06 SIMFACTORY
    [snip]

    Submitting the new job with simfactory command:

    simfactory/bin/sim submit cactus_test-luca-no-openmp --configuration test-luca-no-openmp --machine=damiana --hostname=damiana --procs=8 --queue=intel.q --walltime=06:00:0 --parfile=TOVMHD_CarpetRegrid2_PPM_HLLE_BSSNMoL.par

    failed with the error:

    [snip] Error: more than one active restart id found in directory /lustre/AEI/alibeck/simulations/cactus_test-luca-no-openmp [snip]

    And looking now into

    simulations/cactus_test-luca-no-openmp:

    [snip]
    -rw-r--r-- 1 alibeck users 41397 Feb 15 11:09 log.txt
    drwxr-xr-x 5 alibeck users 4096 Feb 13 17:06 output-0000
    drwxr-xr-x 5 alibeck users 4096 Feb 15 11:04 output-0001
    lrwxrwxrwx 1 alibeck users 11 Feb 14 11:54 output-0001-active -> output-0001
    drwxr-xr-x 3 alibeck users 4096 Feb 15 11:26 output-0002
    lrwxrwxrwx 1 alibeck users 11 Feb 15 11:09 output-0002-active -> output-0002
    drwxr-xr-x 7 alibeck users 4096 Feb 13 17:06 SIMFACTORY
    [snip]

    If I manually remove output-0001-active, I can submit the job again.

  3. Erik Schnetter
    I have just tried submitting a job on Damiana that recovers twice from a checkpoint, and this worked fine. No superfluous "-active" symbolic links remained.

  4. Barry Wardell
    I am also using SimFactory, including checkpointing and recovery, without any problems, so I think this is no longer an issue.
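
For anyone who hits the same symptom, the state shown in the listings of comment 2 can be inspected without touching Simfactory itself. The following is a minimal diagnostic sketch, not Simfactory code: the function name `find_active_links` is hypothetical, and the `output-NNNN-active` naming pattern is assumed from the directory listings in this thread. It only reports the links; it does not remove anything.

```python
import os
import re
import tempfile

# Naming pattern assumed from the listings in this thread (output-0001-active);
# this is not taken from Simfactory's source.
ACTIVE_LINK = re.compile(r"^output-(\d+)-active$")


def find_active_links(sim_dir):
    """Return a sorted list of (restart_id, link_name, target) tuples,
    one for each '-active' symbolic link found in sim_dir."""
    found = []
    for name in os.listdir(sim_dir):
        match = ACTIVE_LINK.match(name)
        path = os.path.join(sim_dir, name)
        # Only count real symbolic links, not directories with a similar name.
        if match and os.path.islink(path):
            found.append((int(match.group(1)), name, os.readlink(path)))
    return sorted(found)


if __name__ == "__main__":
    # Rebuild the layout from comment 2 in a scratch directory.
    with tempfile.TemporaryDirectory() as sim_dir:
        for restart in ("output-0000", "output-0001", "output-0002"):
            os.mkdir(os.path.join(sim_dir, restart))
        os.symlink("output-0001", os.path.join(sim_dir, "output-0001-active"))
        os.symlink("output-0002", os.path.join(sim_dir, "output-0002-active"))

        links = find_active_links(sim_dir)
        for restart_id, name, target in links:
            print(restart_id, name, "->", target)
        if len(links) > 1:
            print("more than one active link present")
```

If more than one link is reported, that matches the condition behind the error quoted in comment 2. As comment 1 notes, the intended remedy is Simfactory's own cleanup step rather than deleting the link by hand.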
