python simfactory submit does not remove active link after job execution has finished
The Python version of simfactory does not remove the active link output-xxxx-active in the simulation directory after a job has finished.
As a consequence, the next submit fails with an error message saying that more than one active link has been found. Removing the link manually enables a proper submit of the next job. The problem occurs on a Lustre file system.
Comments (8)
reporter - removed comment
After my last run:
[snip] ls -l simulations/cactus_test-luca-no-openmp
drwxr-xr-x 5 alibeck users 4096 Feb 13 17:06 output-0000
drwxr-xr-x 5 alibeck users 4096 Feb 15 11:04 output-0001
lrwxrwxrwx 1 alibeck users 11 Feb 14 11:54 output-0001-active -> output-0001
drwxr-xr-x 7 alibeck users 4096 Feb 13 17:06 SIMFACTORY
[snip]
Submitting the new job with simfactory command:
simfactory/bin/sim submit cactus_test-luca-no-openmp --configuration test-luca-no-openmp --machine=damiana --hostname=damiana --procs=8 --queue=intel.q --walltime=06:00:0 --parfile=TOVMHD_CarpetRegrid2_PPM_HLLE_BSSNMoL.par
failed with the error:
[snip] Error: more than one active restart id found in directory /lustre/AEI/alibeck/simulations/cactus_test-luca-no-openmp [snip]
And looking now into
simulations/cactus_test-luca-no-openmp:
[snip]
-rw-r--r-- 1 alibeck users 41397 Feb 15 11:09 log.txt
drwxr-xr-x 5 alibeck users 4096 Feb 13 17:06 output-0000
drwxr-xr-x 5 alibeck users 4096 Feb 15 11:04 output-0001
lrwxrwxrwx 1 alibeck users 11 Feb 14 11:54 output-0001-active -> output-0001
drwxr-xr-x 3 alibeck users 4096 Feb 15 11:26 output-0002
lrwxrwxrwx 1 alibeck users 11 Feb 15 11:09 output-0002-active -> output-0002
drwxr-xr-x 7 alibeck users 4096 Feb 13 17:06 SIMFACTORY
[snip]
If I manually remove output-0001-active, I can submit the job again.
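For anyone who wants to check a simulation directory for leftover links before submitting, here is a minimal Python sketch. It is not part of simfactory. The function name find_active_links is hypothetical, and the pattern output-NNNN-active is assumed from the listings above.

```python
import os
import re

# Naming pattern assumed from the listings above, e.g. output-0001-active -> output-0001
ACTIVE_LINK = re.compile(r"^output-(\d{4})-active$")


def find_active_links(sim_dir):
    """Return sorted (link_name, target) pairs for every active-restart symlink in sim_dir."""
    found = []
    for name in sorted(os.listdir(sim_dir)):
        path = os.path.join(sim_dir, name)
        # Only symbolic links count; a regular directory with a matching name is ignored.
        if ACTIVE_LINK.match(name) and os.path.islink(path):
            found.append((name, os.readlink(path)))
    return found
```

If this returns more than one entry for a directory, that matches the situation in which the "more than one active restart id found" error was reported here.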
- changed component to SimFactory
- removed comment
-
reporter - removed comment
Following a hint from Erik, I have just updated simfactory from the repository:
https://svn.cct.lsu.edu/repos/numrel/simfactory/branches/PYSIM_2010
However, the bug still remains.
-
- removed comment
I have just tried submitting a job on Damiana that recovers twice from a checkpoint, and this worked fine. No superfluous "-active" symbolic links remained.
-
- removed comment
I am also using SimFactory - including checkpointing and recovery - without any problems so I think this is no longer an issue.
-
- changed status to resolved
- removed comment
I close this ticket because the issue has been resolved.
-
- edited description
- changed status to closed
Removing the link is part of an explicit action that Simfactory needs to take after a restart has finished. This step is called "cleanup" and moves the simulation from the active to the inactive state. Only inactive simulations can have a new restart submitted. Among other things, this prevents accidentally submitting multiple restarts for the same simulation.
If the link is not removed, then this indicates that the restart is not cleaned up. When you submit a new restart, Simfactory should automatically try to clean up existing restarts, and output a descriptive error message if this fails. For example, it can be that the previous restart is still running.
Manually removing this symbolic link should never be necessary. If you want, you can execute the cleanup command manually, but this should also not be necessary.
Can you provide more details?
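To make the active/inactive rule above concrete, here is a rough Python sketch of the logic as described in this comment. It is an illustration under assumptions, not Simfactory's actual code. The names cleanup_restart and can_submit are hypothetical, and so is the is_running callback, which stands in for a query to the queuing system.

```python
import os


def cleanup_restart(sim_dir, restart_id, is_running):
    """Hypothetical cleanup step: remove the active link unless the restart still runs.

    Returns True if the restart is inactive afterwards, False otherwise.
    """
    link = os.path.join(sim_dir, "output-%04d-active" % restart_id)
    if not os.path.islink(link):
        return True  # no active link, so this restart is already inactive
    if is_running(restart_id):
        return False  # a running restart must not be cleaned up
    os.remove(link)  # removes the symbolic link only, not the output directory
    return True


def can_submit(sim_dir, restart_ids, is_running):
    """A new restart may be submitted only if every existing restart could be cleaned up."""
    results = [cleanup_restart(sim_dir, r, is_running) for r in restart_ids]
    return all(results)
```

In this sketch a leftover link from a finished restart is removed automatically on the next submit, while a link belonging to a running restart blocks the submit.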