Restart fails
A simulation restart with SimFactory failed with the following error message:
DEBUG: checkpoint file: /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
file=/work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
dfile=/work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
DEBUG: linking /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5 to /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
before link
Error: Could not link checkpoint file /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5 to /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
Comments (4)
On second thought, it might be because I wasn't checking to make sure the restore_dir, in this case /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/, existed before attempting to link the file. I've added code to create the restore_dir if it doesn't exist.
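A minimal sketch of the fix described above, assuming the link is made with os.link and that the missing directory is a parent of the destination path. The helper name link_checkpoint is hypothetical and is not taken from the SimFactory source.

```python
import os


def link_checkpoint(src, dst):
    """Hard-link src to dst, creating dst's parent directory first.

    Hypothetical helper for illustration; not SimFactory's actual code.
    """
    dst_dir = os.path.dirname(dst)
    if dst_dir:
        # os.link fails with ENOENT if the destination directory is missing,
        # so create it (and any missing parents) before linking.
        os.makedirs(dst_dir, exist_ok=True)
    os.link(src, dst)
```

With exist_ok=True the call is harmless when the directory is already there, so no separate existence check is needed.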
reporter - changed status to resolved
This has been fixed.
- changed status to closed
I've committed a patch that includes the operating system error in the message when os.link fails. Looking at the code and the debug output, I can't see any obvious reason why the link would fail. I've also replaced the sys.exit(1) with a return False, which disables checkpointing instead of aborting. Hopefully, when this happens again, the operating system error will help pinpoint the cause.
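The patch described in this comment might look roughly like the sketch below. It assumes the failure surfaces as an OSError from os.link. The function name try_link_checkpoint and the exact message layout are hypothetical, not SimFactory's actual code.

```python
import os
import sys


def try_link_checkpoint(src, dst):
    """Try to hard-link src to dst; return True on success, False on failure.

    Hypothetical helper for illustration; not SimFactory's actual code.
    """
    try:
        os.link(src, dst)
    except OSError as e:
        # Report the operating system error (errno and its description)
        # so the cause of the failure is visible in the log.
        sys.stderr.write(
            "Error: Could not link checkpoint file %s to %s: [Errno %s] %s\n"
            % (src, dst, e.errno, e.strerror)
        )
        # Return False instead of calling sys.exit(1), so the caller can
        # carry on without the checkpoint.
        return False
    return True
```

A missing destination directory would then show up as errno 2 (ENOENT) in the log, and an already existing destination file as errno 17 (EEXIST).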