Restart fails

Issue #109 closed
anonymous created an issue

A simulation restart with SimFactory failed with the following error message:

DEBUG: checkpoint file: /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
file=/work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
dfile=/work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
DEBUG: linking /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5 to /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5
before link
Error: Could not link checkpoint file /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0000/d3.0-mclachlan/checkpoint.chkpt.it_0.h5 to /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/d3.0-mclachlan/checkpoint.chkpt.it_0.h5

Comments (4)

  1. anonymous reporter
    • removed comment

    I've committed a patch that includes the operating system error in the message when os.link fails. Looking at the code and the debug output, I can't see any obvious reason why the link would fail. I've also replaced the sys.exit(1) with a return False, which will disable checkpointing. Hopefully, when this happens again, the operating system error will help pinpoint the cause.
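
A minimal sketch of the kind of change described in this comment, not the actual patch. The helper name `link_checkpoint` and the message format are assumptions made for illustration; only `os.link`, `sys.exit(1)`, and `return False` come from the comment itself.

```python
import os
import sys


def link_checkpoint(src, dst):
    """Hard-link src to dst; report the OS error and return False on failure.

    Hypothetical helper for illustration, not SimFactory's actual code.
    """
    try:
        os.link(src, dst)
    except OSError as exc:
        # exc.errno and exc.strerror carry the operating system's reason,
        # for example ENOENT when the destination directory is missing.
        sys.stderr.write(
            "Error: Could not link checkpoint file %s to %s: [Errno %s] %s\n"
            % (src, dst, exc.errno, exc.strerror)
        )
        return False  # let the caller carry on instead of sys.exit(1)
    return True
```

Returning False rather than exiting leaves it to the caller to decide what a failed link means, and the errno in the message distinguishes a missing directory from, say, a permission problem or a cross-device link.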

  2. anonymous reporter
    • removed comment

    On second thought, the failure might be because I wasn't checking that the restore_dir, in this case /work/eschnett/philip/simulations/d3.0-mclachlan-i0031/output-0001/, existed before attempting to link the file. I've added code to create the restore_dir if it doesn't exist.
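
The fix described in this comment could look roughly like the sketch below. The function name `link_into_restore_dir` is hypothetical, and creating the full parent directory of the destination file (rather than only the top-level restore_dir) is an assumption based on the paths in the debug output, where the checkpoint sits in a subdirectory of output-0001.

```python
import os


def link_into_restore_dir(src, dst):
    """Create the destination directory if needed, then hard-link src to dst.

    Hypothetical helper for illustration, not SimFactory's actual code.
    """
    dst_dir = os.path.dirname(dst)
    if dst_dir:
        # exist_ok=True (Python 3.2+) makes this a no-op when the
        # directory is already there, so no separate existence check.
        os.makedirs(dst_dir, exist_ok=True)
    os.link(src, dst)
```

Without the directory creation, `os.link` fails with "No such file or directory" whenever the new restart's output directory has not been set up yet, which would match the error reported above.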
