Checkpointing fails on LoneStar

Issue #534 closed
Ian Hinder created an issue

Hi,

I am starting to run on LoneStar, but find that I cannot checkpoint. This is for a production simulation. I see:

INFO (CarpetIOHDF5): ---------------------------------------------------------
INFO (CarpetIOHDF5): Dumping periodic checkpoint at iteration 9876, simulation time 18.5175
INFO (CarpetIOHDF5): ---------------------------------------------------------

on stdout, and there is nothing on stderr. The checkpoint files are partially written:

-rw------- 1 hinder G-25181 61M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_0.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_10.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_11.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_12.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_13.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_14.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_15.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_16.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_17.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_18.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_19.h5
-rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_1.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_20.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_21.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_22.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_23.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_24.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_25.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_26.h5
-rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_27.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_28.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_29.h5
-rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_2.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_30.h5
-rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_31.h5
-rw------- 1 hinder G-25181 59M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_3.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_4.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_5.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_6.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_7.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_8.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_9.h5

and invalid:

c334-106$ h5ls checkpoint.chkpt.tmp.it_9876.file_0.h5
checkpoint.chkpt.tmp.it_9876.file_0.h5: unable to open file

The job and the processes are all still running. Logging into one of the nodes and attaching gdb to the Cactus process yields:

0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
367         if (*head && vc->mrail.rfp.p_RDMA_recv != vc->mrail.rfp.p_RDMA_recv_tail)
(gdb) bt
#0  0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
#1  0x00002b8eb3f0426c in MPIDI_CH3I_read_progress (vc_pptr=0x7fff50b96cb0, v_ptr=0x7fff50b96cb8, is_blocking=259164064) at ch3_read_progress.c:130
#2  0x00002b8eb3f023dd in MPIDI_CH3I_Progress (is_blocking=1354329264, state=0x7fff50b96cb8) at ch3_progress.c:206
#3  0x00002b8eb3f6852c in MPIC_Wait (request_ptr=0x7fff50b96cb0) at helper_fns.c:518
#4  0x00002b8eb3f67e9c in MPIC_Recv (buf=0x7fff50b96cb0, count=1354329272, datatype=259164064, source=30, tag=0, comm=-788556288, status=0x1) at helper_fns.c:76
#5  0x00002b8eb3eee47e in MPIR_Bcast_OSU (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm_ptr=0x0) at bcast_osu.c:283
#6  0x00002b8eb3eed0c6 in PMPI_Bcast (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm=0) at bcast.c:1274
#7  0x0000000000c12fc4 in CarpetIOHDF5::WriteVarChunkedParallel (cctkGH=0x7fff50b96cb0, outfile=1354329272, io_bytes=@0xf7287a0, request=0x1e, called_from_checkpoint=false, indexfile=-788556288, $q8=<value optimized out>, $q9=<value optimized out>, $r0=<value optimized out>, $r1=<value optimized out>, $r2=<value optimized out>, $r3=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/Output.cc:519
#8  0x0000000000bf3f7f in CarpetIOHDF5::Checkpoint (cctkGH=0x7fff50b96cb0, called_from=1354329272, $W3=<value optimized out>, $W4=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:973
#9  0x0000000000bf360b in CarpetIOHDF5::CarpetIOHDF5_EvolutionCheckpoint (cctkGH=0x7fff50b96cb0) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:186
#10 0x0000000000413a5f in CCTK_CallFunction (function=0x7fff50b96cb0, fdata=0x7fff50b96cb8, data=0xf7287a0) at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:291
#11 0x00000000011c2eb1 in Carpet::CallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0, $01=<value optimized out>, $04=<value optimized out>, $05=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/CallFunction.cc:135
#12 0x0000000000418dea in CCTKi_ScheduleCallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0) at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:2826
#13 0x000000000041bc26 in CCTKi_DoScheduleTraverse (group_name=0x7fff50b96cb0 "", item_entry=0x7fff50b96cb8, item_exit=0xf7287a0, while_check=0x1e, if_check=0, function_process=0x2aabd0ff9600, data=0x7fff50b97848) at /work/00915/hinder/Cactus/llama/src/schedule/ScheduleTraverse.c:158
#14 0x0000000000414f19 in CCTK_ScheduleTraverse (where=0x7fff50b96cb0 "", GH=0x7fff50b96cb8, CallFunction=0xf7287a0) at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:812
#15 0x000000000116f6c0 in Carpet::CallAnalysis (cctkGH=0x7fff50b96cb0, $=2=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:556
#16 0x000000000116e755 in Carpet::Evolve (fc=0x7fff50b96cb0, $<1=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:81
#17 0x000000000040ccc5 in main (argc=4, argv=0x7fff50b98888) at /work/00915/hinder/Cactus/llama/src/main/flesh.cc:84

I'm not sure why CarpetIOHDF5 is performing MPI calls. This is with the stable version of Carpet.

Keyword: CarpetIOHDF5

Comments (20)

  1. Erik Schnetter

    There are two reasons why Carpet calls MPI during output. The first is for testing that the value of DISTRIB=constant grid arrays is the same on all processes; if not, a warning is output. This is probably what you are seeing here. The second is to ensure that all MPI processes have successfully finished closing the output files before the previous checkpoint files are deleted.

    It seems that Lonestar cannot handle a simple MPI_Bcast operation. This would indicate a problem with the MPI setup there. You could either look at Lonestar's documentation and compare it to the option list and run script you are using, or contact their help desk and ask for advice.
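
    For illustration, the consistency check described in the previous comment boils down to a collective pattern like the sketch below. This is not the actual CarpetIOHDF5 code; the function name and details are simplified assumptions. The point is that rank 0 broadcasts its copy of each DISTRIB=constant grid array and the other ranks compare it with theirs, so a misbehaving MPI_Bcast leaves every process blocked inside the checkpoint routine, exactly as in the backtrace above.

        // Sketch only -- not the actual CarpetIOHDF5 implementation.
        // Rank 0 broadcasts its copy of a DISTRIB=constant array; the other
        // ranks compare it with their local copy and warn on a mismatch.
        #include <mpi.h>
        #include <cstdio>
        #include <cstring>
        #include <vector>

        void check_constant_array(const double *local, int npoints, MPI_Comm comm)
        {
          int rank;
          MPI_Comm_rank(comm, &rank);

          // Every process enters this broadcast; if the MPI layer misbehaves
          // (as suspected with mvapich2 here), all of them block in MPI_Bcast.
          std::vector<double> reference(local, local + npoints);
          MPI_Bcast(reference.data(), npoints, MPI_DOUBLE, 0, comm);

          if (rank != 0 &&
              std::memcmp(reference.data(), local, npoints * sizeof(double)) != 0) {
            std::fprintf(stderr,
                         "WARNING: DISTRIB=constant array differs from process 0\n");
          }
        }

        // The second use mentioned above is a synchronisation point (e.g. a
        // barrier or equivalent) so that the previous checkpoint files are
        // only deleted after every process has closed the new ones.
        int main(int argc, char **argv)
        {
          MPI_Init(&argc, &argv);
          double data[4] = {1.0, 2.0, 3.0, 4.0};   // identical on every process here
          check_constant_array(data, 4, MPI_COMM_WORLD);
          MPI_Finalize();
          return 0;
        }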

  2. Ian Hinder reporter

    I have tried running with OpenMPI, which is also provided as a module on LoneStar but is not the default (the default is mvapich2). With OpenMPI the checkpointing proceeds without problems. Shall I commit the change to the scripts in simfactory?

  3. Frank Löffler

    I just recently (yesterday) restarted successfully using the standard simfactory2 configuration on lonestar, with the hg version of Carpet, if that matters. This was using mvapich2. It wrote the checkpoint without problems and restarted from it as well, as far as I can see. This was using 16 checkpoint files per checkpoint (running on 16 nodes, 12*16 cores).

  4. Ian Hinder reporter

    I also managed to checkpoint successfully, but with a tiny test run on 2 processes, so it probably depends on the precise setup that you run with. Since the hang happens in the check for grid scalars, it might also depend on how many of those you have. I also experienced an unusual hang with IsolatedHorizon under mvapich2 which went away when I switched to QuasiLocalMeasures, but that might also have been fixed by going to OpenMPI.

  5. Ian Hinder reporter

    Frank, do you agree that I should make LoneStar use OpenMPI in simfactory? Since we have instances where it failed with mvapich2 and succeeded with OpenMPI, and no instances of it failing with OpenMPI, it sounds like OpenMPI is the better choice. I have also seen from Google searches that people recommend OpenMPI over mvapich2 on lonestar for their codes (http://g-rsm.wikispaces.com/lonestar+at+TACC). I will wait until my production simulations with OpenMPI have finished to make sure that there are no further problems.

  6. Frank Löffler

    I don't care which MPI version is used, as long as it works. I just commented here because nobody else had reported either failure or success with mvapich2, and I haven't seen these problems so far. Maybe the differences between the git and hg versions are large enough that the problem does not appear for me, whether that problem is in Carpet or in mvapich2. I actually have something running now using the git version (not because of this ticket), and it might be interesting to see whether I can checkpoint it. Also, I believe Roland is using the Curie ET on lonestar, and I know that he could checkpoint without problems using the git version as well. So, while I am not per se against changing the MPI version, I also see some people who have no problems with the current setup. It would be interesting to know why, although I agree it might be hard to find out. How many people have tried OpenMPI on lonestar, and was there a speed penalty?

  7. Frank Löffler

    I forgot to mention: I also recently had a >3k-core run on ranger, using the Curie release (and thus the git version of Carpet), and it checkpointed multiple times without any problem. This makes it unlikely that the problem appears only with large jobs.

  8. Frank Löffler

    Ian: do you still see this problem? It might be worthwhile to find out why it works for me (and others) and doesn't for you. Maybe your runs include some variables/arrays/something which give rise to these problems?

  9. Erik Schnetter

    Ian, why don't you commit a new option list (and run script) called lonestar-openmpi.cfg? In this way, everybody can try, and switching over is then simply a matter of choosing a different configuration in the machine file. This will also let people more easily compare the two settings; e.g. Frank could compare performance (which would be the only reason why one of the MPI libraries may be preferable if we can make both work for everybody).

  10. Ian Hinder reporter

    I believe that Barry was able to run successfully very similar parameter files to mine with exactly the same code, so I am at a loss as to why I was seeing those hangs. Maybe it was a temporary problem. I will try to go back to mvapich2 and see if the problem reappears.

  11. Barry Wardell

    Replying to [comment:11 hinder]:

    > I believe that Barry was able to run successfully very similar parameter files to mine with exactly the same code, so I am at a loss as to why I was seeing those hangs. Maybe it was a temporary problem. I will try to go back to mvapich2 and see if the problem reappears.

    Yes, I have run several simulations with several restarts each and have not encountered this problem at all. My code is identical to Ian's and my parameter files are almost the same too.

  12. Frank Löffler

    Looking closer at my runs, I see that they sometimes (about once every one or two days of runtime) stall for some time. I just caught one of these occasions, and attaching gdb revealed that the processes were stalled in MPI calls within reductions called from IOASCII. Typically the runs eventually continue after such a stall, but the "lost" time is clearly visible in an M/h plot. I am not sure whether it is the I/O or the MPI that is hanging; both are involved in your case and in mine. I don't seem to have problems accessing the files, though. Maybe it's worth trying OpenMPI after all. Ian: would you be interested in providing such an option list?

  13. Ian Hinder reporter

    The optionlist has been committed. I would like to change the default optionlist for LoneStar to lonestar-openmpi.cfg for the ET release. Two people (Frank and I) have now seen problems with MVAPICH, and no one has reported a problem with OpenMPI. The only issue at the moment is that we were doing production simulations and hence were using the stable version of Carpet, which means that the development version has not been tested with OpenMPI on LoneStar. I would be surprised if this led to problems, given that OpenMPI is used on many other machines without trouble. In the time before the release I can't do extensive testing, so I propose that I run a test BBH simulation on LoneStar and, if that works with OpenMPI, commit the OpenMPI parameter file as the default. I think this is better than releasing with MVAPICH, which we have seen to have problems.

  14. Frank Löffler

    I agree. Also, please make sure that the test suites are run successfully with the new version at least once before the release.

  15. Erik Schnetter

    Please commit the new option list right away, under a new name. Switching over is then just a matter of changing the default option list in the machine description. This allows others to test it, and will let people switch back later to compare.

  16. Ian Hinder reporter

    The optionlist and runscript were committed as r1462 on 12-Sep-2011 (see first sentence of comment:15 :) ).

  17. Ian Hinder reporter
    • changed status to resolved

    The default optionlist and runscript have been changed in lonestar.ini, qc0-mclachlan was run with good results, and the test suites were re-run with the new configuration. Closing the ticket.
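
    For reference, the change described here amounts to pointing simfactory's LoneStar machine description at the new files. A minimal sketch of such an entry is shown below; the key names follow the usual simfactory machine-file format, but the section name and file names are assumptions for illustration, not copied from the actual commit:

        [lonestar]
        # hypothetical excerpt: select the OpenMPI option list and run script
        optionlist = lonestar-openmpi.cfg
        runscript  = lonestar-openmpi.run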
