Over the past few months, I have been using RIT's LazEv code with only minor hiccups on Stampede (in particular, the unreproducible 'dapl_conn_rc' crashes that I'm sure other Stampede users are familiar with). That checkout was of the previous release, ET_2013_05, compiled with Intel MPI. Most of the jobs I ran took advantage of some symmetry, and I was able to run on 12-16 nodes at about 50-60% memory usage.
After the sync issue was backported, I checked out the new release, ET_2013_11, and immediately ran into problems. The first issue, with run performance and LoopControl, was sorted out with the mailing list's help. The second is with crashes and checkpointing. With both the Intel MPI and MVAPICH2 configurations, the code would hang about 50% of the time when dumping a regular checkpoint, and 100% of the time when dumping a termination checkpoint. Furthermore, crashes seemed more frequent, and I could not get a simulation to run for a full 24 hours without failing (either by stalling on a checkpoint or otherwise).
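For reference, checkpointing in these runs is driven by the usual IOUtil parameters; a minimal sketch of the relevant settings follows (the values here are illustrative placeholders, not the exact ones from my parameter file):

```
# Sketch of the IOUtil checkpoint settings in use (placeholder values)
IO::checkpoint_dir          = "checkpoints"
IO::checkpoint_every        = 1024        # periodic dumps: hang ~50% of the time
IO::checkpoint_on_terminate = yes         # termination dump: hangs 100% of the time
IO::recover                 = "autoprobe" # recover from a checkpoint if one exists
IO::recover_dir             = "checkpoints"
```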
So, I checked out a clean version of the toolkit, with only toolkit thorns, and removed any thorns specific to RIT. I compiled with both the Intel MPI and MVAPICH2 configurations in simfactory.
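The build and submission steps were the standard simfactory ones; sketched below, with the configuration names "stampede-impi" and "stampede-mvapich2" and the simulation name "qc0-test" being placeholders rather than my exact setup:

```shell
# Sketch of the simfactory workflow used for both MPI stacks
# (configuration and simulation names are placeholders).
./simfactory/bin/sim build stampede-impi     --thornlist manifest/einsteintoolkit.th
./simfactory/bin/sim build stampede-mvapich2 --thornlist manifest/einsteintoolkit.th

# Create and submit a run from the stock qc0 parameter file
./simfactory/bin/sim create-submit qc0-test \
    --configuration stampede-impi \
    --parfile par/qc0-mclachlan.par \
    --procs 320 --walltime 24:00:00
```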
In both cases, I can run the 'qc0-mclachlan.par' file to completion with no issues. So I edited the qc0 parfile to update the grid, remove the symmetries, and change the initial data to match my test parameter file. I ran the job on 20 nodes and, with either configuration, was unable to run it to completion on any of my numerous attempts. The Intel MPI runs die with the standard, unhelpful "dapl_conn_rc" error at random times in the evolution, and the MVAPICH2 runs die with:
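The edits to the qc0 parfile were of the following general shape; this is a sketch only, with placeholder values and parameter choices that are assumptions, not my exact grid or initial data:

```
# Sketch of the qc0-mclachlan.par edits (placeholder values)

# 1. Run on the full domain instead of the symmetry-reduced one
CoordBase::zmin = -120.0             # was 0 with z-reflection symmetry
CoordBase::zmax = +120.0
ReflectionSymmetry::reflection_z = no

# 2. Update the grid to match my production setup
Carpet::max_refinement_levels = 9

# 3. Adjust the TwoPunctures initial data to match my test parameter file
TwoPunctures::par_b = 5.0            # placeholder separation parameter
```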
[c431-903.stampede.tacc.utexas.edu:mpispawn_7][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[c431-903.stampede.tacc.utexas.edu:mpispawn_7][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c431-903.stampede.tacc.utexas.edu:mpispawn_7][child_handler] MPI process (rank: 15, pid: 106620) terminated with signal 9 -> abort job
[c429-501.stampede.tacc.utexas.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_7 from node c431-903 aborted: Error while reading a PMI socket (4)
The Intel MPI jobs died with the same dapl_conn_rc error at run times of 2 hours, 8 hours, and 21 hours. I also had one job that hung and did not exit until it was killed by the queue manager. The MVAPICH2 jobs died at around 3 hours and 8 hours with the error above.
We've been in contact with TACC and they said it was a Cactus issue, so I am sending this report.
Attached is the parameter file I used for the tests. It should work with a stock ET_2013_11 checkout.