I found (in SpEC) that mvapich2 had issues with memory fragmentation that were not present in impi. So I would not switch to mvapich2 as the default before the release without first testing it with a full-sized simulation.
I have had endless problems with Intel MPI, both on Stampede and on Datura in the past. We now use OpenMPI on Datura. On Stampede, I recently tried to start a 256-core job using Intel MPI, and got the following error:
Fatal error in MPI_Init: Other MPI error, error stack:
MPID_Init(195)........................: channel initialization failed
dapl_rc_setup_all_connections_20(1272): generic failure with errno = 671092751
MPID_nem_dapl_get_from_bc(1239).......: Missing port or invalid host/port description in business card
immediately after the ibrun command. This is intermittent; repeating the same run worked fine. Looking back through my email history, I see that I also had this error:
[59:c453-703][../../dapl_poll_rc.c:1360] Intel MPI fatal error: ofa-v2-mlx4_0-1 DTO operation posted for [1:c411-103] completed with error. status=0x8. cookie=0x0
Assertion failed in file ../../dapl_poll_rc.c at line 1362: 0
internal ABORT - process 59
This error would abort the run after several hours of runtime; Roland has also seen it with SpEC.
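Since the MPI_Init failure above is intermittent and a repeat of the same run worked, one possible workaround is to retry the launch when that specific error appears. The following is only a minimal sketch under stated assumptions: the helper names (`is_startup_failure`, `launch_with_retry`) are hypothetical and not part of simfactory or any runscript, and the whole output is captured in memory, which is only reasonable for checking the startup phase, not for a long production run.

```python
import subprocess
import time

# Text of the intermittent startup failure quoted above.
STARTUP_FAILURE = "Fatal error in MPI_Init"


def is_startup_failure(returncode, output):
    """Return True if the launch failed and the output shows the MPI_Init error."""
    return returncode != 0 and STARTUP_FAILURE in output


def launch_with_retry(command, max_attempts=3, delay_seconds=10.0):
    """Run `command`, retrying only when it dies with the MPI_Init error.

    `command` is an argument list, e.g. the launcher line from a runscript
    (hypothetical here). Returns (returncode, attempts_used).
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so the error text is seen
            text=True,
        )
        returncode = result.returncode
        if not is_startup_failure(returncode, result.stdout):
            # Either success or a different kind of failure: do not retry.
            return returncode, attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return returncode, max_attempts
```

This would not help with the second error, which only appears after several hours of runtime.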
In contrast, I have been running production runs with mvapich on Stampede for many months, and have never had any such problems. As Erik pointed out, mvapich is also the system default on Stampede.
I have just tested Intel MPI and mvapich with qc0-mclachlan, and with a higher-resolution version that uses about 80% of the memory on 256 cores. The speed on 32 cores (low resolution) and on 256 cores (high resolution) appears to be similar between the two MPI versions.
Since I have had so many problems with Intel MPI, I suggest that we change the simfactory default to mvapich. The only reported problem that I can find is that Roland saw memory usage increasing with time with SpEC, but since we have not seen the same with Cactus, I don't think this should influence the decision.
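To check whether a Cactus run shows the same growth in memory usage over time, one option is to fit a straight line to memory samples taken during the run. This is a minimal sketch, assuming the samples are already available as two lists (times and resident memory); how they are collected is left open, and the function name `memory_growth_rate` is hypothetical.

```python
def memory_growth_rate(times, rss):
    """Least-squares slope of memory usage against time.

    `times` and `rss` are equal-length sequences of samples (hypothetical
    input; any consistent units). The result is in units of rss per unit time.
    """
    n = len(times)
    if n != len(rss) or n < 2:
        raise ValueError("need at least two (time, rss) samples of equal length")
    mean_t = sum(times) / n
    mean_m = sum(rss) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples were taken at the same time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, rss))
    return cov / var_t
```

A slope close to zero over a long run would be consistent with not seeing the problem; a clearly positive slope would suggest that memory usage is growing.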
I have a tested optionlist and runscript ready to commit. I have also made sure that you can switch to Intel MPI by just selecting the required optionlist and runscript; the compilation and run are independent of the module command, so the Intel MPI module does not need to be loaded in envsetup (it doesn't do anything that isn't taken care of in the optionlist and runscript anyway).
I have also run the ET testsuite with mvapich, and get only the expected failures (ADM, etc.).
OK to change the default in simfactory to mvapich?
The comment at the bottom of "stampede.ini" is wrong; please remove or update it. In addition to this change, the list of loaded modules also needs to be changed.
Can you use the suffix "mvapich2" instead of "mvapich"?
I have changed the suffix to mvapich2. The comment is not wrong. As stated in my comment above, the module command is not necessary: to use Intel MPI, you just need to select the required optionlist and runscript, so configurations with either MPI version can share the same machine definition file. I checked the module definitions, and also tested this explicitly. Erik and I discussed this, and he agrees that it can be committed as-is.