
Running Sunrise with MPI

Overview

Several other Monte Carlo codes can use MPI but, as far as I know, they use it to extend the "embarrassingly parallel" aspect of the calculation: they duplicate the grid on every task and then run the MC independently. This works great for speeding up a calculation, but it requires that each node have enough memory to fit the entire problem. Since the use case for Sunrise commonly involves running many snapshots at the same time, this kind of parallelism doesn't buy you much; you can already just run the different snapshots in parallel.

However, if the problem you want to run is too large to fit in memory on the nodes you are using, the problem domain instead needs to be split into smaller chunks that do fit on the individual nodes. This is much more complicated: the domain needs to be decomposed into chunks, and individual rays need to be transported between the nodes, similar to how a distributed-memory hydro code works. The Sunrise MPI functionality has been developed with this situation in mind.

Unfortunately, the distributed-domain parallelism has large communication requirements. Every ray that crosses into another task's domain must be sent to that task, where it may scatter and come back again, and so on. This makes the distributed-domain functionality quite expensive and slow compared to running on a single node. Splitting a simulation over two nodes will take roughly twice as long as running it on one node, so do not use MPI to speed up runs, only to run a large simulation that won't fit in memory on one node. The rule of thumb is thus:

Use as few MPI tasks (1 if possible) as is necessary to not run out of memory on the nodes.
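As a purely illustrative example (the numbers are made up): if a snapshot needs roughly 100 GB of memory and your nodes have 48 GB of RAM each, three MPI tasks on three nodes is the natural choice; adding more tasks would only increase the communication cost.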

Also note that the results will not be invariant under changes in the number of nodes, because the random numbers will be different, and scatterings are always stratified to happen on each node (this aspect will be explained in the paper describing the MPI version, which is in the works).

How to run

Running mcrx in MPI mode is pretty straightforward (only mcrx supports MPI, not sfrhist or broadband). A few things should be noted:

  • When specifying n_threads, set it to the number of cores on the nodes minus 1. One extra thread is used for the MPI master task that handles the sending and receiving of rays, and if it doesn't have a free core, it won't be able to keep up with the communication.
  • The keyword domain_level determines at which level in the octree the domain decomposition is done. Higher levels enable better load balancing, but there's an associated overhead. Since you'll be using few MPI tasks, level 2-3 should be satisfactory.
  • Run mcrx as usual, but launch it with mpirun or however your batch system wants it. You must configure the job so that the batch system spawns only one MPI process per node, which is different from e.g. running Gadget, which uses one MPI task per core. How to do this varies with the batch system, so you'll have to figure it out for your cluster (see the sketch after this list).
  • The mcrx output will be a bit duplicated, in that each MPI task will print most of the info. The work of going in and changing all the printouts to only print from one task remains to be done.
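To make this concrete, below is a minimal batch-script sketch for a two-node run. It assumes SLURM, Open MPI, and 16-core nodes; the job name, node and core counts, the keyword/value layout of the config lines, and the invocation mcrx mcrx.config are placeholders to adapt to your own cluster and setup.

    #!/bin/bash
    #SBATCH --job-name=mcrx-mpi
    #SBATCH --nodes=2              # one node per MPI task
    #SBATCH --ntasks-per-node=1    # exactly one MPI process per node
    #SBATCH --cpus-per-task=16     # reserve the whole node for that process

    # In the mcrx config file, set n_threads to cores-1 (here 15) so one core
    # stays free for the communication thread, and keep domain_level modest
    # since only two tasks are used, e.g.:
    #   n_threads    15
    #   domain_level 2

    # With Open MPI, --npernode 1 also enforces one process per node.
    mpirun --npernode 1 mcrx mcrx.config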

When the shooting is running, mcrx currently dumps out statistics, like how many rays have been created and sent/received on the tasks, every 10 seconds or so. This should probably be removed for production runs.

Bugs/Issues

The MPI functionality is new and has not been tested extensively, so problems are likely to show up on different systems, different hardware, etc.

One known issue is that the Open MPI "sm" BTL (the shared-memory byte transfer layer) seems to be buggy in multithreaded situations. If you get an error about "sm btl received an unknown type of header", try switching to TCP communication instead of shared memory by giving mpirun the argument -mca btl tcp,self. If you are using other communication BTLs (like InfiniBand), you need to add them too. (It's possible this only crops up when running more than one MPI task on the same node, which you should only do for testing.)
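
For example, on an InfiniBand cluster running Open MPI the workaround could look like the line below; openib is assumed here to be the InfiniBand BTL in use, and mcrx mcrx.config again stands in for however you normally start mcrx:

    mpirun --npernode 1 -mca btl tcp,self,openib mcrx mcrx.config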
