Optimising your runs
A quick guide to optimisations
This is a placeholder page which will "soon" be filled with lots of useful information regarding flags, compilation options and tips for making your runs go faster or more efficiently.
Input flags
What can you change in the input file?
layouts_knobs
Variable | Default | Description |
---|---|---|
opt_redist_persist | .false. | Set to true to use persistent (non-blocking) communications in the redistribute routines. Requires opt_redist_nbk to be .true. |
opt_redist_persist_overlap | .false. | Set to true to attempt to overlap the communication and local parts of the redistribute routines. Requires opt_redist_persist to be .true. |
opt_redist_nbk | .false. | Set to true to use non-blocking comms in the redistribute routines. |
opt_redist_init | .false. | Set to true to use optimised initialisation routines for creating redist objects. This should be beneficial for all but the smallest number of processors. |
intmom_sub | .false. | When set to true we will use sub-communicators to do the reduction associated with calculating moments of the dist. fn. These sub-communicators involve all procs with a given XYS block. This is particularly advantageous for collisional runs which don't use LE layouts. PLEASE NOTE: These changes only affect calls to integrate_moments with complex variables and where the optional 'all' parameter is supplied. There is no gather of data from other procs, so the integration result is only known for the local XYS block. This could cause a problem for diagnostics which want to write the full array if they also pass "all" (see 1). |
intspec_sub | .false. | When set to true we will use sub-communicators to do the reduction associated with calculating species-integrated moments of the dist. fn. These sub-communicators involve all procs with a given XY block. This should help improve the field update scaling. PLEASE NOTE: Unlike intmom_sub, we currently gather the results from other procs so that we are sure to have the whole array populated correctly. |
local_field_solve | .false. | When set to true we force the response matrix initialisation (including inversion) to be done serially on every processor. This is said to be beneficial on machines with slow network communications, and it is probably also the optimal choice for "small" problems. The definition of small depends on the ratio of the machine's processor speed to its communication speed. It is generally worth trying this option for your problem. |
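To experiment with these options, the flags above are set in the layouts_knobs namelist of the input file. A minimal sketch (the particular values are illustrative starting points, not universal recommendations):

```fortran
&layouts_knobs
  opt_redist_nbk = .true.     ! non-blocking comms in the redistribute routines
  opt_redist_init = .true.    ! optimised initialisation of redist objects
  intspec_sub = .true.        ! sub-communicators for species-integrated moments
  local_field_solve = .false. ! try .true. for "small" problems / slow networks
/
```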
fields_knobs
Variable | Default | Description |
---|---|---|
do_smart_update | .false. | Used with field_option='local'. If .true. and the x/y dimensions are distributed, then during the advance only the local part of the field is updated in operations like "phinew=phinew+phi" etc. |
field_subgath | .false. | Set to true to use allgatherv to fetch parts of the field update vector calculated on other procs. When false uses a sum_allreduce instead. This doesn't rely on sub-communicators so should work for any layout and processor count. |
field_option | 'default' | Set to 'local' to use a version of the implicit algorithm with a different data decomposition. This enables effective use of a number of internal optimisations, is often quite a bit faster to initialise, and should scale better during the advance. |
field_local_allreduce | .false. | Used with field_option='local'. If .true. then the reduction in the field calculation is performed with an allreduce rather than a reduce followed by a broadcast. |
field_local_allreduce_sub | .false. | Used with field_option='local' and field_local_allreduce=.true. If .true. then the allreduce is performed on a sub-communicator; requires intspec_sub=.true. |
minnrow | 64 | Used with field_option='local' to set the minimum block size (in a single supercell) assigned to a single processor. Tuning this parameter changes the balance between work parallelisation and communication. The lower this is set, the more communication has to be done, but the fewer processors are left without work (i.e. it helps reduce computation time). The optimal value is likely to depend on the size of the problem and the number of processors being used. Furthermore, it will affect initialisation and the advance in different ways. |
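For example, a fields_knobs namelist enabling the 'local' field solver and its related options might look like the following sketch (values illustrative; minnrow in particular should be tuned for your problem size and processor count):

```fortran
&fields_knobs
  field_option = 'local'  ! implicit algorithm with alternative decomposition
  do_smart_update = .true.
  field_subgath = .false. ! .true. switches sum_allreduce -> allgatherv
  minnrow = 64            ! minimum per-processor block size; tune this
/
```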
dist_fn_knobs
Variable | Default | Description |
---|---|---|
opt_init_bc | .false. | Set to .true. to use an optimised version of init_connected_bc. This routine can become relatively expensive at large core counts or for large problems. Note that whilst the optimised routine is faster than the default routine, it still does not scale well; there may be room to drastically improve it in the future. |
opt_source | .false. | If true then use an optimised linear source calculation which uses pre-calculated coefficients, calculates both sigma values together and skips work associated with empty fields. This can contribute 10-25% savings for linear electrostatic collisionless simulations; for more complicated runs the savings will likely be smaller. If enabled, memory usage will increase due to an additional array 2-4 times the size of gnew. It can potentially slow down certain runs. |
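A sketch of a dist_fn_knobs namelist with both optimisations enabled (most useful for linear electrostatic collisionless runs, as noted above):

```fortran
&dist_fn_knobs
  opt_init_bc = .true. ! optimised init_connected_bc
  opt_source = .true.  ! optimised linear source calculation
/
```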
Compilation options
What can you change when compiling?
General tips
Any hints and tips.
- The response matrices can be dumped to file once calculated (see dump_response). This allows subsequent runs to read in the response matrix (see read_response) rather than recalculating it. This can be useful in cases where you are restarting a run without having changed the time step (e.g. after your walltime has expired). Note: These files are written/read by a single processor. This means that they can be used with runs on different numbers of cores (and even on different machines provided GS2 has been compiled with netcdf support in both cases).
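As a sketch, assuming dump_response and read_response live in the fields_knobs namelist (check the namelist documentation for your GS2 version), a first run would write the matrices and a restarted run would read them back:

```fortran
! First run: calculate the response matrices and dump them to file.
&fields_knobs
  field_option = 'local'
  dump_response = .true.
/

! Restarted run (same time step): read them back instead of recalculating.
! &fields_knobs
!   field_option = 'local'
!   read_response = .true.
! /
```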
Machine specific guides
Useful information for specific machines
Helios
- Load the Intel compiler, MPI, FFTW3, parallel NetCDF, and parallel HDF5 libraries as follows
  ```
  module load intel bullxmpi netcdf_p hdf5_p fftw/3.3.3
  ```
- Compile using appropriate compile flags
  ```
  make USE_FFT=fftw3 USE_PARALLEL_NETCDF=on
  ```
- Remember to load the above libraries at run time in the submission script!