Optimising your runs
A quick guide to optimisations
This is a placeholder page which will "soon" be filled with lots of useful information regarding flags, compilation options and tips for making your runs go faster or more efficiently.
Input flags
What can you change in the input file?
layouts_knobs
Variable | Default | Description |
---|---|---|
opt_redist_persist | .false. | Set to true to use persistent (non-blocking) communications in the redistribute routines. Requires opt_redist_nbk to be .true. |
opt_redist_persist_overlap | .false. | Set to true to attempt to overlap the communication and local parts of the redistribute routines. Requires opt_redist_persist to be .true. |
opt_redist_nbk | .false. | Set to true to use non-blocking comms in the redistribute routines. |
opt_redist_init | .false. | Set to true to use optimised initialisation routines for creating redist objects. This should be beneficial for all but the smallest number of processors. |
intmom_sub | .false. | When set to true we will use sub-communicators to do the reduction associated with calculating moments of the dist. fn. These sub-communicators involve all procs with a given XYS block. This is particularly advantageous for collisional runs which don't use LE layouts. PLEASE NOTE: These changes only affect calls to integrate_moments with complex variables and where the optional 'all' parameter is supplied. There is no gather of data from other procs, so the integration result is only known for the local XYS block. This could cause a problem for diagnostics which want to write the full array if they also pass "all" (see 1). |
intspec_sub | .false. | When set to true we will use sub-communicators to do the reduction associated with calculating species-integrated moments of the dist. fn. These sub-communicators involve all procs with a given XY block. This should help improve the field update scaling. PLEASE NOTE: Unlike intmom_sub, we currently gather the results from other procs so that we are sure to have the whole array populated correctly. |
local_field_solve | .false. | When set to true we force the response matrix initialisation (including inversion) to be done serially on every processor. This is said to be beneficial on machines with slow network communications, and it is probably also the optimal choice for "small" problems. The definition of small depends on the ratio of the machine's processor speed to its communication speed. It is generally worth trying this option for your problem. |
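To experiment with these options, the flags above are set in the layouts_knobs namelist of the input file. A minimal sketch (the particular values are illustrative starting points, not universal recommendations):

```fortran
&layouts_knobs
  opt_redist_nbk = .true.     ! non-blocking comms in the redistribute routines
  opt_redist_init = .true.    ! optimised initialisation of redist objects
  intspec_sub = .true.        ! sub-communicators for species-integrated moments
  local_field_solve = .false. ! try .true. for "small" problems / slow networks
/
```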
fields_knobs
Variable | Default | Description |
---|---|---|
do_smart_update | .false. | Used with field_option='local'. If .true. and the x/y dimensions are distributed, then during the advance only the local part of the field is updated in operations like "phinew=phinew+phi" etc. |
field_subgath | .false. | Set to true to use allgatherv to fetch parts of the field update vector calculated on other procs. When false uses a sum_allreduce instead. This doesn't rely on sub-communicators so should work for any layout and processor count. |
field_option | 'default' | Set to 'local' to use a version of the implicit algorithm with a different data decomposition. This enables effective use of a number of internal optimisations, is often quite a bit faster to initialise, and should scale better during the advance. |
field_local_allreduce | .false. | Used with field_option='local'. If .true. then the reduction in the field calculation is performed with an allreduce rather than a reduce followed by a broadcast. |
field_local_allreduce_sub | .false. | Used with field_option='local' and field_local_allreduce=.true. If .true. then the allreduce is performed on a sub-communicator; requires intspec_sub=.true. |
minnrow | 64 | Used with field_option='local' to set the minimum block size (in a single supercell) assigned to a single processor. Tuning this parameter changes the balance between work parallelisation and communication. The lower this is set, the more communication has to be done, but the fewer processors are left without work (i.e. it helps reduce computation time). The optimal value is likely to depend on the size of the problem and the number of processors being used. Furthermore, it will affect initialisation and the advance in different ways. |
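For example, a fields_knobs namelist enabling the 'local' field solver and its related options might look like the following sketch (values illustrative; minnrow in particular should be tuned for your problem size and processor count):

```fortran
&fields_knobs
  field_option = 'local'  ! implicit algorithm with alternative decomposition
  do_smart_update = .true.
  field_subgath = .false. ! .true. switches sum_allreduce -> allgatherv
  minnrow = 64            ! minimum per-processor block size; tune this
/
```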
dist_fn_knobs
Variable | Default | Description |
---|---|---|
opt_init_bc | .false. | Set to .true. to use an optimised version of init_connected_bc. This routine can become relatively expensive at large core counts or for large problems. Note that whilst the optimised routine is faster than the default routine, it still does not scale well; there may be room to drastically improve it in the future. |
opt_source | .false. | If true then use an optimised linear source calculation which uses pre-calculated coefficients, calculates both sigma values together and skips work associated with empty fields. This can contribute 10-25% savings for linear electrostatic collisionless simulations; for more complicated runs the savings will likely be smaller. If enabled, memory usage will increase due to an additional array 2-4 times the size of gnew. It can potentially slow down certain runs. |
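A sketch of a dist_fn_knobs namelist with both optimisations enabled (most useful for linear electrostatic collisionless runs, as noted above):

```fortran
&dist_fn_knobs
  opt_init_bc = .true. ! optimised init_connected_bc
  opt_source = .true.  ! optimised linear source calculation
/
```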
Compilation options
What can you change when compiling?
General tips
Any hints and tips.
- The response matrices can be dumped to file once calculated (see dump_response). This allows subsequent runs to read in the response matrix (see read_response) rather than recalculating it. This can be useful in cases where you are restarting a run without having changed the time step (e.g. after your walltime has expired). Note: These files are written/read by a single processor. This means that they can be used with runs on different numbers of cores (and even on different machines provided GS2 has been compiled with netcdf support in both cases).
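As a sketch, assuming dump_response and read_response live in the fields_knobs namelist (check the namelist documentation for your GS2 version), a first run would write the matrices and a restarted run would read them back:

```fortran
! First run: calculate the response matrices and dump them to file.
&fields_knobs
  field_option = 'local'
  dump_response = .true.
/

! Restarted run (same time step): read them back instead of recalculating.
! &fields_knobs
!   field_option = 'local'
!   read_response = .true.
! /
```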
Machine specific guides
Useful information for specific machines
Helios
- Load the Intel compiler, MPI, FFTW3, parallel NetCDF, and parallel HDF5 libraries as follows
  ```
  module load intel bullxmpi netcdf_p hdf5_p fftw/3.3.3
  ```
- Compile using appropriate compile flags
  ```
  make USE_FFT=fftw3 USE_PARALLEL_NETCDF=on
  ```
- Remember to load the above libraries at run time in the submission script!