Severe performance problem on Stampede

Issue #1850 closed
Ian Hinder created an issue

With the ET_2015_11 release, there is a severe performance problem on Stampede. This is when hwloc and SystemTopology are not activated. Activating these thorns causes simulations to run 8 times faster. This suggests that the affinity settings in simfactory for stampede are wrong. stampede-mvapich2.run has

export KMP_AFFINITY=norespect,compact # verbose

Is this correct? Looking at the output of "top", we see the expected 16 threads, but each is running at only 50%. There is no migration between cores, as far as we can tell. This 50% should be 100%, and this doesn't explain the factor of 8 slowdown, but it shows that there is something wrong.

Comments (36)

  1. Ian Hinder reporter

    Removing the setting of KMP_AFFINITY restores the speed to something similar to what is obtained from using hwloc and SystemTopology.

  2. anonymous

    I've noticed that removing the setting of KMP_AFFINITY while leaving SystemTopology and hwloc on also results in reduced speed. Is this expected behavior?

  3. Ian Hinder reporter

    I wouldn't have expected that. How much is it reduced? Can you give examples of your speeds in each of the three cases?

  4. Roland Haas

    The issue is sometimes that things run before SystemTopology and hwloc have had a chance to set affinity, e.g. MPI_INIT, which may itself allocate memory that is then bound to the "wrong" core. If I remember correctly, there is a ticket/email thread somewhere in which Erik suggests adding a pre-MPI_INIT hook to the flesh so that SystemTopology can set affinity early. Note that this has to happen (as the thread explains) before parameter files are parsed, since one cannot access argv before MPI_INIT, so in particular it will ignore ActiveThorns settings in parfiles.

  5. Ian Hinder reporter

    Would it make sense for hwloc/SystemTopology to refuse to change the affinity if it has already been set? We probably don't want it set in two places.

  6. Erik Schnetter

    Affinity isn't "set" or "unset"; it's always set to a particular setting. We can introduce an environment variable to tell SystemTopology to do nothing; one then has to set it in the run script.

    Is there a reason you want to run without SystemTopology, or you think this should be the default? The thorn is there for a reason -- not to be fancy, but because it's basically the only way to ensure things run efficiently on most systems. Everything else requires very system-specific settings that are very fragile.

  7. Frank Löffler

    To summarize the current status: setting KMP_AFFINITY seems to be necessary for performance when using SystemTopology, but is harmful when not using it: either you have to use both, or none. Do I understand this correctly?

    If so:
    - Is it understood why using SystemTopology results in poor performance when KMP_AFFINITY isn't set?
    - Is it understood why using KMP_AFFINITY results in poor performance when not using SystemTopology (but not using KMP_AFFINITY seems to be fine)?

    My comments: It could be that "using KMP_AFFINITY" isn't the right way to think about this. We set it specifically to "norespect,compact". Maybe we should set it to something different to work without SystemTopology.

  8. Ian Hinder reporter

    The problem is that all existing parameter files would need to be modified to include activation of SystemTopology, otherwise they will run very slowly. There is no warning about this; using the same parameter file that worked before, you just get really bad performance. People probably also have their own thornlists which don't activate SystemTopology. I think the best option would be to correct the affinity settings on stampede, so that they don't cause the code to be slow. Is it clear that they are wrong?

  9. Erik Schnetter

    We can activate SystemTopology automatically, and/or require it by thorn MPI.

    We can correct the affinity settings on Stampede. But what about Blue Waters, Comet, Cori, Edison, ...? For each system, and for each combination of (number of MPI processes per node, number of OpenMP threads per MPI process), we need to check that the run script sets up something useful.

    In this case here, we didn't even discuss the number of MPI processes and OpenMP threads used, and we didn't check that the new "good" settings do something reasonable for other cases.

    There are at least three important cases: one MPI process per node, one per socket, and one per core; and if someone experiments with under-subscribing a node (because of running out of memory), we don't silently want to do the wrong thing. The respective if statements and conditions in the run scripts can be hairy.
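
The three cases listed above can be told apart from the node geometry and the requested layout. A minimal Python sketch, assuming a hypothetical helper `classify_layout` that is not part of simfactory; the usage lines assume a node with 2 sockets of 8 cores each, which matches the 16-core Stampede nodes mentioned in this ticket, although the socket count itself is not stated in the thread:

```python
def classify_layout(sockets_per_node, cores_per_socket, procs_per_node, threads_per_proc):
    """Classify an MPI/OpenMP layout on one node (illustrative helper only)."""
    cores_per_node = sockets_per_node * cores_per_socket
    used = procs_per_node * threads_per_proc
    # Flag mismatches first, so that an under-subscribed node is never
    # silently treated as one of the standard cases.
    if used > cores_per_node:
        return "over-subscribed"
    if used < cores_per_node:
        return "under-subscribed"
    if procs_per_node == 1:
        return "one process per node"
    if procs_per_node == sockets_per_node:
        return "one process per socket"
    if procs_per_node == cores_per_node:
        return "one process per core"
    return "other"


print(classify_layout(2, 8, 2, 8))   # one process per socket
print(classify_layout(2, 8, 1, 16))  # one process per node
print(classify_layout(2, 8, 16, 1))  # one process per core
print(classify_layout(2, 8, 2, 4))   # under-subscribed
```

A run script would need an affinity choice for each of these return values, which is where the if statements become hairy.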

  10. Erik Schnetter

    I am surprised that the KMP_* option is necessary or beneficial in any case. This sets up affinity via the Intel compiler. The compiler knows nothing about MPI, hence it cannot reasonably distribute threads when there are multiple MPI processes per node.

    SystemTopology can undo all thread affinities. However, since MPI is initialized before SystemTopology runs, it already needs to have the correct socket (but not core) affinities set up on startup. The queueing system can do this, but not the compiler. This is why it is currently important to have the queuing system set up at least socket affinities.

    As the original report speaks of "16 threads", this may be the case where there is 1 MPI process with 16 threads running. If so, I am very surprised that the Intel compiler does not set up good affinities -- as in this case, it has sufficient knowledge to do so. It may be that this option was chosen assuming there is a 1:1 correspondence between sockets and MPI processes?
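
One way to picture the concern about multiple MPI processes per node: if each process places its threads compactly without regard to the mask it was given, every process starts from the same first core. The following Python toy model works under exactly that assumption. It is not a description of what the Intel runtime actually does, and `place_compact` is a hypothetical name used only for illustration.

```python
from collections import Counter


def place_compact(num_procs, threads_per_proc, allowed_cores, respect_mask):
    """Return one core id per thread for all processes on a node.

    allowed_cores[p] is the mask handed to process p (e.g. by the queueing
    system). Toy model: with respect_mask=False each process ignores its own
    mask and fills cores from the lowest core id upwards.
    """
    all_cores = sorted(set().union(*allowed_cores))
    placement = []
    for p in range(num_procs):
        cores = allowed_cores[p] if respect_mask else all_cores
        for t in range(threads_per_proc):
            placement.append(cores[t % len(cores)])
    return placement


masks = [list(range(0, 8)), list(range(8, 16))]  # assumed: one socket per process
ignoring = Counter(place_compact(2, 8, masks, respect_mask=False))
respecting = Counter(place_compact(2, 8, masks, respect_mask=True))
print(len(ignoring), max(ignoring.values()))      # 8 2
print(len(respecting), max(respecting.values()))  # 16 1
```

In this toy model, ignoring the masks puts two threads on each of eight cores and leaves the other eight idle. That would be consistent with each thread showing about 50% in top, but, as the original report says, it does not by itself account for a factor of 8.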

  11. anonymous

    Replying to [comment:4 hinder]:

    I wouldn't have expected that. How much is it reduced? Can you give examples of your speeds in each of the three cases?

    I have data from a simulation that ran approximately 9x slower than baseline--however, I have tried to replicate this result without success, as subsequent runs saw no difference between KMP_AFFINITY being exported in the runscript or not. So for now, I suspect there is some other factor involved, or perhaps some sort of temporary issue with stampede. Either way, the issue seems to be unrelated after all.

  12. Erik Schnetter

    Can someone who observed a slowdown post the setup, i.e. number of MPI processes, threads, nodes, cores, etc.?

  13. Ian Hinder reporter

    The run I originally reported on was run by Seth Hopper. It was using 2 processes per node, each with 8 threads, which is appropriate for Stampede. I believe it was using 96 cores in total (so 6 nodes, 12 processes). We checked the Carpet report of processes and threads, and all was in order. I mentioned that I saw 16 threads in top, because that is the total number of threads; 2 x 8. We were not trying to do anything non-standard. The parameter file was identical to one run previously on Datura with no problems, but it ran more slowly, which was unexpected. When activating hwloc and SystemTopology, or removing the KMP_AFFINITY line, it went faster by a factor of 8.

    I think that SimFactory's machine database should provide reasonable performance by default, and not require people to use hwloc and SystemTopology to avoid an 8-times slowdown. I don't think the current situation is just "suboptimal"; I think it is a bug. What does that KMP_AFFINITY setting do? Do you think it is correct? If it is not feasible to set the affinity properly in simfactory, then I think the best thing is for simfactory to not set it at all, and rely on the system default. The performance may not be optimal, but it shouldn't be 8 times too slow. Then, to get top performance, people can set affinity by activating those thorns (or they can be activated automatically).

  14. Seth Hopper

    Yes, what Ian described about my run is almost exactly correct, except that I was running on 80 cores (5 nodes with 16 cores each) instead of 96. But the processes per node and number of threads are correct. - Seth

  15. Ian Hinder reporter

    Replying to [comment:8 knarf]:

    To summarize the current status: setting KMP_AFFINITY seems to be necessary for performance when using SystemTopology, but is harmful when not using it: either you have to use both, or none. Do I understand this correctly?

    That is not what I observed. From the results that I saw, the only combination which results in slow speeds (factor of 8) is setting KMP_AFFINITY as simfactory sets it, and not using the thorns. This suggests that the thorns are doing the right thing, and overriding whatever the environment variable has set; hence anyone who uses those thorns won't see a problem. It also suggests that the environment variable setting is wrong (not just suboptimal). To debug the problem, we could run hwloc (or is it SystemTopology?) with parameters set to just report the affinity, rather than set it, and see what the environment variable is doing. The documentation for that variable is at https://software.intel.com/en-us/node/522691#AFFINITY_TYPES, but I find it hard to understand:

    type = compact Specifying compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n> OpenMP thread was placed. For example, in a topology map, the nearer a node is to the root, the more significance the node has when sorting the threads.

    modifier = norespect Do not respect original affinity mask for the process. Binds OpenMP threads to all operating system processors. In early versions of the OpenMP run-time library that supported only the physical and logical affinity types, norespect was the default and was not recognized as a modifier. The default was changed to respect when types compact and scatter were added; therefore, thread bindings for the logical and physical affinity types may have changed with the newer compilers in situations where the application specified a partial initial thread affinity mask.

    My initial reading of this is that "norespect" means that threads within a process may run on any OS processor, which I think translates into any physical core, i.e. also any physical processor. But I am not an expert on this variable. Erik, do you know what this setting is supposed to do?

    Note that Michael Clark reported different results, but he says that they were probably not accurate, as he cannot reproduce them now.

    Michael: is it possible that the run script you were using was not the updated one you had modified? Editing the run script in simfactory/mdb/runscripts is not sufficient. It then needs to be added to the Cactus configuration before rerunning. This requires a "sim build <config> --runscript <runscriptname>".
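
For the "just report the affinity" step suggested above, the Linux kernel's view of a running process can also be read from outside Cactus. A minimal Python sketch, assuming a Linux system with `/proc` mounted; `parse_cpu_list` and `thread_affinities` are hypothetical helper names, and the code only reads the `Cpus_allowed_list` field of each thread's status file:

```python
import os


def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_affinities(pid):
    """Map thread id -> allowed CPUs for process `pid`, read from /proc (Linux only)."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/status") as f:
            for line in f:
                if line.startswith("Cpus_allowed_list:"):
                    result[int(tid)] = parse_cpu_list(line.split(":", 1)[1])
    return result


print(parse_cpu_list("0-3,8"))  # [0, 1, 2, 3, 8]
```

Calling `thread_affinities(pid)` for each Cactus process on a node, once with and once without the KMP_AFFINITY line, would show whether the two processes end up with overlapping CPU sets.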

  16. Erik Schnetter

    The description of KMP_AFFINITY was clearly written by a professional documentation writer who knows how to write a lot of text while introducing sufficient ambiguity to not even be wrong.

    As I mentioned above, setting this variable (when not using SystemTopology) is probably wrong unless there is only one MPI process per node. Please read my description above, and say what is unclear if you don't understand it; I won't repeat it here.

    Yes, we all agree that the default options should be better. There are two options:

    (1) Examine all possible "reasonable" configurations (MPI processes per node, OpenMP threads per process), and ensure that the new settings work fine in all cases.

    (2) Examine only a subset of configurations, and ensure that things don't change except for these configurations, using appropriate if-statements in the submit script.

    I mentioned earlier that I am regularly running benchmarks on a variety of systems. Of course I am using SystemTopology, so I didn't notice this problem. I usually run multiple benchmarks for multiple configurations (MPI processes / OpenMP threads) to find the optimum configuration.

  17. anonymous

    Replying to [comment:16 hinder]:

    Note that Michael Clark reported different results, but he says that they were probably not accurate, as he cannot reproduce them now.

    Michael: is it possible that the run script you were using was not the updated one you had modified? Editing the run script in simfactory/mdb/runscripts is not sufficient. It then needs to be added to the Cactus configuration before rerunning. This requires a "sim build <config> --runscript <runscriptname>".

    Short version: I'm aware this is required to change the runscript for a configuration. I double-checked the simulation directories to make sure that the runscripts were correct.

    Longer version: I ran a simulation "runA" with executable "exeA", and "runB" with executable "exeB". These executables used the same optionlist (default), thornlist (containing hwloc and SystemTopology), and submitscript (default), and the runscripts differed by one having an additional, commented out line. The simulations used the same parameter file that has both hwloc and SystemTopology. Both runscripts had the "export KMP_AFFINITY..." line commented out as well. I ran "runA" on Monday, and it ran 9-10x slower than baseline, leading me to make my previous comment. I ran "runB" yesterday, however, and I saw baseline performance.

    For good measure, I performed a few other tests: I reconfigured with identical runscripts, and I also used the executable exeA to perform the same simulation runA again, without recovery, to see if that executable was still slow. I found it ran yesterday at the same (high) speed as baseline.

    Some misc notes on obstacles to these tests: I found it inconvenient that... (a) the configuration has to be rebuilt merely to change the default runscript, as in particular this means the resulting executables are different, despite having the same optionlist and thornlist (different in the sense of having different md5 hashes). This is why I went through with redoing the simulation runA with exeA. I suspect the executables are different because, in part, they have information about the date of compilation that is printed at the beginning of a run.

    I think having a command line option to provide the runscript would be convenient, albeit unlikely to be used once performance considerations have been resolved. Moreover, you can provide --runscript to simfactory's create-submit command, and simfactory will silently ignore this. (Thankfully, this wasted only 5 minutes of my time.)

    (b) The option --norecover flatly did not work for rerunning a simulation from the beginning, despite being advertised in simfactory as the default. I had to manually delete checkpoint directories to perform the simulation again in the same simulation directory.

    So as to what could have caused the discrepancy I originally observed? I cannot say with any certainty. As far as something under my control, I considered whether this had to do with envsetup or module loads. I have often in the past run with envsetup set to "sleep 0", but I have not at any point observed a performance impact of changing envsetup on runs using mvapich2. I tested this with intel MPI module loaded as default as well as with intel MPI loaded in envsetup; neither had any performance impact.

    That is all I have at this time.

  18. David Radice

    I recently stumbled upon this issue while running on Stampede with ET_2015_11. I do not use simfactory, I am not setting the KMP_AFFINITY environment variable, and I am using OpenMPI instead of mvapich.

    I should also mention that I do not see this problem with ET_2014_11, even when using the same optionlist, thornlist, and runscripts. I tried running both with and without the "tacc_affinity" wrapper that TACC recommends to use for hybrid OpenMP/MPI jobs and I see no difference. The only solution is to activate the "SystemTopology" thorn.

    I do not think that this issue should be really considered as a simfactory bug. The issue is clearly somewhere else. Would it make sense to have SystemTopology always active? Is there a reason why I would want to run without it?

  19. Erik Schnetter

    SystemTopology's default behaviour is harmful if you run multiple independent simulations on the same node, e.g. if you are using a personal workstation. In this case, the two SystemTopology instances don't know about each other, and will both choose the same cores, reducing performance by 1/2. There is no good way around this -- if a simulation doesn't "own" a node, then something else with more knowledge (e.g. the kernel) needs to schedule thread/core assignments.

    Otherwise, SystemTopology should always be beneficial, or at least not harmful.

  20. Ian Hinder reporter

    David,

    Note that the functionality of SystemTopology used to be present in hwloc, and Carpet used to activate hwloc automatically. In your ET_2014_11 runs, can you check to see if hwloc is automatically activated? This would explain what you observed: the old runs used hwloc, and the new runs didn't. Activating SystemTopology restores the old behaviour.

    What was the ratio of speeds between the two runs?

  21. Ian Hinder reporter

    The tacc_affinity script is recommended for stampede (https://portal.xsede.org/tacc-stampede). It is supposed to guarantee that the processes are distributed among sockets, as well as the memory they allocate. We already use this in the simfactory run script. This is the only place that this can be done, because once Cactus has started, MPI has already initialised itself, and may have allocated memory on the wrong socket if the process has not yet been pinned to a socket.

    Erik has indicated that the KMP_AFFINITY setting is likely only correct when you have only a single process per node, and using this variable cannot be correct when you have more than one process per node, because the compiler, which interprets this variable, does not know about the additional processes. I observed that removing the setting of this variable eliminated the performance problem that I saw when not using SystemTopology. I therefore propose that the setting of this variable is removed from the run script.

  22. David Radice

    Ian, Erik,

    hwloc is active in my ET_2014_11 runs, so this explains the difference. The two runs I mentioned before (with ET_2014_11 and ET_2015_05) are too different to meaningfully compare timers: I only verified that the ET_2014_11 was correctly using all CPU cores, while the ET_2015_05 without SystemTopology was placing all processes on the first socket.

    It seems to me that activating SystemTopology should be the default behavior, because running on a workstation is not a common use case. This was also the old behavior of hwloc, so migrating from ET_2014_11 to ET_2015_05 breaks many production setups. When running on workstations, I used to remove hwloc from my thornlist exactly to avoid the problem Erik mentions above.

  23. Frank Löffler

    Replying to [comment:24 dradice@…]:

    It seems to me that activating SystemTopology should be the default behavior, because running on a workstation is not a common use case.

    I would disagree with that. Running on a workstation is probably one of the first things new users do, and this has to work out of the box at reasonable speed. For production, of course, that is not that common.

    Is there a reason why activating SystemTopology couldn't also work correctly on a workstation? Or why not using it would be that bad on a supercomputer (assuming KMP_AFFINITY isn't set)?

    I am not against using SystemTopology, or using it as default, but Cactus/simfactory should also make reasonable choices without it, on both regular workstations and supercomputers.

  24. David Radice

    Replying to [comment:26 knarf]:

    Replying to [comment:24 dradice@…]:

    It seems to me that activating SystemTopology should be the default behavior, because running on a workstation is not a common use case.

    I would disagree with that. Running on a workstation is probably one of the first things new users do, and this has to work out of the box at reasonable speed. For production, of course, that is not that common.

    Activating SystemTopology would only be harmful if the user is running multiple independent Cactus simulations on the same workstation. In that case, SystemTopology would place all Cactus instances on the same cores and degrade performance significantly.

  25. Frank Löffler

    Replying to [comment:27 dradice@…]:

    Activating SystemTopology would only be harmful if the user is running multiple independent Cactus simulations on the same workstation. In that case, SystemTopology would place all Cactus instances on the same core and degrade performances significantly.

    That is the case, for example, when running testsuites in parallel. I don't do that often, but we did develop a script for it once. What would happen if SystemTopology isn't used in that case? The OS would distribute the processes, wouldn't it?

    Would it be reasonable to let SystemTopology (by default) not enforce pinning when using "less than one full node"? Would it even be possible to always detect that situation correctly? Would not using SystemTopology in that case mean a performance hit? If so, how big would it probably be?

  26. Ian Hinder reporter

    I think the question of whether SystemTopology should be activated automatically is distracting from the important aspect of this ticket. I have proposed removing the KMP_AFFINITY line from the simfactory run script, which would directly solve the problem here. That line is also wrong, and useless for achieving what needs to be done. Can I suggest that the question of activating SystemTopology automatically be moved to another ticket?

  27. David Radice

    Ian, yes please move the comments on SystemTopology and sorry for hijacking your ticket!

  28. Frank Löffler

    Replying to [comment:29 hinder]:

    I think the question of whether SystemTopology should be activated automatically is distracting from the important aspect of this ticket. I have proposed removing the KMP_AFFINITY line from the simfactory run script, which would directly solve the problem here. That line is also wrong, and useless for achieving what needs to be done. Can I suggest that the question of activating SystemTopology automatically be moved to another ticket?

    You are right. Unless someone else speaks against removing KMP_AFFINITY within a few days, please go ahead and do so. The only reported case of bad performance when leaving it out and using hwloc/SystemTopology couldn't be reproduced.

  29. Ian Hinder reporter

    Replying to [comment:30 dradice@…]:

    Ian, yes please move the comments on SystemTopology and sorry for hijacking your ticket!

    No problem - you were not the first person to raise the issue of activating SystemTopology automatically. I am not a proponent of that course of action (it cannot handle the machine-dependent case of socket affinity correctly, as this must be done before MPI_Init, and the core affinity is balanced automatically by the OS, giving a suboptimal but not terrible performance, and it is confusing when more than one simulation runs on a node), so if someone thinks it is a good idea, please create a ticket and include why you think this should be done.

  30. Erik Schnetter

    It seems the correct solution is to use either tacc_affinity or numactl in the run script, and to NOT set KMP_AFFINITY. Alternatively, letting Slurm define the affinity and doing nothing else should also work, if Slurm is set up correctly, which it likely is.
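
Since the run script that matters is the copy stored with the configuration or simulation, it can help to check that copy for an active KMP_AFFINITY setting. A minimal Python sketch, assuming a hypothetical helper `active_kmp_affinity_lines`; it only detects plain `KMP_AFFINITY=...` and `export KMP_AFFINITY=...` lines, and would miss the variable being set in any other way (for example in the submit script or the user's shell startup files):

```python
import re

# Matches an uncommented assignment or export of KMP_AFFINITY at the start of a line.
_ACTIVE = re.compile(r"^\s*(export\s+)?KMP_AFFINITY=")


def active_kmp_affinity_lines(script_text):
    """Return (line_number, line) pairs where KMP_AFFINITY is set and not commented out."""
    return [
        (n, line)
        for n, line in enumerate(script_text.splitlines(), start=1)
        if _ACTIVE.match(line)
    ]


example = (
    "#! /bin/bash\n"
    "export KMP_AFFINITY=norespect,compact # verbose\n"
    "# export KMP_AFFINITY=compact\n"
)
print(active_kmp_affinity_lines(example))
# [(2, 'export KMP_AFFINITY=norespect,compact # verbose')]
```

An empty result for the run script found in the simulation directory would confirm that the edited script, rather than the original one, was picked up.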
