SimFactory should detect the number of cores automatically

Issue #2059 resolved
Ian Hinder
created an issue

SimFactory should detect the number of cores automatically. See pull request by Mikael Sahrling: https://bitbucket.org/simfactory/simfactory2/pull-requests/20/simdtpy-detect-number-of-cpus/diff.

Comments (30)

  1. Roland Haas

    This outputs a warning from popen, "sh: 1: lscpu: not found", which is too low-level. Its warning output is also bracketed using empty warning("") calls; this should not be done, since the empty warnings are logged. It uses the "Core(s) per socket" value to set ppn, which is incorrect since ppn is the number of cores per compute node (not per NUMA node, i.e. socket).

    It would also be useful to set the default number of threads to the number of cores per NUMA domain. lscpu is not quite consistent in its naming, eg on a BW compute node (2 physical sockets, 4 numa domains, 32 cores eg in /proc/cpuinfo) it gives:

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                32
    On-line CPU(s) list:   0-31
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             2
    NUMA node(s):          4
    Vendor ID:             AuthenticAMD
    CPU family:            21
    Model:                 1
    Stepping:              2
    CPU MHz:               2300.000
    BogoMIPS:              4599.95
    Virtualization:        AMD-V
    L1d cache:             16K
    L1i cache:             64K
    L2 cache:              2048K
    L3 cache:              6144K
    NUMA node0 CPU(s):     0-7
    NUMA node1 CPU(s):     8-15
    NUMA node2 CPU(s):     16-23
    NUMA node3 CPU(s):     24-31
    

    i.e. "Core(s) per socket" is the number of cores per NUMA domain, but "Core(s) per socket" * "Socket(s)" is not equal to "CPU(s)". Instead one has to take the "Thread(s) per core" number into account (which makes sense: these CPUs use a kind of hyperthreading).

    So ppn is set incorrectly, num-threads should be set as well, and hyper-threaded CPUs have to be handled.
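    To make the arithmetic concrete, here is a minimal sketch (hypothetical helper names, not the code from the pull request) deriving the totals from lscpu's fields:

```python
# Hypothetical sketch: derive CPU counts from lscpu's "Key: value" output.
def parse_lscpu(text):
    """Parse "Key: value" lines from lscpu output into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def cpu_counts(fields):
    threads_per_core = int(fields["Thread(s) per core"])
    cores_per_socket = int(fields["Core(s) per socket"])
    sockets = int(fields["Socket(s)"])
    # Total hardware threads; this is what lscpu reports as "CPU(s)".
    total_cpus = threads_per_core * cores_per_socket * sockets
    # "Physical" cores: on BW this gives 16, which is not a sensible value
    # there for any of the core-related settings (8 or 32 would be).
    physical_cores = cores_per_socket * sockets
    return total_cpus, physical_cores
```

    For the BW node above, 2 * 8 * 2 = 32 matches "CPU(s)", while 8 * 2 = 16 does not.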

  2. Roland Haas

    Replying to [comment:6 sbrandt]:

    "Roland, can you add a comment about why the fix was deleted?"

    Yes, sorry, I had not included the text in the change of status; I was still writing it at that time.

  3. Roland Haas

    This is a new branch, yes, not the one from the pull request (https://bitbucket.org/simfactory/simfactory2/pull-requests/20/simdtpy-detect-number-of-cpus/diff)?

    Comments:

    • One should not assume that lscpu is in /usr/bin/lscpu. Instead one should just make sure that output to stderr is captured as well as output to stdout, and include both in the logged warning if there is an error. This likely needs more than just os.popen (see https://docs.python.org/2/library/subprocess.html#popen-constructor).
    • NumberOfCores is still incorrect: e.g. on BW it should be 32, but the math in the patch gives 16 (and 16 is not a sensible number on BW for any of the "core"-related settings; 8 or 32 are).
    • NumberOfThreads is the correct number (based on how it is used later) but somewhat confusingly named, as it is the max-num-threads value (identical to ppn).
    • num-threads is set to NumberOfCores (also an odd name: cores per what?), which is computed as 16; that is wrong e.g. on BW, where it should be 8.
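    A sketch of the stderr-capturing approach suggested above (illustrative only; simfactory's own simlib.ExecuteCommand would be preferred where a machine definition is available):

```python
import subprocess

# Run a command capturing both stdout and stderr, so that errors like
# "sh: 1: lscpu: not found" can go into a logged warning instead of
# leaking to the terminal.  Illustrative sketch, not the committed code.
def run_command(cmd):
    try:
        proc = subprocess.Popen(cmd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except OSError as e:
        # Command not found (or not executable): report it, don't print it.
        return None, str(e), -1
    out, err = proc.communicate()
    return out, err, proc.returncode
```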

    Note that I am using BW as an example since it has a complex NUMA layout and has hyperthreading support as well (extra wrinkles).

  4. Steven R. Brandt

    Sorry for using a new branch. I thought it was the old one based on the name.

    Please give the formulas you would like to use for ppn, num-thread, max-num-threads. Thanks.

  5. Roland Haas

    I agree that we do not have to worry about the clusters and really only used the BW cpus as a particular annoying example.

    Unfortunately it is not much better on my workstation (the one physically in my office):

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                24
    On-line CPU(s) list:   0-23
    Thread(s) per core:    2
    Core(s) per socket:    6
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 44
    Model name:            Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
    Stepping:              2
    CPU MHz:               1596.000
    CPU max MHz:           2661.0000
    CPU min MHz:           1596.0000
    BogoMIPS:              5320.33
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              12288K
    NUMA node0 CPU(s):     0-5,12-17
    NUMA node1 CPU(s):     6-11,18-23
    

    or even on my laptop (thinkpad EXXXX, Core i7, from last year)

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Thread(s) per core:    2
    Core(s) per socket:    2
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 78
    Model name:            Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
    Stepping:              3
    CPU MHz:               641.271
    CPU max MHz:           3100.0000
    CPU min MHz:           400.0000
    BogoMIPS:              5184.00
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              4096K
    NUMA node0 CPU(s):     0-3
    

    So my workstation is claimed to have 6 cores per socket, 2 sockets and 24 CPUs (hyperthreading), so this does happen on (somewhat regular) workstations, not just on clusters. Same for my laptop. I am not making these up :-).

    Admittedly both on my workstation and my laptop I tend to ignore the hyperthreading cores (but not on BW) though LoopControl does have explicit support for hyperthreading so Cactus would support it.

  6. Steven R. Brandt

    Roland, please tell me what you want me to do. Should I set ppn=max-num-threads=num-threads=total hyperthreads? Normally, I prefer #cores for this.

    It is my belief that even if these numbers are not perfect, they are better than putting "1" (which blocks parallel make, etc.).

  7. Roland Haas

    Hello Steve,

    no problem about the branch, just so long as I looked at the correct one.

    Sorry, got sidetracked. I guess, given that we are more likely to find an i5 or so in a laptop than an AMD Istanbul processor, and that we normally ignore hyperthreading cores, one should after all not take BW as the example but rather the laptop.

    So maybe the thing to do would be to take "Core(s) per socket" as the "num-threads" value which should be a useful number of threads per NUMA domain, then make ppn equal to "CPU(s)" and also make max-num-threads equal to that number.

    This will mean that simfactory by default will use "Core(s) per socket" threads per MPI rank and allow you to use all the cores on a machine (either by spawning enough MPI ranks which is limited by ppn or by setting --num-threads to max-num-threads).

    In the code this would be achieved by:

            NumberOfCores = CoresPerSocket
            NumberOfThreads = ThreadsPerCore*CoresPerSocket*Sockets
    

    which are still confusing names.
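    For concreteness, applying those formulae to the machines quoted in this thread (a sketch; variable names as in the snippet above):

```python
# (ThreadsPerCore, CoresPerSocket, Sockets) as reported by lscpu above.
machines = {
    "BW node":     (2, 8, 2),
    "workstation": (2, 6, 2),
    "laptop":      (2, 2, 1),
}

for name, (tpc, cps, sockets) in machines.items():
    number_of_cores = cps                    # -> num-threads (threads per MPI rank)
    number_of_threads = tpc * cps * sockets  # -> ppn / max-num-threads
    print("%-11s num-threads=%d ppn=%d" % (name, number_of_cores, number_of_threads))
```

    This yields num-threads=8, ppn=32 for BW; 6 and 24 for the workstation; 2 and 4 for the laptop.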

    I pushed two changes into your branch. One that implements those formulae and one that uses subprocess.Popen (available since python 2.4 so safe to use) instead of os.popen to capture both stdout and stderr. I cannot use simfactory's simlib.ExecuteCommand (which would be preferred) since that one requires a valid machine definition file.

  8. Steven R. Brandt

    Question: Did you intend to leave "/usr/bin/lscpu" as is, or did you intend to replace it with "lscpu"?

    Also, I'm not sure what happened, but we agreed to use the code below in bin/sim in ticket #2058. Somehow it isn't what I checked in; the lower version bound is missing. If we are to use subprocess.Popen, we need to make the minimum version 2.4 instead of 2.3. Version 2.3 was suggested because of QueenBee, but that has 2.6 now. I updated the branch to include the minimum version and set it to 2.4.

    # Forward the call to the first suitable python (>= 2.3, < 3.0) found
    for PYEXE in python python2
    do
        $PYEXE - > /dev/null 2>&1 << EOF
    import sys
    if sys.hexversion >= 0x2030000 and sys.hexversion < 0x3000000:
        sys.exit(0)
    else:
        sys.exit(1)
    EOF
        if [ $? = 0 ]
        then
            # exec replaces this shell, so no explicit break is needed
            exec $PYEXE "$cmd" "$@"
        fi
    done
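    (For reference, a sketch of how sys.hexversion encodes the version, assuming only the documented format: the bytes are major, minor, micro, then release level/serial, with 0xf0 marking a final release. So 2.4.0 final is 0x020400f0, and "at least 2.4" is hexversion >= 0x2040000.)

```python
import sys

# sys.hexversion packs (major, minor, micro, releaselevel/serial)
# into one integer; e.g. 2.4.0 final == 0x020400f0.
def decode_hexversion(hv):
    return (hv >> 24) & 0xff, (hv >> 16) & 0xff, (hv >> 8) & 0xff
```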
    
  9. Roland Haas

    Well, I had already complained about it. I think that the best one would be "lscpu" and not "/usr/bin/lscpu". All systems where it existed (various Linux laptops and workstations, BW, stampede2, my macOS laptop) have it in /usr/bin/lscpu.

    So "yes, please change to just 'lscpu'".

    For versions: ah, right. Hmm, everything has 2.6 by now, which is why I had in mind that we require python 2.6; I don't know whether we are already accidentally using something from 2.6 somewhere else.

    I don't have python 2.3 anymore to test this, though looking at the changes from 2.3 to 2.6 it seems that (with the exception of one line in sim-distribute.py) the code was still python 2.3 compliant. So feel free to revert my last patch. One can achieve (hopefully) the same thing using the older (now deprecated) popen2.popen3 function (https://docs.python.org/2.6/library/popen2.html), though of course, it being deprecated, I expect to eventually have to provide a wrapper around subprocess.Popen that emulates popen3.
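    Such a wrapper could look like this (a sketch only, assuming popen2.popen3's documented return order of child_stdout, child_stdin, child_stderr):

```python
import subprocess

# Emulate the deprecated popen2.popen3() on top of subprocess.Popen.
# Note popen2's odd return order: (child_stdout, child_stdin, child_stderr).
def popen3(cmd):
    proc = subprocess.Popen(cmd,
                            shell=isinstance(cmd, str),  # popen2 used the shell for strings
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    return proc.stdout, proc.stdin, proc.stderr
```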

  10. Erik Schnetter

    In general, MPI parallelization is more efficient than multi-threading. The reverse is only true once we hit scalability limits. Thus there is an argument to be made to use only MPI for single-node runs, and always set num-threads=1.

  11. Roland Haas

    True, a very good point. I attach output files to back up the stated speed improvement for 12 MPI ranks and 1 thread vs. 2 MPI ranks and 6 threads on my workstation. The speed difference is (for static_tov.par) 200 Msun/hr vs. 250 Msun/hr, so a ~25% speedup. I also tested 12 threads (hyperthreading), which (keeping the parfile unchanged) makes it much slower: 10 M/hr.

    The question is then: do we want this? I.e., should simfactory's auto-detect logic set up an "unusual" configuration that differs from what is typically used on clusters? This is mostly interesting because (as I assume will happen) when setting up a new cluster, people will start from "sim setup" on the login node. Typically I am not concerned with speed on individual workstations; reproducibility and debug-ability are much more important to me on my workstation (since I use it to debug and test code, but not to run simulations).

    Testsuites may be an issue (namely that we want both multiple MPI ranks and multiple threads in a testsuite run), but those can always be enforced by passing --num-threads and --procs options to simfactory.

    So MPI-only gives more speed (very nice). It is less "common" on clusters, which may be an issue if "sim setup" is used as a base for cluster installations (though it will have to be modified anyway).

    I tend to prefer having something closer to a cluster setup, but I do see the advantages of faster evolutions, so I could certainly live with it, in particular for actual laptops/workstations. Is there anyone who cannot live with either option?

  12. Steven R. Brandt

    I've pushed an update to the branch that uses simenv.popen rather than subprocess. It seems to me that Simfactory was designed around the use of this function, and it doesn't make sense to do things differently in one place. If one wants to introduce subprocess, it should be implemented inside simenv.

    As for the wrangling over settings. Getting ppn and max-num-threads right is a win, IMHO, as having those wrong blocks you from doing parallel make, etc. I'm happy with num-threads being whatever.

  13. Roland Haas

    Steve: please apply (the version where you used popen again, since subprocess is too "new", i.e. past python 2.3). My suggestion is still to use as many MPI ranks as there are NUMA domains, with threads inside each, even if this is slower, since this gives a setup more similar to what we usually have on clusters, so that one can e.g. usefully debug OpenMP code. Speed is reduced by that, but speed on a workstation should not be an issue, I believe.

    I would suggest to set a deadline by which comments need to be in then commit this.

    Since we are unlikely to reach consensus (https://tools.ietf.org/html/rfc7282#section-2) in the sense that we all agree on one solution: I would ''prefer'' one I have outlined above but do not ''object'' to the one that uses pure MPI (and that Erik seems to favor).

  14. Ian Hinder reporter

    There are a number of other machine definition entries which this branch is not yet setting:

    • spn: Sockets per node; this can be obtained from lscpu's "Socket(s)" line.
    • max-num-smt: probably this should be the maximum number of hyperthreads per core on the machine. This can be obtained from lscpu's "Thread(s) per core" line.
    • num-smt: probably the default number of hyperthreads to use (we probably want to set this to 1, or leave it out, since it defaults to 1 in simfactory/etc/syntax/mdb-syntax.ini).

    In this branch, we are setting the machine definition entries

    • ppn
    • num-threads
    • max-num-threads

    However, I find that I don't actually know what all of these mean, in the context of the machine definition.

    • num-threads: When running a simulation, num-threads is the number of OpenMP threads to use on each MPI process. In a machine definition, it is the default, i.e. suggested, value which will be used when running simulations unless overridden by the user. I would probably set it to CoresPerSocket * (num-smt), even if it is slightly less efficient than using pure MPI, for simplicity.
    • max-num-threads: This is presumably the maximum value that SimFactory will allow you to use for num-threads when running a simulation. Is this then supposed to stop you from oversubscribing by default? Is this a property of the hardware, i.e. num-smt * CoresPerSocket * spn, or is it a suggested maximum, for machines where you might want to oversubscribe?
    • ppn: I am not sure precisely what this means. In simfactory/etc/syntax/mdb-syntax.ini, this is described as "processors (cores) per node times number of hyperthreads", meaning CoresPerSocket * spn * max-num-smt, suggesting that this is a property of the hardware, but it might also be the suggested/default for the --ppn simulation option. The user can specify --ppn when running a simulation, and there are min-ppn and max-ppn machine entries. So maybe max-ppn is the property of the hardware? Or maybe it is a recommended maximum?

    There are also min-ppn and max-ppn.

    Erik: would you mind clarifying the meaning of all these entries in the machine definition, so that we can be clearer about how to work out what to set them to? Specifically, I would like to know which of these are absolute properties of the hardware, and which are suggested values for runtime simulation properties.
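    For concreteness, here is how a generated machine entry might look for the laptop quoted above (1 socket, 2 cores/socket, 2 threads/core), under one reading of these definitions; the values are illustrative, not what the branch currently writes:

```ini
[mylaptop]
ppn             = 4   # hardware threads per node (2 * 2 * 1)
spn             = 1   # sockets per node
max-num-smt     = 2   # hyperthreads per core
num-smt         = 1   # default: ignore hyperthreading
num-threads     = 2   # default OpenMP threads per MPI process
max-num-threads = 4
```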

  15. Roland Haas

    Thus on OSX the auto-generated machine.ini file always claims {{{ppn=1}}}. Instead one seems to be able to query sysctl (see https://coolaj86.com/articles/get-a-count-of-cpu-cores-on-linux-and-os-x/):
    {{{
    sysctl -n hw.ncpu
    8

    sysctl -n hw.physicalcpu
    4

    sysctl -n hw.logicalcpu
    8
    }}}

    or just sysctl hw, and something like sysctl hw.packages for the number of sockets.

    Note that the current implementation is buggy, since it uses lscpu's "Core(s) per socket" value for ppn, but ppn should be the number of cores per node (so this is wrong on a dual-socket workstation).
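    A sketch of querying these sysctl keys from Python (the key names hw.ncpu, hw.physicalcpu, hw.logicalcpu and hw.packages come from the linked article; the query function is injectable so the logic can be exercised off macOS):

```python
import subprocess

def sysctl_int(name):
    # Query one sysctl value; only meaningful on Darwin.
    return int(subprocess.check_output(["sysctl", "-n", name]).strip())

def detect_darwin_cpus(query=sysctl_int):
    # Illustrative sketch, not the committed implementation.
    return {
        "logical":  query("hw.logicalcpu"),   # includes hyperthreads
        "physical": query("hw.physicalcpu"),  # physical cores
        "sockets":  query("hw.packages"),
    }
```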

  16. Ian Hinder reporter

    I fixed the cores-per-socket issue, and added support for Mac OS. Pull requests approved by Roland. Ignored hyperthreading for now.

  17. Roland Haas

    Is there anything left to do here? Seems to me that 7bfd423 "simdt.py: Add Mac OS support for detecting CPU properties" provides a working solution.
