Simfactory job parameters are not consistent

Issue #2483 new
Steven R. Brandt created an issue

According to simlib.py:

NUM_THREADS = threads per mpi proc (thread/mpi_proc)

PPN is supposed to be the number of processors, or cores, requested from the scheduler per node. (core/node)

PPN_USED is supposed to be the number of cores actually used per node. (core/node)

NUM_SMT is supposed to be threads per core, and has a value of either 1 or 2 on all machines. (thread/core)

Thus

NODE_PROCS := PPN_USED * NUM_SMT / NUM_THREADS

This follows since: NODE_PROCS = (cores/node)*(threads/core)/(threads/mpi proc) = mpi procs/node
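
The unit bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this ticket, not the code in simlib.py; the function name and the sample values are hypothetical.

```python
def node_procs(ppn_used: int, num_smt: int, num_threads: int) -> int:
    """MPI processes per node: (cores/node) * (threads/core) / (threads/mpi proc)."""
    hw_threads_per_node = ppn_used * num_smt  # threads/node
    if hw_threads_per_node % num_threads != 0:
        raise ValueError("threads per node must be divisible by NUM_THREADS")
    return hw_threads_per_node // num_threads  # mpi procs/node


# Illustrative values only: 16 cores used per node, 2 threads per core,
# 4 threads per MPI process.
print(node_procs(ppn_used=16, num_smt=2, num_threads=4))  # 8
```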

Now here’s the problem.

NUM_PROCS = PROCS / NUM_THREADS

Now, --procs and --cores are two options for the same thing in simfactory. Thus “procs” means “processors” while “num_procs” means “number of processes.” That’s confusing, but it’s not the problem this ticket is about.

NUM_PROCS is supposed to be the number of MPI processes. However, since --procs and --cores are the same thing:

NUM_PROCS = CORES / NUM_THREADS

= cores / (threads / mpi proc)

This is inconsistent. One would expect:

NUM_PROCS = NUM_SMT*CORES/NUM_THREADS

= (threads/core)*cores/(threads/mpi proc).

What if we define NUM_THREADS as cores/mpi proc? Well, apart from being confusing, that makes the NODE_PROCS calculation wrong.

So, unless I’m missing something, these parameters are not consistent, regardless of how you define them. They only work if NUM_SMT is one and cores and threads are interchangeable.
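
To make the mismatch concrete, here is a small sketch that compares the total number of MPI processes implied by NODE_PROCS with the current and the expected NUM_PROCS formulas. The function names and the sample job (2 nodes, 16 cores used per node, 4 threads per MPI process) are made up for illustration and do not come from simlib.py.

```python
def node_procs(ppn_used, num_smt, num_threads):
    # (cores/node) * (threads/core) / (threads/mpi proc) = mpi procs/node
    return ppn_used * num_smt // num_threads


def num_procs_current(cores, num_threads):
    # Formula as currently described: cores / (threads/mpi proc)
    return cores // num_threads


def num_procs_expected(cores, num_smt, num_threads):
    # cores * (threads/core) / (threads/mpi proc) = mpi procs
    return cores * num_smt // num_threads


nodes, ppn_used, num_threads = 2, 16, 4  # hypothetical job
cores = nodes * ppn_used

for num_smt in (1, 2):
    from_node_procs = nodes * node_procs(ppn_used, num_smt, num_threads)
    current = num_procs_current(cores, num_threads)
    expected = num_procs_expected(cores, num_smt, num_threads)
    print(num_smt, from_node_procs, current, expected)
```

With NUM_SMT = 1 all three numbers agree (8). With NUM_SMT = 2 the per-node calculation and the expected formula both give 16, while the current formula still gives 8.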

Is that always true?

The following machines have max-num-smt = 2: bethe, cori, philip, and supermucng. Looking at simlib.py, this parameter is never accessed! Instead, simlib.py only attempts to get ‘num-smt’, a parameter no ini file ever sets. Thus, NUM_SMT is, essentially, always 1.
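
The effect of the key mismatch can be reproduced with Python's standard configparser. This is a sketch under stated assumptions: the machine section is hypothetical, and the fallback-to-1 lookup is an assumption about how the value ends up as 1, not a copy of what simlib.py does. Only the two key names come from this ticket.

```python
import configparser

# Hypothetical machine entry; only the key names are taken from the ticket.
INI_TEXT = """
[examplemachine]
max-num-smt = 2
"""

cfg = configparser.ConfigParser()
cfg.read_string(INI_TEXT)

# Assumed lookup behavior: ask for 'num-smt' and fall back to 1 when absent.
num_smt = cfg.getint("examplemachine", "num-smt", fallback=1)
max_num_smt = cfg.getint("examplemachine", "max-num-smt", fallback=1)

print(num_smt, max_num_smt)  # 1 2
```

Because the ini file sets only max-num-smt, the lookup for num-smt never sees the value 2.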

What to do?

My suggestion is that the definition of NUM_PROCS be amended to be

NUM_PROCS = CORES * NUM_SMT / NUM_THREADS

so that the units work out as cores*(threads/core)/(threads/mpi_proc) = mpi procs.

I also suggest that the feature be tried out on one of the four machines above by changing max-num-smt to num-smt (note, however, that philip no longer exists).

Comments (2)

  1. Steven R. Brandt reporter

    I guess this also “works” if --cores and --procs both really mean total threads. In that case, I’m not sure which name is more misleading. We could mark both as deprecated and allow a --total-threads option. That would be the least invasive change.
