Slurm: converting node list (wrong)
I'm running amp on a cluster with the Slurm queuing system, and it returns the node list in a different format than amp anticipates. This is of course only a problem if you want to run training in parallel over several nodes. In utilities.py it seems that amp expects os.environ['SLURM_NODELIST'] to return a string formatted like 'node[572,578]'. This is not always the case on our system. If the nodes come in a sequence, it instead returns the equivalent of 'node[572-578]'. The conversion into the correct dictionary form then fails, and Pexpect fails to communicate with the appropriate nodes.
I found a solution, as apparently other people need the same explicit format for the nodes. There is a Python package made to expand Slurm's (and other programs') node lists. It is called “hostlist” and should take care of many different notations. You can find it here: https://www.nsc.liu.se/~kent/python-hostlist/. I pip-installed it and made a few changes in utilities.py:
77,78c77,83
< import hostlist
< nodes = hostlist.expand_hostlist(nodes)
---
> if '[' in nodes:
>     # Formatted funny like 'node[572,578]'.
>     prename, numbers = nodes.split('[')
>     numbers = numbers[:-1].split(',')
>     nodes = [prename + _ for _ in numbers]
> else:
>     nodes = nodes.split(',')
It is of course not so nice that you need to install this as well, but it seems to work for the different cases I could come up with to try (a one-node job, nodes in sequence, and nodes not in sequence).
There might be a better solution. I don't know if other people have problems with this?
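One way to soften the extra dependency is sketched below, under stated assumptions: try to import hostlist and, if it is not installed, fall back to a small built-in expander that also understands ranges such as 'node[572-578]'. The names `expand_simple`, `expand_nodelist` and `_split_top_level` are hypothetical and not part of amp; the fallback only handles one bracket group per entry, so it is not a replacement for hostlist's full notation support.

```python
import re


def _split_top_level(nodelist):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ''
    for ch in nodelist:
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(current)
            current = ''
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_simple(nodelist):
    """Hypothetical fallback: expand 'node[572-574,578]' or 'abc1,def2'."""
    hosts = []
    for part in _split_top_level(nodelist):
        match = re.fullmatch(r'([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)', part)
        if match is None:
            # Plain host name without brackets.
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(','):
            if '-' in item:
                low, high = item.split('-')
                width = len(low)  # keep zero padding, e.g. '008-010'
                for n in range(int(low), int(high) + 1):
                    hosts.append('%s%0*d%s' % (prefix, width, n, suffix))
            else:
                hosts.append(prefix + item + suffix)
    return hosts


def expand_nodelist(nodelist):
    """Use hostlist when available, otherwise the simple fallback."""
    try:
        import hostlist  # optional dependency (python-hostlist)
    except ImportError:
        return expand_simple(nodelist)
    return hostlist.expand_hostlist(nodelist)
```

With this shape, a job on sequential nodes would go through the same code path whether or not hostlist is installed; only unusual notations would need the package.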
Comments (10)
-
repo owner Let's make an optional dependency on hostlist. That is, within the slurm clause of amp.utilities.assign_cores we can put a try clause that tries to import, then use, the hostlist package. If hostlist is not present, it goes back to our old behavior. Also, if parsing fails without hostlist installed, it should provide feedback that the user may want to install hostlist.
-
We’re reporting a similar issue from our HPC users here at the University of Virginia. Currently the code does not handle this particular case correctly. If SLURM_NODELIST is e.g. abc1,def2, the assigned nodes will appear as a single string in the log file:
assigned nodes: ['abc1,def2']
Our users had to patch it by adding an elif block to handle a comma-separated string.
However, apart from patching the SLURM_NODELIST parser, I’d like to point out an alternative solution that I believe is cleaner and foolproof. There is a native Slurm command, scontrol show hostnames, that converts a SLURM_NODELIST string into a list of nodes, one on each line. I provide a minimal working example to illustrate this:
$ export A=abc1,def2 # A is equivalent to SLURM_NODELIST
$ python
>>> import os
>>> import subprocess
>>> process = subprocess.Popen(['scontrol', 'show', 'hostnames', os.environ['A']], stdout=subprocess.PIPE)
>>> output, error = process.communicate()
>>> output.decode('ascii').splitlines()
['abc1', 'def2']
By using this method there will be no need for the entire if-else block in parse_slurm_allocation() of utilities.py.
-
repo owner Thanks @Ruoshi Sun! I just put that into the latest commit and it’s working on our system at Brown so far. It would be great if you can verify it works for you too. Also, perhaps @Eric Musa might want to verify this works at NERSC?
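For anyone wanting to check the approach on their own cluster, here is a minimal sketch of the scontrol-based conversion wrapped in a function. It is not the code from the commit: the name `slurm_hostnames` and the `run` parameter are hypothetical, added only so the function can be exercised on a machine without Slurm by passing a stand-in for `subprocess.check_output`.

```python
import os
import subprocess


def slurm_hostnames(nodelist=None, run=subprocess.check_output):
    """Return the host names in a Slurm node list, one string per node.

    nodelist: a string in SLURM_NODELIST notation; defaults to the
        environment variable of the running job.
    run: hypothetical hook, a callable taking the command list and
        returning its stdout as bytes.
    """
    if nodelist is None:
        nodelist = os.environ['SLURM_NODELIST']
    # 'scontrol show hostnames' prints one host name per line.
    output = run(['scontrol', 'show', 'hostnames', nodelist])
    return output.decode('ascii').splitlines()
```

Inside a job, calling `slurm_hostnames()` with no arguments reads SLURM_NODELIST directly; `subprocess.check_output` raises `CalledProcessError` if scontrol exits with an error, which may be a convenient place to report an unparsable node list.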
-
repo owner - changed status to resolved
-
Thank you so much for your very prompt update! I’ve passed this along to the AMP users here.