Error when running on multiple nodes.
Hi all,
I ran my program on 2 nodes. All settings are OK; I double-checked with the admins.
The error is as below:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: c2-5
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
The admins said that with OpenMPI the option to solve this problem is "--bind-to none".
I believe the previous error messages come from the fact that when 2 different jobs run on the same compute node, they both "think" they are alone on the node and therefore both try to run on the same first cores. One job succeeds in getting those cores, while the 2nd job gets an error message like the one above, indicating that the cores are not idle.
I do not know which option is suitable for UPC++. Could you advise?
Comments (9)
-
reporter It worked with mpirun.
I re-installed upcxx, and it worked.
Thanks,
-
- changed status to closed
-
reporter - changed status to open
Hi, after I reinstalled upcxx, the program worked with 4 and 8 nodes. However, with 16 and 32 nodes, the same errors occur. If I run only one program, it works. When I run more than two sbatch files, some cores on the assigned nodes have conflicts as described above.
Could you please give me some suggestions for adding "--bind-to none" to UPC++?
Thanks,
-
If you are using ibv-conduit and defaults (you did not correct me earlier when I stated these as my assumptions), then the upcxx-run script is ultimately running the $MPIRUN_CMD described in a previous comment. So, if you followed my previous suggestion, you have already "added --bind-to none to UPCXX". If I understand you correctly, the problem is now occurring only in batch jobs. That makes me suspect that the batch jobs are not running with the new setting of $MPIRUN_CMD.
The easiest thing to try would be to add
export MPIRUN_CMD="mpirun --bind-to none -np %N %C"
to your batch scripts to see if that resolves the problem. If it does not, then you can jump to my next suggestion (experimenting with mpirun). However, if it does resolve the problem, then it means your re-install did not capture the new setting as a default. If that is the case, we can help sort out what went wrong (or you can probably just set this variable in your .bashrc and leave the install as it is now).
If you still have problems after setting the environment variable within the batch job, then I would suggest you experiment with mpirun directly (instead of upcxx-run) in the sbatch scenarios where you see errors. If launching with mpirun --bind-to none ... does not eliminate your errors, then you should work with your system administrator(s) to sort out the problem.
One thing to keep in mind when using mpirun directly to launch UPC++ applications is that upcxx-run's command-line options, such as -shared-heap=N and -backtrace, are not available. If you would normally use those options, let us know and we can provide their respective environment-variable equivalents. -
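The advice above could be sketched as a minimal sbatch job script. This is an illustration only, not a verified script for the reporter's cluster: the node count, time limit, process count, and application name ./a.out are all assumptions.

```shell
#!/bin/bash
#SBATCH --nodes=16            # one of the node counts where the error appeared
#SBATCH --time=00:30:00

# Make upcxx-run launch via mpirun without core binding, so two jobs
# sharing a compute node do not fight over the same first cores:
export MPIRUN_CMD="mpirun --bind-to none -np %N %C"

# Hypothetical executable name; replace with your own program:
upcxx-run -n 16 ./a.out
```

Setting the variable inside the script (rather than relying on the login-shell environment) ensures the batch job sees it regardless of how the job's environment is propagated.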
- changed component to Support: Installation
-
reporter Hi,
Thank you so much!
This was my mistake. I added
export MPIRUN_CMD="mpirun --bind-to none -np %N %C"
to the batch files for 4 and 8 nodes. However, I forgot to put it in the batch files for 16 and 32 nodes.
Thanks,
-
reporter - changed status to resolved
-
reporter - changed status to closed
You have not provided details of your configuration other than what I could infer from your previous installation-time issue.
So, I am assuming GASNet's default use of mpirun to launch ibv-conduit executables.
Based on that assumption and the suggestion from your admins, please try setting the following in your environment at application run time:
export MPIRUN_CMD="mpirun --bind-to none -np %N %C"
If that is effective, then you have the option to re-install UPC++ with that same setting in your environment at install time.
That will make it the default value, so that you don't need to set it at run time.
If it is not effective, let us know and we'll look for another solution.
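To make the MPIRUN_CMD template concrete, here is a small sketch of how the %N and %C placeholders get filled in at launch time: %N becomes the process count and %C the application command line. The substitution below is an illustration (the program name ./my_app is made up), not the launcher's actual code.

```shell
#!/bin/bash
# Illustration only: mimic the %N/%C template substitution applied
# to MPIRUN_CMD when a job is launched (./my_app is hypothetical).
MPIRUN_CMD='mpirun --bind-to none -np %N %C'
NPROC=16
CMD='./my_app'

launch=${MPIRUN_CMD//%N/$NPROC}   # %N -> number of processes
launch=${launch//%C/$CMD}         # %C -> the command to run
echo "$launch"
# → mpirun --bind-to none -np 16 ./my_app
```

This is why the quoting matters: the whole template is stored in one variable, and the launcher expands it into the final mpirun command, flags included.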