Error when running on multiple nodes.

Issue #363 closed
Ngoc Phuong Chau created an issue

Hi all,

I ran my program on 2 nodes. All settings are OK; I double-checked them with the admins.

The error is as follows:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        c2-5
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
(the same message is repeated four more times)

The admins said that when using OpenMPI, the option to solve this problem is "--bind-to none".

I believe the previous error messages come from the fact that when two different jobs run on the same compute node, they both "think" they are alone on the node and therefore both try to run on the same first cores. One job succeeds in getting those cores, while the second job gets an error message like the one you got, indicating that the cores are not idle.

I do not know which option is suitable for UPC++. Could you advise?

Comments (9)

  1. Paul Hargrove

    You have not provided details of your configuration other than what I could infer from your previous installation-time issue.
    So, I am assuming GASNet's default use of mpirun to launch ibv-conduit executables.
    Based on that assumption and the suggestion from your admins, please try setting the following in your environment at application run time:

    MPIRUN_CMD="mpirun --bind-to none -np %N %C"
    

    If that is effective, then you have the option to re-install UPC++ with that same setting in your environment at install time.
    That will make this the default value so that you don't need to set it at run time.

    If this is not effective, let us know and we'll look for another solution.
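
    For concreteness, here is a minimal run-time sketch, assuming a bash shell, launch via upcxx-run, and a hypothetical executable ./my_app running 8 processes on 2 nodes:

    # Hypothetical example; %N (process count) and %C (command) are substituted by GASNet's MPI spawner.
    export MPIRUN_CMD="mpirun --bind-to none -np %N %C"
    upcxx-run -n 8 -N 2 ./my_app
    # Setting the same variable in your environment before re-installing UPC++
    # would make it the default, as noted above.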

  2. Ngoc Phuong Chau reporter
    • changed status to open

    Hi, after I reinstalled UPC++ the program worked with 4 and 8 nodes. However, with 16 and 32 nodes I get the same errors. If I run the program by itself, it works. When I run more than two sbatch files, some cores on the assigned nodes conflict as described above.

    Could you please give me some suggestions on how to add "--bind-to none" to UPCXX?

    Thanks,

  3. Paul Hargrove

    If you are using ibv-conduit and defaults (you did not correct me earlier when I stated these as my assumptions) then the upcxx-run script is ultimately running the $MPIRUN_CMD described in a previous comment.

    So, if you followed my previous suggestion then you have already "added --bind-to none to UPCXX". If I understand you correctly, the problem is now occurring only in batch jobs. That makes me suspect that the batch job is not running with the new setting of $MPIRUN_CMD.

    The easiest thing to try would be to add export MPIRUN_CMD="mpirun --bind-to none -np %N %C" to your batch scripts to see if that resolves the problem. If it does not, then you can jump to my next suggestion (experimenting with mpirun). However, if that resolves the problem, then it means your re-install did not capture the new setting as a default. If that is the case, we can help sort out what went wrong (or you can probably just set this variable in your .bashrc and leave the install as it is now).
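
    As an illustration, a minimal sbatch script along these lines would carry the setting into the batch job (this is only a sketch, assuming a bash shell and Slurm; the job name, node/process counts, and executable name are placeholders):

    #!/bin/bash
    #SBATCH --job-name=upcxx_test      # placeholder job name
    #SBATCH --nodes=16                 # one of the failing node counts

    # Make the OpenMPI binding workaround visible to upcxx-run inside the job:
    export MPIRUN_CMD="mpirun --bind-to none -np %N %C"

    upcxx-run -n 32 -N 16 ./my_app     # placeholder process/node counts and executable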

    If you still have problems after setting the environment variable within the batch job, then I would suggest you experiment using mpirun directly (instead of upcxx-run) in the sbatch scenarios where you see errors. If use of mpirun --bind-to none ... to launch does not eliminate your errors, then you should work with your system administrator(s) to sort out the problem.
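
    For that direct-mpirun experiment, the launch would look roughly like this (again a sketch; the process count and executable name are placeholders):

    mpirun --bind-to none -np 32 ./my_app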

    One thing to keep in mind when using mpirun directly to launch UPC++ applications is that upcxx-run's command-line options such as -shared-heap=N and -backtrace are not available. If you would normally use those options, let us know and we can provide their respective environment variable equivalents.

  4. Ngoc Phuong Chau reporter

    Hi,

    Thank you so much!

    This was my mistake. I added

    export MPIRUN_CMD="mpirun --bind-to none -np %N %C"
    

    to the batch files for 4 and 8 nodes. However, I forgot to put it in the batch files for 16 and 32 nodes.

    Thanks,
