upcxx-run (upcxx_srun) can't be used when specialized cores are requested on Cori (-S option)

Issue #178 resolved
Mathias Jacquelin created an issue

upcxx-run fails in that specific case. The upcxx_srun script, which is called by the launcher, does not take into account the number of specialized cores requested by the user (salloc -S C -N 1). In this case, the number of processes should be 68-C on Cori KNL. Instead, upcxx_srun uses a hardcoded value of 68, so it always requests more cores than are available.

Here is a proposed solution:

@@ -35,7 +35,7 @@
 case "$cpu" in
   ivb|ivybridge) cores=24; thr=2;;
   hsw|haswell)   cores=32; thr=2;;
-  knl|mic-knl)   cores=68; thr=4;;
+  knl|mic-knl)   thr=4; cores=$((($SLURM_CPUS_ON_NODE)/$thr));;
     *) echo "ERROR: Unknown cpu type '$cpu'" >&2; exit 1;;
 esac
 ##
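
The arithmetic behind the patch can be sketched as a small standalone helper. This is only an illustration, not part of upcxx_srun: the function name `cores_per_node` and its two arguments are hypothetical. It also assumes, as the patch does, that SLURM_CPUS_ON_NODE reports the hardware threads left to the job after specialized cores are set aside.

```shell
#!/bin/sh
# Hypothetical helper (not in upcxx_srun): derive the usable cores per node
# from the CPU type and the hardware-thread count reported by Slurm.
cores_per_node() {
  cpu=$1
  cpus_on_node=$2   # typically "$SLURM_CPUS_ON_NODE"
  # Hardware threads per core, taken from the case statement in the patch.
  case "$cpu" in
    ivb|ivybridge) thr=2;;
    hsw|haswell)   thr=2;;
    knl|mic-knl)   thr=4;;
    *) echo "ERROR: Unknown cpu type '$cpu'" >&2; return 1;;
  esac
  # Assumption: fail loudly rather than divide an empty value.
  if test -z "$cpus_on_node"; then
    echo "ERROR: SLURM_CPUS_ON_NODE is not set" >&2; return 1
  fi
  echo $((cpus_on_node / thr))
}
```

For example, `cores_per_node knl 272` prints 68 (a full KNL node), while `cores_per_node knl 256` prints 64, which is what one would expect if 4 cores were reserved with -S 4.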

Comments (6)

  1. Paul Hargrove

    @mjacquelin Is there any reason the same logic should not apply to Edison (ivb) and Cori-I (hsw)?

    I think the desired change is the following. Do you agree?

    --- upcxx_srun~ 2018-10-30 11:25:37.216659000 -0700
    +++ upcxx_srun  2018-10-30 11:27:33.326565000 -0700
    @@ -33,11 +33,12 @@
     fi
     ##
     case "$cpu" in
    -  ivb|ivybridge) cores=24; thr=2;;
    -  hsw|haswell)   cores=32; thr=2;;
    -  knl|mic-knl)   cores=68; thr=4;;
    +  ivb|ivybridge) thr=2;;
    +  hsw|haswell)   thr=2;;
    +  knl|mic-knl)   thr=4;;
         *) echo "ERROR: Unknown cpu type '$cpu'" >&2; exit 1;;
     esac
    +cores=$(($SLURM_CPUS_ON_NODE/$thr))
     ##
     if test -z "$nnode" || test -z "$nproc"; then
       echo "ERROR: Unable to determine job geometry." >&2
    
  2. Paul Hargrove

    I have made the change I proposed above to the "live" script.
    I have done some basic testing, but am not sure I covered all the cases.

    Please test ASAP to let me know if anything you would normally do is now broken.
    I doubt it is the case, but I should revert soon if anything did break.
