Concern over startup cost of CCS primary segment validation

Issue #593 resolved
Paul Hargrove created an issue

In our 2023.03.21 Pagoda meeting we realized that the current CCS logic probably performs an unconditional checksum of the primary code segment (normally the text segment of the executable) on most Linux systems (unless --build-id is a default linker flag). In the absence of a staging mechanism such as Slurm's sbcast, this means demand-paging of the entire code segment over the network filesystem by every single UPC++ process (simultaneously, but not "cooperatively").

We discussed two potential ways to avoid this cost:

  1. Link the application with -Wl,--build-id on Linux (the equivalent is the default on macOS). This works because it creates a .note.gnu.build-id section in the ELF executable, which UPC++ uses in place of the checksum operation (see the sketch after this list). Automating this will be the subject of a distinct RFE; in the meantime, it is something a user can do on their own to avoid the non-scalable I/O.

  2. Change our runtime to compute the checksum only in a DEBUG build, since it is currently unused in an OPT build. However, there was discussion of the value of performing primary segment validation unconditionally, which would render this suggestion moot.
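
As a minimal sketch of the mechanism behind suggestion 1 (not the actual UPC++ runtime code): on Linux a program can locate its own .note.gnu.build-id by walking the PT_NOTE program headers via dl_iterate_phdr, e.g. after building with g++ -Wl,--build-id. The note name, type constant, and 4-byte padding below follow the GNU build-id convention:

    #include <elf.h>
    #include <link.h>
    #include <cstdio>
    #include <cstring>

    // Scan the program headers of the first object visited (the main
    // executable) for an NT_GNU_BUILD_ID note and print it as hex.
    // Assumes the common 4-byte note alignment used by GNU build-id.
    static int find_build_id(struct dl_phdr_info *info, size_t, void *) {
      for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_NOTE) continue;
        const char *p   = (const char *)(info->dlpi_addr + ph->p_vaddr);
        const char *end = p + ph->p_memsz;
        while (p + sizeof(ElfW(Nhdr)) <= end) {
          const ElfW(Nhdr) *nh = (const ElfW(Nhdr) *)p;
          const char *name = p + sizeof(ElfW(Nhdr));
          const char *desc = name + ((nh->n_namesz + 3) & ~3u);
          if (nh->n_type == NT_GNU_BUILD_ID &&
              nh->n_namesz == 4 && memcmp(name, "GNU", 4) == 0) {
            printf("build-id: ");
            for (unsigned j = 0; j < nh->n_descsz; j++)
              printf("%02x", (unsigned char)desc[j]);
            printf("\n");
            return 1;  // found it; stop iterating
          }
          p = desc + ((nh->n_descsz + 3) & ~3u);
        }
      }
      return 1;  // only inspect the main executable
    }

    int main() { dl_iterate_phdr(find_build_id, nullptr); return 0; }

The presence of the note can also be confirmed from the shell with readelf -n on the executable.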

Comments (2)

  1. Rob Egan

    On Perlmutter, starting 1024 nodes @ 128 PPN, I see

    upcxx::init Before=0.00/0.53/0.84/1.30, 424:47506 bal=0.65
    upcxx::init After=11.56/12.32/12.01/12.86 s, 0.93
    upcxx::init FirstBarrier=0.00/0.00/0.00/0.00 s, 0.89 reduct 1.2

    Here the executables all started within an average of 0.84 s and a max of 1.30 s of each other (assuming all the system clocks are in sync), and upcxx::init completed in an average of 12.01 s and a max of 12.86 s.

    Over many runs this max time ranges from 12 to 37 seconds for 1024-node runs, 9 to 30 seconds (with one 109 s outlier) for 128-node runs, 10-11 seconds for 4-node runs, and 6 seconds for a single node (with one 33 s outlier). To be clear, I do not think this is excessive, but it would be nice if it could be reduced.

    With this option added to the final link, I am unable to detect a difference on Perlmutter at 1 or 2 nodes, but I think it will likely squash at least some of these outliers, which happen on maybe 10% of large runs. I used the nightly build.
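
    For reference, a minimal sketch of how per-rank min/avg/max init timings like those above could be collected (the numbers quoted here come from upcxx-utils' own timers, whose exact fields may differ, and measuring launch skew across nodes additionally requires synchronized wall clocks):

        #include <upcxx/upcxx.hpp>
        #include <chrono>
        #include <cstdio>

        int main() {
          using clock = std::chrono::steady_clock;
          auto t0 = clock::now();
          upcxx::init();                      // the interval being measured
          double mine =
              std::chrono::duration<double>(clock::now() - t0).count();

          // Reduce the per-rank durations to rank 0 and report min/avg/max.
          double mn  = upcxx::reduce_one(mine, upcxx::op_fast_min, 0).wait();
          double mx  = upcxx::reduce_one(mine, upcxx::op_fast_max, 0).wait();
          double sum = upcxx::reduce_one(mine, upcxx::op_fast_add, 0).wait();
          if (upcxx::rank_me() == 0)
            printf("upcxx::init min/avg/max = %.2f/%.2f/%.2f s\n",
                   mn, sum / upcxx::rank_n(), mx);
          upcxx::finalize();
          return 0;
        }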

  2. Rob Egan

    I didn't see any measurable improvement on Crusher @ 128 nodes either, but I think the idea behind it is worthwhile. I'll be putting this into CMake for MHM2 and upcxx-utils.
