Validate ARM64 architecture

Issue #237 resolved
Dan Bonachea created an issue

GASNet-EX supports the ARM64 architecture (aarch64).
UPC++ currently does not advertise any support for ARM, but we have reason to believe this architecture is becoming increasingly relevant.

Nightly tests on an ARM64/ibv system provide some evidence that UPC++ support on ARM64 is possible/working.

We should perform further exploratory UPC++ testing on ARM64 to decide whether to pursue an official support claim (which includes determining floor compiler versions).

Comments (4)

  1. Paul Hargrove

    I have access to two ARM64/InfiniBand systems.

    • "Mustangs" at JLSE (ANL) on which we already run twice weekly tests with UPC++ and g++-7.3.0.
    • "Wombat" at ORNL on which we cannot (or could not?) run tests because GASNet was crashing the IB drivers.

    The "mustangs" system is just 2 nodes, with 8 cores each.
    Meanwhile, "wombat" is a 16-node cluster, having (2 CPUs * 28-cores * 4-way SMT) = 224 threads per node.

    I have just now completed installing gcc and clang versions that will allow me to test the same "floor" versions we have documented for other platforms, as well as the newest of each compiler family. I will report here when I have attempted to verify UPC++ with each of the 4 compilers, using one or both systems.

  2. Paul Hargrove

    TL;DR: 👍

    Initial testing looks good and I will proceed with plans for automated testing, with the intent that we gain confidence to document ARM64 support in our 2019.9.0 release.

    Full version:

    Overnight, both systems ran our CI for all four compilers: {clang,gcc}X{oldest,newest}.
    On Mustangs, runs used ibv-conduit and ran 4 ranks (2ppn on the 2 nodes) in most tests.
    On Wombat, runs used udp-conduit and ran 8 ranks (1ppn).
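
    For readers who want the test matrix and rank counts spelled out, here is a minimal sketch. The label format (`clang-oldest`, etc.) and the helper names `compiler_matrix` and `total_ranks` are hypothetical and not part of the actual CI scripts; the node and ppn values come from the description above.

```python
from itertools import product

# Compiler families and version "ages" as described in the comment above.
FAMILIES = ["clang", "gcc"]
AGES = ["oldest", "newest"]


def compiler_matrix(families, ages):
    """Return every family/age combination as a label (hypothetical format)."""
    return [f"{family}-{age}" for family, age in product(families, ages)]


def total_ranks(nodes, ppn):
    """Total ranks for a job using `nodes` nodes with `ppn` processes per node."""
    return nodes * ppn


matrix = compiler_matrix(FAMILIES, AGES)
mustangs_ranks = total_ranks(nodes=2, ppn=2)  # ibv-conduit runs on Mustangs
wombat_nodes = 8 // 1                         # 8 ranks at 1ppn implies 8 nodes used

print(matrix)
print(mustangs_ranks, wombat_nodes)
```

    Under these assumptions the matrix has four entries, matching the four compilers tested on each system.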

    The results are "clean", other than two instances of SEGV from issue138-par in O3 mode, which we've seen before.

    This batch of "one-shot" tests is not a replacement for regular automated testing, which may show timing-dependent failures or other rare events. However, IMO, this initial testing indicates we should pursue listing this as a supported platform in the next release.

    So, my next task is to determine how best to pursue automated testing...

    The 2-node nature of Mustangs makes it barely sufficient, and it is old hardware (not HPC-relevant). Additionally, an agreement in place with the admin to limit my automated use, to avoid impacting other users, would need to be "renegotiated" if I were to run more than my current 5-hour slot twice per week. So, this system can be a component of our testing on this architecture, but seems unsuited to being the core of it.

    The tests last night on Wombat used udp-conduit, because we had previously been crashing the IB drivers. So, I think my first priority relative to this issue is to work with the admin to see if the current IB drivers are stable. If so, then I can deploy nightly testing over ibv-conduit. Otherwise, udp-conduit is better than not testing.

    I should note that I also have access to one single-node ARM64 system where we've been successfully running UPC++ tests over smp- and udp-conduit (disabling shared-memory for the latter). However, like Mustangs, that is also an 8-core node (not very HPC-relevant).

  3. Paul Hargrove

    The Mustangs are now testing the four compilers in rotation, one each day on Mon...Thu.
    Proposed changes to README.md and utils/system-checks.sh are now in PR 100, which is on hold pending sufficient test history.
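
    The Mon...Thu rotation can be sketched as a simple weekday lookup. This is only an illustration: the assignment of a particular compiler to a particular day is an assumption (the comment does not state the order), and `compiler_for` is a hypothetical helper, not part of the actual test harness.

```python
import datetime

# Assumed order; the thread only says "one each day on Mon...Thu".
# Keys follow datetime.date.weekday(): Monday is 0, Thursday is 3.
ROTATION = {
    0: "gcc-oldest",
    1: "gcc-newest",
    2: "clang-oldest",
    3: "clang-newest",
}


def compiler_for(day):
    """Return the compiler label scheduled for `day`, or None on Fri-Sun."""
    return ROTATION.get(day.weekday())


print(compiler_for(datetime.date.today()))
```

    With one compiler per day, each configuration is exercised once per week, so building a test history across all four takes correspondingly longer than a nightly run of the full matrix would.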
