Test failures with oneAPI compiler versions 2022.2.0, 2022.2.1, and 2023.0.0

Issue #603 new
Paul Hargrove created an issue

First the good news:
The defect described in this issue has not been observed to occur with icpx releases older or newer than those listed in the subject line.

The following tests have been observed to run to completion, but print numerically incorrect results in opt codemode when compiled with certain versions of the oneAPI compilers:

  • example/prog-guide/rb1d.cpp
  • example/prog-guide/rb1d-rpc.cpp
  • example/prog-guide/rb1d-rpcinit.cpp
  • upcxx-extras::tutorials/2021-11/examples/jac1d.cpp
  • upcxx-extras::tutorials/2021-11/solutions/ex2.cpp

So far this is occurring for all threadmode and network combinations I've tried, including {seq,par}X{smp,ibv,ofi/cxi}. It has not occurred in any debug codemode trials.
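
The combinations listed above could be enumerated with a loop like the following. This is only a dry-run sketch that prints command lines rather than running them: the function name and output file names are hypothetical, `ofi` stands in for the ofi/cxi configuration (provider selection is not shown), and it assumes the `upcxx` wrapper's `-threadmode=` option alongside the `-network=` and `-O` options used elsewhere in this issue.

```shell
# Hypothetical helper: print one upcxx command line per threadmode/network pair.
# Nothing is compiled here; the lines are only echoed for review.
print_build_matrix() {
  for threadmode in seq par; do
    for network in smp ibv ofi; do
      echo "upcxx -O -threadmode=$threadmode -network=$network rb1d.cpp -o rb1d-$threadmode-$network"
    done
  done
}

print_build_matrix
```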

Testing of five oneAPI releases (believed to be consecutive) yields:

  • 2022.1.0 GOOD
  • 2022.2.0 BAD
  • 2022.2.1 BAD
  • 2023.0.0 BAD
  • 2023.1.0 GOOD

Therefore, upgrading to the 2023.1.0 versions of icx and icpx is the recommended work-around.
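
The GOOD/BAD list above can be encoded as a small version check, for example in a build script. This is a minimal sketch, not part of UPC++ or its configure script: the function name is hypothetical, and how the version string is obtained from the compiler is deliberately left out because its output format is not covered in this issue.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given oneAPI version
# string is one of the releases observed to misbehave in this issue.
is_affected_oneapi() {
  case "$1" in
    2022.2.0|2022.2.1|2023.0.0) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: warn when a known-bad version is supplied.
version="2023.0.0"
if is_affected_oneapi "$version"; then
  echo "oneAPI $version is affected; 2023.1.0 is the recommended work-around" >&2
fi
```

Matching an explicit list, rather than a version range, mirrors what was actually tested: only these three releases were observed to fail.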

It is unknown whether these failures are the result of a compiler problem or of UB (undefined behavior) in these tests. So, the first task related to this issue should be ruling out UB.

Comments (4)

  1. Paul Hargrove reporter

    Update with today's progress.

    First, a clarification/correction: While I characterized the results as "numerically incorrect", some might not agree that is the right way to describe what is occurring. The output from rb1d is the number of iterations to converge and the max error when converged. The results for the compilers listed show convergence at a different iteration count (and with a different error) than all other compilers tested (oneAPI or otherwise). I've taken this to indicate a numerically different (presumed "incorrect") solution.

    The first output below demonstrates that with the most recent oneAPI compiler release on Dirac, the result is independent of the process count. However, the second shows one of the problematic compiler versions yields a different result for process counts 1, 2 and 4 (4 processes is where the "numerically incorrect results" were observed).

    $ ./B-oneapi-2023.1.0/opt/upcxx/bin/upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp -o rb1d-23.1
    $ for i in 1 2 4 8 16; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./rb1d-23.1; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    NP=16    Converged at 5590, err 4.99825
    
    $ ./B-oneapi-2023.0.0/opt/upcxx/bin/upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp -o rb1d-23.0
    $ for i in 1 2 4 8 16; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./rb1d-23.0; done
    NP=1    Converged at 6800, err 4.99931
    NP=2    Converged at 6340, err 4.99885
    NP=4    Converged at 5030, err 4.99657
    NP=8    Converged at 5590, err 4.99825
    NP=16    Converged at 5590, err 4.99825
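
The comparison above, where every process count should report the same result, can be automated with a small filter. This is a sketch under stated assumptions: the function name is hypothetical, and it only checks that the result lines are identical to each other, not that they match any particular reference value.

```shell
# Hypothetical helper: reads one result line per run on stdin and succeeds
# only if all lines are identical (empty input also counts as consistent).
results_agree() {
  awk 'NR == 1 { first = $0 } $0 != first { differ = 1 } END { exit (differ ? 1 : 0) }'
}

# Usage sketch (assumes ./rb1d-23.0 was built as shown in the transcript):
#   for i in 1 2 4 8 16; do env GASNET_PSHM_NODES=$i ./rb1d-23.0; done \
#     | results_agree || echo "result depends on process count"
```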
    

    Demonstration that other compiler families (still on Dirac) do not have this behavior (where "intel" below is the "Classic" compilers, not oneAPI):

    $ module load upcxx/2023.3.0
    $ module load PrgEnv/nvidia
    $ upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp
    $ for i in 1 2 4 8; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./a.out; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    
    $ module swap PrgEnv PrgEnv/gnu
    $ upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp
    $ for i in 1 2 4 8; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./a.out; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    
    $ module swap PrgEnv PrgEnv/llvm
    $ upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp
    $ for i in 1 2 4 8; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./a.out; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    
    $ module swap PrgEnv PrgEnv/aocc
    $ upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp
    $ for i in 1 2 4 8; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./a.out; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    
    $ module swap PrgEnv PrgEnv/intel
    $ upcxx -network=smp -O ~/upcxx/example/prog-guide/rb1d.cpp
    icpc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
    $ for i in 1 2 4 8; do echo -n "NP=$i    "; env GASNET_PSHM_NODES=$i ./a.out; done
    NP=1    Converged at 5590, err 4.99825
    NP=2    Converged at 5590, err 4.99825
    NP=4    Converged at 5590, err 4.99825
    NP=8    Converged at 5590, err 4.99825
    
  2. Paul Hargrove reporter

    Related FYI: I've filed NERSC ticket INC0204441 to request installation of Intel's 2023.1.0 compilers on Perlmutter which would enable us to begin CI testing of PrgEnv-intel w/o the need to address this issue.

  3. Paul Hargrove reporter

    Today I verified that the current 2023.1.0 version of the oneAPI compilers works correctly on Perlmutter. Therefore, I no longer have plans to "fix" this issue. However, this issue remains open pending internal discussions regarding the possibility of rejecting the impacted compiler versions at configure time.

    For the benefit of those who read issue trackers from the bottom-up:

    Upgrade to the 2023.1.0 (or later) versions of icx and icpx is the recommended work-around.
    It is unknown if the incorrect behaviors reported here are a result of a compiler problem or UB in either the tests or UPC++.
