Different PASS/FAILED results due to inconsistent results of Intel MKL

Issue #55 resolved
Luise Chen created an issue

We found that some tests, such as testing_dtrsm, produce different results on A100 cards with the same CUDA environment when the host CPUs differ.

As shown in the following figure, the test_dtrsm case produces different results on an AMD EPYC 7742 and an Intel 6154. The batches of normR values, computed with blasf77_dtrmm and blasf77_daxpy, differ between the two CPUs, while the normX values of hBdev are consistent.

We used MKL 2019.0.5 for the blasf77_* routines.

My questions are:

  1. How should failures caused by inconsistent MKL results be treated?
  2. How is the error tolerance of each test determined? Is it strongly recommended to follow these values?

Comments (2)

  1. Mark Gates

    It would take some investigation to understand exactly what is going on here, but the main problem is that it is a little difficult to generate a well-conditioned, unit diagonal (-DU in your tests), triangular matrix, and the current code doesn’t take into account the matrix’s condition number. The code currently generates a Hermitian positive definite matrix and factors it using Cholesky, which creates a well-conditioned non-unit diagonal triangular matrix, then arbitrarily replaces the non-unit diagonal with a unit diagonal. Using LU might work better — the L is unit diagonal, the U is non-unit diagonal.
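To illustrate the LU suggestion above, here is a minimal pure-Python sketch, not MAGMA's test code. It factors a random matrix with partial pivoting and keeps the L factor, which has an exact unit diagonal by construction and, because of the pivoting, off-diagonal entries bounded by 1 in magnitude. The function name `lu_partial_pivot`, the matrix size, and the seed are illustrative assumptions; the sketch makes no claim about the resulting condition number.

```python
import random

def lu_partial_pivot(A):
    """Return (L, U, perm) with A[perm[i]] == (L*U)[i]; L is unit lower triangular."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[p] = U[p], U[k]
        L[k], L[p] = L[p], L[k]          # only columns < k are filled so far
        perm[k], perm[p] = perm[p], perm[k]
        L[k][k] = 1.0
        for i in range(k + 1, n):
            # Assumes a nonsingular matrix, so the pivot U[k][k] is nonzero.
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U, perm

random.seed(0)                            # illustrative seed
n = 6                                     # illustrative size
A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
L, U, perm = lu_partial_pivot(A)

max_offdiag = max(abs(L[i][j]) for i in range(n) for j in range(i))
print("unit diagonal:", all(L[i][i] == 1.0 for i in range(n)))
print("largest |L[i][j]| below the diagonal:", max_offdiag)
```

The L factor can then serve as the unit-diagonal triangular test matrix directly, without overwriting a diagonal after the fact.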

  2. Ahmad Abdelfattah

    Hi,

    We are making a sweep over the lingering issues in MAGMA. This one should be resolved as of 17472eb.

    The failures can be avoided by using either the Frobenius norm or the infinity norm instead of the max norm when computing normA/R/X. The inconsistent MKL results are unrelated to MAGMA, but MKL can be expected to behave differently on Intel and AMD CPUs (e.g. by disabling certain optimizations on non-Intel hardware).
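As a small illustration of how the three norms differ, the pure-Python sketch below compares the ratio norm(A*X) / (norm(A) * norm(X)) for an all-ones matrix. It assumes, without confirmation from this thread, that the test scales its residual by a product of norms such as normA * normX. The max norm is not submultiplicative, so this ratio can grow with the matrix size, while it stays at most 1 for the Frobenius and infinity norms. Whether this property is the reason the fix works is not stated in the thread; the helper names are illustrative.

```python
import math

def max_norm(M):
    """Largest absolute entry (LAPACK dlange norm 'M')."""
    return max(abs(x) for row in M for x in row)

def inf_norm(M):
    """Largest absolute row sum (LAPACK dlange norm 'I')."""
    return max(sum(abs(x) for x in row) for row in M)

def fro_norm(M):
    """Square root of the sum of squared entries (LAPACK dlange norm 'F')."""
    return math.sqrt(sum(x * x for row in M for x in row))

def matmul(A, B):
    """Plain triple-loop matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

n = 4                                    # illustrative size
A = [[1.0] * n for _ in range(n)]
X = [[1.0] * n for _ in range(n)]
AX = matmul(A, X)

ratios = {}
for name, norm in (("max", max_norm), ("inf", inf_norm), ("fro", fro_norm)):
    ratios[name] = norm(AX) / (norm(A) * norm(X))
    print(f"{name}: {ratios[name]:.2f}")
# prints max: 4.00, inf: 1.00, fro: 1.00
```

With the max norm the product of the factor norms underestimates the scale of A*X by a factor of n in this example, so a residual divided by that product looks correspondingly larger.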
