magmablas_cher2k() might lead to failure of cheevd_gpu

Issue #46 resolved
Luise Chen created an issue

Hi,

When testing cheevd_gpu on ROCm 4.2, we found that magmablas_cher2k(), as called from magma_cher2k(), can lead to inaccurate results, whereas using hipblasCher2k() makes the tests pass. We suggest adopting hipblasCher2k() for magma_cher2k().
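
For context, the operation at issue is the BLAS Hermitian rank-2k update, C := alpha*A*B^H + conj(alpha)*B*A^H + beta*C. The following is a minimal pure-Python sketch of that definition for the no-transpose case. It is not the MAGMA or hipBLAS implementation, and the names `cher2k_reference` and `is_hermitian` are hypothetical. Unlike the real routines, it writes the full matrix rather than only the `uplo` triangle, so the Hermitian structure of the result is visible.

```python
def cher2k_reference(alpha, A, B, beta, C):
    """Return alpha*A*B^H + conj(alpha)*B*A^H + beta*C as a full n-by-n matrix.

    A and B are n-by-k lists of rows, C is n-by-n, beta is real.
    Hypothetical helper, for illustration only.
    """
    alpha = complex(alpha)
    n, k = len(A), len(A[0])
    out = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0j
            for l in range(k):
                # Entry (i, j) of A*B^H is sum_l A[i][l] * conj(B[j][l]).
                acc += alpha * A[i][l] * B[j][l].conjugate()
                # Entry (i, j) of B*A^H is sum_l B[i][l] * conj(A[j][l]).
                acc += alpha.conjugate() * B[i][l] * A[j][l].conjugate()
            out[i][j] = acc + beta * C[i][j]
    return out


def is_hermitian(M, tol=0.0):
    """Check that M equals its conjugate transpose up to tol."""
    n = len(M)
    return all(abs(M[i][j] - M[j][i].conjugate()) <= tol
               for i in range(n) for j in range(n))
```

A reference like this is only practical for tiny matrices, but it gives an independent result against which either GPU path could be compared.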

On the master branch at commit aed4e285084763113ce5757393d4008e27b5194b, we see the following failures:

root@01821ad0c545:~/magma# ./testing/testing_cheevd_gpu --version 1 -L -JV -c -n 3000 -n 4000
% MAGMA 2.6.0 svn 32-bit magma_int_t, 64-bit pointer.
% HIP runtime 3275, driver 327500. OpenMP threads 72.
% device 0: Device 738c, 1502.0 MHz clock, 32752.0 MiB memory, capability 9.0
% Fri Jul 2 21:08:30 2021
% Usage: ./testing/testing_cheevd_gpu [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, version = 1 (cheevd_gpu)
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
 3000        ---             10.4894          ---        2.90e-06    6.43e-08   failed
 4000        ---              1.7848          ---        1.97e-06    6.85e-08   failed

root@01821ad0c545:~/magma# ./testing/testing_cheevd_gpu --version 1 -U -JV -c -n 4000
% MAGMA 2.6.0 svn 32-bit magma_int_t, 64-bit pointer.
% HIP runtime 3275, driver 327500. OpenMP threads 72.
% device 0: Device 738c, 1502.0 MHz clock, 32752.0 MiB memory, capability 9.0
% Fri Jul 2 21:08:33 2021
% Usage: ./testing/testing_cheevd_gpu [options] [-h|--help]

% jobz = Vectors needed, uplo = Upper, version = 1 (cheevd_gpu)
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
 4000        ---             13.6010          ---        1.93e-06    6.61e-08   failed

After changing magma_cher2k() to call hipblasCher2k(), the same cases pass:

(base) root@01821ad0c545:~/magma# ./testing/testing_cheevd_gpu --version 1 -L -JV -c -n 3000 -n 4000
% MAGMA 2.6.0 svn 32-bit magma_int_t, 64-bit pointer.
% HIP runtime 402, driver 330500. OpenMP threads 72.
% device 0: Device 738c, 1502.0 MHz clock, 32752.0 MiB memory, capability 9.0
% Fri Jul 2 20:47:29 2021
% Usage: ./testing/testing_cheevd_gpu [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, version = 1 (cheevd_gpu)
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
 3000        ---             13.4700          ---        1.78e-06    6.12e-08   ok
 4000        ---              1.6325          ---        1.17e-06    6.66e-08   ok

(base) root@01821ad0c545:~/magma# ./testing/testing_cheevd_gpu --version 1 -U -JV -c -n 4000
% MAGMA 2.6.0 svn 32-bit magma_int_t, 64-bit pointer.
% HIP runtime 402, driver 330500. OpenMP threads 72.
% device 0: Device 738c, 1502.0 MHz clock, 32752.0 MiB memory, capability 9.0
% Fri Jul 2 20:49:09 2021
% Usage: ./testing/testing_cheevd_gpu [options] [-h|--help]

% jobz = Vectors needed, uplo = Upper, version = 1 (cheevd_gpu)
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
 4000        ---             15.1728          ---        1.07e-06    6.69e-08   ok

Comments (6)

  1. Stanimire Tomov

    Thanks for this report!

    I just made the update including a few other BLAS functions that are now in hipBLAS.

    Regarding the cheevd_gpu tests, we have an error tolerance set that sometimes gives innocuous failures. This looks like one of those cases, as the reported errors are almost the same. The difference must be due to roundoff errors.
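
The flip between "failed" and "ok" can be reproduced from the errors in the logs of the report. This sketch assumes the tester marks a case as failed when any reported error is not below `tolerance * eps`, with a tolerance factor of 30 and a single-precision eps of 2^-24. Both values are assumptions about the tester's defaults and are not stated in this thread, and the names below are hypothetical. Under those assumptions the threshold is about 1.79e-06, which falls between the failing and passing |A-USU^H| values.

```python
EPS_SINGLE = 2.0 ** -24   # assumed single-precision eps, about 5.96e-08
TOLERANCE_FACTOR = 30     # assumed default tolerance multiplier


def threshold(factor=TOLERANCE_FACTOR, eps=EPS_SINGLE):
    """Largest error that still counts as a pass (exclusive bound)."""
    return factor * eps


def status(errors, factor=TOLERANCE_FACTOR, eps=EPS_SINGLE):
    """Return "ok" if every error is below the threshold, else "failed"."""
    tol = threshold(factor, eps)
    return "ok" if all(err < tol for err in errors) else "failed"


# (|A-USU^H|, |I-U^H U|) pairs copied from the logs in the report.
MAGMABLAS_RUNS = [(2.90e-06, 6.43e-08), (1.97e-06, 6.85e-08), (1.93e-06, 6.61e-08)]
HIPBLAS_RUNS = [(1.78e-06, 6.12e-08), (1.17e-06, 6.66e-08), (1.07e-06, 6.69e-08)]

for label, runs in (("magmablas", MAGMABLAS_RUNS), ("hipblas", HIPBLAS_RUNS)):
    for errors in runs:
        print(label, errors, status(errors))
```

If the assumed defaults are right, the failing and passing runs differ by well under a factor of two in |A-USU^H| and simply land on opposite sides of a fixed cutoff.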

  2. Luise Chen reporter

    Thanks for your feedback.

    I found that hipblasZtrmm is still disabled as of commit 749d0411f90e8d7f446fae3cd9059bf81735b46e.

    We are driving the enablement of hipBLAS in MAGMA, so I would like to know what the criteria are for enabling this function.

    Please let me know if you could share the test set that must pass before hipblasZtrmm is enabled.

    Thanks

  3. Stanimire Tomov

    Yes, this is the last one left, and I was going to ask whether you know of any pending optimizations in hipBLAS. The only reason not to define it as magma_ztrmm for now is that its performance is low. In general, we define magma_[blas] as whichever implementation is fastest in the majority of cases. Later, if a faster version exists for a specific use (block sizes, etc.), we redefine it locally in the relevant file.

    This is what I get currently on one of our systems in terms of performance.

    -bash-4.2$ ./testing_dtrmm
    % MAGMA 2.6.0 svn 32-bit magma_int_t, 64-bit pointer.
    % HIP runtime 3212, driver 321200. OpenMP threads 32. 
    % device 0: Vega 20 [Radeon VII], 1801.0 MHz clock, 16368.0 MiB memory, capability 9.0
    % device 1: Vega 20 [Radeon VII], 1801.0 MHz clock, 16368.0 MiB memory, capability 9.0
    % device 2: Vega 20 [Radeon VII], 1801.0 MHz clock, 16368.0 MiB memory, capability 9.0
    % device 3: Vega 20 [Radeon VII], 1801.0 MHz clock, 16368.0 MiB memory, capability 9.0
    % Tue Jul  6 11:06:02 2021
    % Usage: ./testing_dtrmm [options] [-h|--help]
    
    % If running lapack (option --lapack), MAGMA and HIP errors are both computed
    % relative to CPU BLAS result. Else, MAGMA error is computed relative to HIP result.
    
    % side = Left, uplo = Lower, transA = No transpose, diag = Non-unit 
    %   M     N   MAGMA Gflop/s (ms)  HIP Gflop/s (ms)  CPU Gflop/s (ms)   MAGMA error   HIP error
    %===================================================================================================
     1088  1088      0.32 (4036.23)      4.61 ( 279.21)     ---   (  ---  )    9.57e-20         ---      ok
     2112  2112   1378.49 (   6.83)      9.62 ( 979.71)     ---   (  ---  )    6.08e-20         ---      ok
     3136  3136   1798.19 (  17.15)     14.43 (2137.59)     ---   (  ---  )    4.83e-20         ---      ok
     4160  4160   2312.08 (  31.14)     19.22 (3744.74)     ---   (  ---  )    3.05e-20         ---      ok
     5184  5184   2724.87 (  51.13)     24.12 (5775.80)     ---   (  ---  )    2.79e-20         ---      ok
     6208  6208   2484.00 (  96.32)     28.94 (8268.41)     ---   (  ---  )    2.49e-20         ---      ok
    

    Thanks,

    Stan
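
To make the gap in the table concrete, the Gflop/s columns can be approximately recomputed from the printed timings. This sketch assumes the usual operation count for a left-sided triangular matrix multiply in real arithmetic, M*M*N floating-point operations. The function names are hypothetical, and the results differ slightly from the table because the printed times are rounded.

```python
def trmm_left_flops(m, n):
    """Assumed flop count for a real, left-sided TRMM: m*m*n."""
    return float(m) * m * n


def gflops(m, n, milliseconds):
    """Convert a timing in milliseconds to Gflop/s for an m-by-n TRMM."""
    seconds = milliseconds * 1e-3
    return trmm_left_flops(m, n) / seconds / 1e9


def speedup(fast_ms, slow_ms):
    """How many times faster the first timing is than the second."""
    return slow_ms / fast_ms


# (M = N, MAGMA time in ms, HIP time in ms), copied from the table.
ROWS = [(2112, 6.83, 979.71), (4160, 31.14, 3744.74), (6208, 96.32, 8268.41)]

for size, magma_ms, hip_ms in ROWS:
    print(size,
          round(gflops(size, size, magma_ms), 1),
          round(gflops(size, size, hip_ms), 2),
          round(speedup(magma_ms, hip_ms), 1))
```

For the rows listed, the MAGMA kernel comes out roughly two orders of magnitude faster than the HIP path on this system, which is the gap the comment refers to.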

  4. Luise Chen reporter

    Currently, I see that Xtrmm performance on the hipBLAS path is competitive in most of the major cases, as shown in the following summary on MI100.

    Could you tell us the performance targets for enabling the hipBLAS path for Xtrmm?

  5. Ahmad Abdelfattah

    I see this issue has been inactive for quite some time. The TRMM routines from MAGMA and hipBLAS now perform similarly, and both are available in MAGMA. I will mark this as resolved.
