When using clang with AVX enabled, TestBench fails

Issue #422 closed
Former user created an issue

I'm trying to run TestBench on FreeBSD machine. The system compiler is clang-6.0.1 and the CPUs are Ivy Bridge Xeons.

When x265 is compiled with -march=ivybridge, TestBench fails:

Using random seed 5B4D0F8C 12bit
Testing primitives: SSE2
Testing primitives: SSE3
Testing primitives: SSSE3
cuTreeFix8Pack failed

x265: asm primitive has failed. Go and fix that Right Now!
*** Error code 255

Using an arch-string without AVX (such as -march=core2), or explicitly adding -mno-avx allows the test-run to succeed. gcc8 works too -- with better timings (using -O3 in all cases):

  • gcc8 -march=ivybridge:
7.500u 0.118s 0:07.64 99.6%     6574+7681k 0+0io 0pf+0w
7.500u 0.118s 0:07.64 99.6%     6574+7681k 0+0io 0pf+0w
  • clang-6 -march=ivybridge -mno-avx
7.610u 0.126s 0:07.78 99.3%     7078+7673k 0+0io 0pf+0w
7.643u 0.094s 0:07.79 99.2%     6955+7713k 0+0io 0pf+0w
  • clang-6 without -march
8.346u 0.157s 0:08.57 99.0%     7577+7727k 0+0io 0pf+0w
8.402u 0.117s 0:08.58 99.1%     7383+7752k 0+0io 0pf+0w

Comments (11)

  1. M CHEN

    It looks a float operator optimize mismatch, because AVX is SIMD on float-point field. I will take a look more late.

  2. M CHEN

    I explore more about this topic, let's see different in these compiler option.

    -march=ivygridge -mno-avx
        movupd  xmm2, xmmword ptr [r14]          ; load src[]
        mulpd   xmm2, xmm0                       ; *= 256.0
        movapd  xmm3, xmm2
        subsd   xmm3, xmm1                        ; -= double 9.2233720368547758E+18. (*Note 1*)
        cvttsd2si   rbx, xmm3
        xor rbx, r10                               ; -= -9223372036854775808
        cvttsd2si   rcx, xmm4
        ucomisd xmm4, xmm1
        cmovae  rcx, rbx
        movq    xmm3, rcx
    
    -march=ivygridge
        vmovups xmm5, xmmword ptr [rcx]
        vinsertf128 ymm5, ymm5, xmmword ptr [rcx + 16], 1
        vmulpd  ymm5, ymm5, ymm0                  ; *= 256.0
        vcvttpd2dq  xmm5, ymm5
        vpshufb xmm2, xmm2, xmm1
        vmovq   qword ptr [rax - 24], xmm2
    

    Ah, we can found the root cause is VCVTPD2DQ, in the Intel documents, they said,

    "When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned." ```

    As you see, this is Signed convert, but x265 want to get a Unsigned output. I guess that's a compiler bugs.

    Note 1: What's that compiler magic number? We can be get from https://stackoverflow.com/questions/13734191/are-there-unsigned-equivalents-of-the-x87-fild-and-sse-cvtsi2sd-instructions

  3. Dimitry Andric

    Can we come to some sort of simplified test case, which we can submit upstream at llvm.org's bug tracker?

    For example, just the function cuTreeFix8Pack, compiled with -O3 -mavx should already show the problem, I think.

  4. Dimitry Andric

    Ok, after talking to some people on the LLVM IRC channel, it seems that the line:

    dst[i] = (uint16_t)(src[i] * 256.0);
    

    can lead to undefined behavior, if src[i] * 256.0 does not fit into a uint16_t. I added a runtime check for this, and indeed the very first src[0] value is already a negative double value, which won't fit in uint16_t.

    Apparently, in clang 7.0 there is going to be a new flag -fno-strict-float-cast-overflow, which attempts to simulate the native (x86) behavior, see https://clang.llvm.org/docs/ReleaseNotes.html#new-compiler-flags

    The following patch avoids the undefined behavior, and seems to work for me, at least:

    --- a/common/pixel.cpp
    +++ b/common/pixel.cpp
    @@ -922,7 +922,7 @@ static void estimateCUPropagateCost(int* dst, const ui
     static void cuTreeFix8Pack(uint16_t *dst, double *src, int count)
     {
         for (int i = 0; i < count; i++)
    -        dst[i] = (uint16_t)(src[i] * 256.0);
    +        dst[i] = (uint16_t)(int16_t)(src[i] * 256.0);
     }
    
     static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count)
    
  5. Ashok Kumar Mishra

    Thank you. I am requesting you to send a patch to public ML. so that we can commit this patch immediately.

  6. Dimitry Andric

    Hi Ashok, I tried posting to x265-devel@videolan.org, but I don't see the message appearing. I did get a reply that it would be moderated, though, how long does that normally take?

  7. Ashok Kumar Mishra

    Couldn't reply because yesterday was holiday for us because of Independence day. I have seen your patch and push it today itself.

  8. Ashok Kumar Mishra

    Dimitry, I couldn't apply your patch, so Praveen sent the patch for your fix mentioning your name. Thanks.

  9. Log in to comment