When using clang with AVX enabled, TestBench fails

Issue #422 closed

Former user created an issue 2018-07-16

I'm trying to run TestBench on FreeBSD machine. The system compiler is clang-6.0.1 and the CPUs are Ivy Bridge Xeons.

When x265 is compiled with -march=ivybridge, TestBench fails:

Using random seed 5B4D0F8C 12bit
Testing primitives: SSE2
Testing primitives: SSE3
Testing primitives: SSSE3
cuTreeFix8Pack failed

x265: asm primitive has failed. Go and fix that Right Now!
*** Error code 255

Using an arch-string without AVX (such as -march=core2), or explicitly adding -mno-avx allows the test-run to succeed. gcc8 works too -- with better timings (using -O3 in all cases):

gcc8 -march=ivybridge:

7.500u 0.118s 0:07.64 99.6%     6574+7681k 0+0io 0pf+0w
7.500u 0.118s 0:07.64 99.6%     6574+7681k 0+0io 0pf+0w

clang-6 -march=ivybridge -mno-avx

7.610u 0.126s 0:07.78 99.3%     7078+7673k 0+0io 0pf+0w
7.643u 0.094s 0:07.79 99.2%     6955+7713k 0+0io 0pf+0w

clang-6 without -march

8.346u 0.157s 0:08.57 99.0%     7577+7727k 0+0io 0pf+0w
8.402u 0.117s 0:08.58 99.1%     7383+7752k 0+0io 0pf+0w

Comments (11)

M CHEN
It looks a float operator optimize mismatch, because AVX is SIMD on float-point field. I will take a look more late.
- 2018-07-16T22:49:53+00:00

M CHEN

I explore more about this topic, let's see different in these compiler option.

-march=ivygridge -mno-avx
    movupd  xmm2, xmmword ptr [r14]          ; load src[]
    mulpd   xmm2, xmm0                       ; *= 256.0
    movapd  xmm3, xmm2
    subsd   xmm3, xmm1                        ; -= double 9.2233720368547758E+18. (*Note 1*)
    cvttsd2si   rbx, xmm3
    xor rbx, r10                               ; -= -9223372036854775808
    cvttsd2si   rcx, xmm4
    ucomisd xmm4, xmm1
    cmovae  rcx, rbx
    movq    xmm3, rcx

-march=ivygridge
    vmovups xmm5, xmmword ptr [rcx]
    vinsertf128 ymm5, ymm5, xmmword ptr [rcx + 16], 1
    vmulpd  ymm5, ymm5, ymm0                  ; *= 256.0
    vcvttpd2dq  xmm5, ymm5
    vpshufb xmm2, xmm2, xmm1
    vmovq   qword ptr [rax - 24], xmm2

Ah, we can found the root cause is VCVTPD2DQ, in the Intel documents, they said,

"When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned." ```

As you see, this is Signed convert, but x265 want to get a Unsigned output. I guess that's a compiler bugs.

Note 1: What's that compiler magic number? We can be get from https://stackoverflow.com/questions/13734191/are-there-unsigned-equivalents-of-the-x87-fild-and-sse-cvtsi2sd-instructions

2018-07-18T02:45:33+00:00

Dimitry Andric
Can we come to some sort of simplified test case, which we can submit upstream at llvm.org's bug tracker?

For example, just the function cuTreeFix8Pack, compiled with -O3 -mavx should already show the problem, I think.
- 2018-07-25T16:06:52+00:00
Dimitry Andric
Ok, after talking to some people on the LLVM IRC channel, it seems that the line:
```
dst[i] = (uint16_t)(src[i] * 256.0);
```
can lead to undefined behavior, if src[i] * 256.0 does not fit into a uint16_t. I added a runtime check for this, and indeed the very first src[0] value is already a negative double value, which won't fit in uint16_t.

Apparently, in clang 7.0 there is going to be a new flag -fno-strict-float-cast-overflow, which attempts to simulate the native (x86) behavior, see https://clang.llvm.org/docs/ReleaseNotes.html#new-compiler-flags

The following patch avoids the undefined behavior, and seems to work for me, at least:
```
--- a/common/pixel.cpp
+++ b/common/pixel.cpp
@@ -922,7 +922,7 @@ static void estimateCUPropagateCost(int* dst, const ui
 static void cuTreeFix8Pack(uint16_t *dst, double *src, int count)
 {
     for (int i = 0; i < count; i++)
-        dst[i] = (uint16_t)(src[i] * 256.0);
+        dst[i] = (uint16_t)(int16_t)(src[i] * 256.0);
 }

 static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count)
```
- 2018-07-25T19:26:46+00:00
Dimitry Andric
Any issue with getting this committed?
- 2018-08-04T14:23:06+00:00
Ashok Kumar Mishra
Thank you. I am requesting you to send a patch to public ML. so that we can commit this patch immediately.
- 2018-08-14T08:06:52+00:00
Dimitry Andric
Hi Ashok, I tried posting to x265-devel@videolan.org, but I don't see the message appearing. I did get a reply that it would be moderated, though, how long does that normally take?
- 2018-08-14T21:00:44+00:00
Ashok Kumar Mishra
Couldn't reply because yesterday was holiday for us because of Independence day. I have seen your patch and push it today itself.
- 2018-08-16T10:13:30+00:00
Ashok Kumar Mishra
Dimitry, I couldn't apply your patch, so Praveen sent the patch for your fix mentioning your name. Thanks.
- 2018-08-16T13:11:42+00:00
Ashok Kumar Mishra
- changed status to closed
- 2018-08-16T13:11:55+00:00
Dimitry Andric
Ashok, thanks for taking care of this.
- 2018-08-16T20:17:49+00:00
Log in to comment

Assignee: –

Type: bug

Priority: major

Status: closed

Component: –

Version: –

Votes: 0

Watchers: 3