When using clang with AVX enabled, TestBench fails
I'm trying to run TestBench
on FreeBSD machine. The system compiler is clang-6.0.1 and the CPUs are Ivy Bridge Xeons.
When x265 is compiled with -march=ivybridge
, TestBench
fails:
Using random seed 5B4D0F8C 12bit
Testing primitives: SSE2
Testing primitives: SSE3
Testing primitives: SSSE3
cuTreeFix8Pack failed
x265: asm primitive has failed. Go and fix that Right Now!
*** Error code 255
Using an arch-string without AVX (such as -march=core2
), or explicitly adding -mno-avx
allows the test-run to succeed. gcc8 works too -- with better timings (using -O3
in all cases):
- gcc8
-march=ivybridge
:
7.500u 0.118s 0:07.64 99.6% 6574+7681k 0+0io 0pf+0w
7.500u 0.118s 0:07.64 99.6% 6574+7681k 0+0io 0pf+0w
- clang-6
-march=ivybridge -mno-avx
7.610u 0.126s 0:07.78 99.3% 7078+7673k 0+0io 0pf+0w
7.643u 0.094s 0:07.79 99.2% 6955+7713k 0+0io 0pf+0w
- clang-6 without
-march
8.346u 0.157s 0:08.57 99.0% 7577+7727k 0+0io 0pf+0w
8.402u 0.117s 0:08.58 99.1% 7383+7752k 0+0io 0pf+0w
Comments (11)
-
-
I explore more about this topic, let's see different in these compiler option.
-march=ivygridge -mno-avx movupd xmm2, xmmword ptr [r14] ; load src[] mulpd xmm2, xmm0 ; *= 256.0 movapd xmm3, xmm2 subsd xmm3, xmm1 ; -= double 9.2233720368547758E+18. (*Note 1*) cvttsd2si rbx, xmm3 xor rbx, r10 ; -= -9223372036854775808 cvttsd2si rcx, xmm4 ucomisd xmm4, xmm1 cmovae rcx, rbx movq xmm3, rcx -march=ivygridge vmovups xmm5, xmmword ptr [rcx] vinsertf128 ymm5, ymm5, xmmword ptr [rcx + 16], 1 vmulpd ymm5, ymm5, ymm0 ; *= 256.0 vcvttpd2dq xmm5, ymm5 vpshufb xmm2, xmm2, xmm1 vmovq qword ptr [rax - 24], xmm2
Ah, we can found the root cause is VCVTPD2DQ, in the Intel documents, they said,
"When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned." ```
As you see, this is Signed convert, but x265 want to get a Unsigned output. I guess that's a compiler bugs.
Note 1: What's that compiler magic number? We can be get from https://stackoverflow.com/questions/13734191/are-there-unsigned-equivalents-of-the-x87-fild-and-sse-cvtsi2sd-instructions
-
Can we come to some sort of simplified test case, which we can submit upstream at llvm.org's bug tracker?
For example, just the function cuTreeFix8Pack, compiled with -O3 -mavx should already show the problem, I think.
-
Ok, after talking to some people on the LLVM IRC channel, it seems that the line:
dst[i] = (uint16_t)(src[i] * 256.0);
can lead to undefined behavior, if
src[i] * 256.0
does not fit into auint16_t
. I added a runtime check for this, and indeed the very firstsrc[0]
value is already a negative double value, which won't fit inuint16_t
.Apparently, in clang 7.0 there is going to be a new flag
-fno-strict-float-cast-overflow
, which attempts to simulate the native (x86) behavior, see https://clang.llvm.org/docs/ReleaseNotes.html#new-compiler-flagsThe following patch avoids the undefined behavior, and seems to work for me, at least:
--- a/common/pixel.cpp +++ b/common/pixel.cpp @@ -922,7 +922,7 @@ static void estimateCUPropagateCost(int* dst, const ui static void cuTreeFix8Pack(uint16_t *dst, double *src, int count) { for (int i = 0; i < count; i++) - dst[i] = (uint16_t)(src[i] * 256.0); + dst[i] = (uint16_t)(int16_t)(src[i] * 256.0); } static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count)
-
Any issue with getting this committed?
-
Thank you. I am requesting you to send a patch to public ML. so that we can commit this patch immediately.
-
Hi Ashok, I tried posting to x265-devel@videolan.org, but I don't see the message appearing. I did get a reply that it would be moderated, though, how long does that normally take?
-
Couldn't reply because yesterday was holiday for us because of Independence day. I have seen your patch and push it today itself.
-
Dimitry, I couldn't apply your patch, so Praveen sent the patch for your fix mentioning your name. Thanks.
-
- changed status to closed
-
Ashok, thanks for taking care of this.
- Log in to comment
It looks a float operator optimize mismatch, because AVX is SIMD on float-point field. I will take a look more late.