SEGFAULT in matrix-matrix assignment due to unaligned SIMD instruction operand under specific conditions
The code reproducing the issue:
#include <blaze/Math.h>
#include <iostream>
#include <memory>  // std::make_unique

int main(int, char **)
{
    using namespace blaze;
    using Real = double;
    size_t constexpr NX = 8;
    size_t constexpr NU = 1;
    std::cout << "SIMDSIZE=" << SIMDTrait<Real>::size << std::endl;

    // Heap-allocated symmetric source matrix and a view of its lower-right NX x NX block
    auto H = std::make_unique<SymmetricMatrix<StaticMatrix<Real, NU + NX, NU + NX, columnMajor>>>();
    auto Q = submatrix<NU, NU, NX, NX>(*H);
    std::cout << "data(H)=" << data(*H) << std::endl;
    std::cout << "data(Q)=" << data(Q) << std::endl;

    // Heap-allocated destination matrix and the corresponding view
    auto LL = std::make_unique<StaticMatrix<Real, NX + NU, NX + NU, columnMajor>>();
    auto Lcal = submatrix<NU, NU, NX, NX>(*LL);
    std::cout << "data(LL)=" << data(*LL) << std::endl;
    std::cout << "data(Lcal)=" << data(Lcal) << std::endl;

    // The assignment that triggers the segmentation fault
    Lcal = Q;

    std::cout << "H=\n" << *H << std::endl;
    std::cout << "LL=\n" << *LL << std::endl;
    return 0;
}
Compiler command line:
g++ -std=c++17 -O2 -g -DNDEBUG -march=skylake matrix_assign.cpp
Compiler version: g++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0
Program output:
SIMDSIZE=4
data(H)=0
data(Q)=0
data(LL)=0x557a4727e660
data(Lcal)=0x557a4727e6c8
Segmentation fault (core dumped)
Stack trace:
#0 blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::assign<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul> > (rhs=..., this=<optimized out>) at /usr/local/include/blaze/math/views/submatrix/Dense.h:5907
#1 blaze::assign_backend<blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> (rhs=..., lhs=...) at /usr/local/include/blaze/math/expressions/Matrix.h:1016
#2 blaze::assign<blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true, blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> (rhs=..., lhs=...) at /usr/local/include/blaze/math/expressions/Matrix.h:1101
#3 blaze::smpAssign<blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true, blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> (rhs=..., lhs=...) at /usr/local/include/blaze/math/smp/default/DenseMatrix.h:108
#4 blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> (this=0x7fffffffd2a0, rhs=...) at /usr/local/include/blaze/math/views/submatrix/Dense.h:4711
#5 0x00005555555553a1 in main () at /home/kotlyar/projects/tmp/matrix_assign.cpp:28
Disassembly of the failing function:
Dump of assembler code for function blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&):
0x0000555555555800 <+0>: push %rbp
0x0000555555555801 <+1>: mov %rdi,%rax
0x0000555555555804 <+4>: mov $0x68,%r8d
0x000055555555580a <+10>: mov %rsp,%rbp
0x000055555555580d <+13>: and $0xffffffffffffffe0,%rsp
0x0000555555555811 <+17>: sub $0x220,%rsp
0x0000555555555818 <+24>: mov (%rax),%rcx
0x000055555555581b <+27>: mov (%rsi),%rdx
0x000055555555581e <+30>: mov %fs:0x28,%rdi
0x0000555555555827 <+39>: mov %rdi,0x218(%rsp)
0x000055555555582f <+47>: xor %edi,%edi
0x0000555555555831 <+49>: cmp %rdx,%rcx
0x0000555555555834 <+52>: jne 0x555555555843 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+67>
0x0000555555555836 <+54>: jmp 0x555555555888 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+136>
0x0000555555555838 <+56>: nopl 0x0(%rax,%rax,1)
0x0000555555555840 <+64>: mov (%rsi),%rdx
0x0000555555555843 <+67>: add %r8,%rdx
0x0000555555555846 <+70>: vmovupd (%rdx),%ymm4
0x000055555555584a <+74>: vmovupd %ymm4,(%rcx,%r8,1)
=> 0x0000555555555850 <+80>: vmovapd 0x20(%rdx),%ymm5
0x0000555555555855 <+85>: vmovupd %ymm5,0x20(%rcx,%r8,1)
0x000055555555585c <+92>: add $0x60,%r8
0x0000555555555860 <+96>: cmp $0x368,%r8
0x0000555555555867 <+103>: jne 0x555555555840 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+64>
0x0000555555555869 <+105>: mov 0x218(%rsp),%rdi
0x0000555555555871 <+113>: xor %fs:0x28,%rdi
0x000055555555587a <+122>: jne 0x5555555558ee <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+238>
0x000055555555587c <+124>: vzeroupper
0x000055555555587f <+127>: leaveq
0x0000555555555880 <+128>: retq
0x0000555555555881 <+129>: nopl 0x0(%rax)
0x0000555555555888 <+136>: lea 0x68(%rcx),%rdx
0x000055555555588c <+140>: mov %rsp,%r8
0x000055555555588f <+143>: mov %rdx,%rsi
0x0000555555555892 <+146>: add $0x368,%rcx
0x0000555555555899 <+153>: mov %r8,%r9
0x000055555555589c <+156>: nopl 0x0(%rax)
0x00005555555558a0 <+160>: vmovupd (%rsi),%ymm0
0x00005555555558a4 <+164>: vmovupd 0x20(%rsi),%ymm1
0x00005555555558a9 <+169>: add $0x60,%rsi
0x00005555555558ad <+173>: vmovapd %ymm0,(%r9)
0x00005555555558b2 <+178>: vmovapd %ymm1,0x20(%r9)
0x00005555555558b8 <+184>: add $0x40,%r9
0x00005555555558bc <+188>: cmp %rcx,%rsi
0x00005555555558bf <+191>: jne 0x5555555558a0 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+160>
0x00005555555558c1 <+193>: nopl 0x0(%rax)
0x00005555555558c8 <+200>: vmovapd (%r8),%ymm2
0x00005555555558cd <+205>: vmovapd 0x20(%r8),%ymm3
0x00005555555558d3 <+211>: vmovupd %ymm2,(%rdx)
0x00005555555558d7 <+215>: vmovupd %ymm3,0x20(%rdx)
0x00005555555558dc <+220>: add $0x60,%rdx
0x00005555555558e0 <+224>: add $0x40,%r8
0x00005555555558e4 <+228>: cmp %rdx,%rcx
0x00005555555558e7 <+231>: jne 0x5555555558c8 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+200>
0x00005555555558e9 <+233>: jmpq 0x555555555869 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+105>
0x00005555555558ee <+238>: vzeroupper
0x00005555555558f1 <+241>: callq 0x5555555550e0 <__stack_chk_fail@plt>
End of assembler dump.
Registers:
rax 0x7fffffffd2a0 140737488343712
rbx 0x55555556c660 93824992331360
rcx 0x55555556c660 93824992331360
rdx 0x55555556c2e8 93824992330472
rsi 0x7fffffffd290 140737488343696
rdi 0x0 0
rbp 0x7fffffffd260 0x7fffffffd260
rsp 0x7fffffffd040 0x7fffffffd040
r8 0x68 104
r9 0x7ffff7a21740 140737347983168
r10 0xa 10
r11 0x246 582
r12 0x55555556c5e0 93824992331232
r13 0x7fffffffd3d0 140737488344016
r14 0x0 0
r15 0x0 0
rip 0x555555555850 0x555555555850 <blaze::Submatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>::operator=<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true>(blaze::Matrix<blaze::Submatrix<blaze::SymmetricMatrix<blaze::StaticMatrix<double, 9ul, 9ul, true>, true, true, true>, (blaze::AlignmentFlag)0, true, true, 1ul, 1ul, 8ul, 8ul>, true> const&)+80>
eflags 0x10206 [ PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
The vmovapd instruction at 0x0000555555555850 expects a 0x20-byte-aligned argument, whereas rdx is 0x55555556c2e8, and hence 0x20(%rdx) is not 0x20-byte aligned. This results in a SEGFAULT.
The issue is very specific. It does not reproduce if one of the following is true:
- The compiler is changed to clang version 8.0.0-3
- The -march option is removed or changed to -mavx or -mavx2
- The optimization option is changed to -O1 or -O0
- One of the matrices H, LL is auto-allocated instead of heap-allocated
- The H matrix is just a StaticMatrix<Real, NU + NX, NU + NX, columnMajor> instead of SymmetricMatrix<StaticMatrix<Real, NU + NX, NU + NX, columnMajor>>
- The NX constant is set to 4, 5, 6, or 7
It is still reproduced if -march=skylake is changed to -march=haswell.
I am not sure whether it is a compiler bug or not.
Comments (12)
-
reporter -
- changed status to wontfix
Hi Misha!
First of all thanks a lot for providing so much valuable feedback about Blaze. The provided info (minimum working example, stack trace, etc.) is exemplary!
In this particular case the problem is caused by using new for an overaligned data structure. When using std::make_unique, the underlying new (usually) allocates 16 byte aligned memory. However, since you explicitly compile for the Skylake architecture, AVX is enabled, which requires 32 byte aligned memory. This memory requirement is used by StaticMatrix (see line 544 in <blaze/math/dense/StaticMatrix.h>), which unfortunately can cause problems since the underlying dynamic memory might be misaligned (16 byte instead of 32 byte). Hopefully this explains the problem well enough. Hopefully this also explains why the problem is so difficult to reproduce, since in every run you might get different memory.
Still, thanks again for the valuable feedback, this is very much appreciated.
Best regards,
Klaus!
-
reporter Hello Klaus,
the alignment does not seem to be the problem. Please look at the following code, which is a modification of the previous example:
#include <blaze/Math.h>
#include <iostream>
#include <memory>  // std::make_unique
#include <string>

using Real = double;
blaze::size_t constexpr NX = 8;
blaze::size_t constexpr NU = 1;
using HType = blaze::SymmetricMatrix<blaze::StaticMatrix<Real, NU + NX, NU + NX, blaze::columnMajor>>;
using LLType = blaze::StaticMatrix<Real, NX + NU, NX + NU, blaze::columnMajor>;

// Performs the failing submatrix assignment on the given pair of matrices
void f(HType& H, LLType& LL)
{
    randomize(H);
    auto Q = blaze::submatrix<NU, NU, NX, NX>(H);
    auto Lcal = blaze::submatrix<NU, NU, NX, NX>(LL);
    decltype(Q)::Iterator Q_begin(Q.begin(0));
    std::cout << "Q_begin.isAligned() = " << Q_begin.isAligned() << std::endl;
    Lcal = Q;
}

// Prints the size, the object address and the data address of a matrix
template <typename MT, bool SO>
void printInfo(std::ostream& os, std::string const& name, blaze::Matrix<MT, SO> const& m)
{
    os << name << ": size=" << sizeof(~m) << ", addr=" << &(~m) << ", data=" << (~m).data() << std::endl;
}

int main(int, char **)
{
    std::cout << "SIMDSIZE=" << blaze::SIMDTrait<Real>::size << std::endl;

    HType H_stack;
    LLType LL_stack;
    auto H_heap = std::make_unique<HType>();
    auto LL_heap = std::make_unique<LLType>();

    printInfo(std::cout, "H_stack", H_stack);
    printInfo(std::cout, "LL_stack", LL_stack);
    printInfo(std::cout, "H_heap", *H_heap);
    printInfo(std::cout, "LL_heap", *LL_heap);

    std::cout << "Using stack" << std::endl;
    f(H_stack, LL_stack);
    std::cout << "Ok!" << std::endl;

    std::cout << "Using heap" << std::endl;
    f(*H_heap, *LL_heap);
    std::cout << "Ok!" << std::endl;

    return 0;
}
Program output:
SIMDSIZE=4
H_stack: size=864, addr=0x7ffc12450f00, data=0x7ffc12450f00
LL_stack: size=864, addr=0x7ffc12451260, data=0x7ffc12451260
H_heap: size=864, addr=0x55809f4c1280, data=0x55809f4c1280
LL_heap: size=864, addr=0x55809f4c1660, data=0x55809f4c1660
Using stack
Q_begin.isAligned() = 0
Segmentation fault (core dumped)
As you can see, both the stack- and the heap-allocated matrices are aligned on a 0x20 boundary. Furthermore, the SEGFAULT now happens with the stack-allocated matrix.
-
reporter - changed status to new
The matrices are aligned on a 0x20 boundary. Furthermore, the issue is reproducible with stack-allocated matrices.
-
Hi Misha!
I have continued to analyse the issue, but unfortunately so far haven’t been able to reproduce the segmentation fault despite the attempt to provide exactly the same settings:
- I used exactly the same compiler(s) as you (g++-mp-8 (MacPorts gcc8 8.3.0_4) 8.3.0);
- I used exactly the same compiler flags as you (g++ -std=c++17 -O2 -g -DNDEBUG -march=skylake ...);
- I used both of your code examples.
I have (re-)analyzed all functions that are involved in the matrix assignment and additionally added output statements to the loada() function for double precision values (see <blaze/math/simd/Loada.h>, line 432) to prove that no aligned load is explicitly triggered which could cause the problematic vmovapd instruction. So far I have not found anything that could cause the problem, everything seems to work as expected.
In summary, either the problem is very well hidden and only surfaces under very specific conditions, or it is indeed a compiler issue. Since there is no way for me to prove that the error is not in Blaze I hope that you have the time to dig a little deeper yourself. Since you already have some experience with Blaze code it should be possible for you to modify some functions to see if there is any effect. For instance, you could explicitly disable aligned loads in the submatrix iterators by setting the isAligned flag to false (see <blaze/math/views/submatrix/Dense.h>, line 3509). Alternatively you can simplify the load() and store() functions in the same nested class (see line 3621 and line 3670) to directly call loadu() and storeu(), respectively (these are the functions that are called from the assignment kernel in line 5907).
Thanks a lot for your help, I hope you find something.
Best regards,
Klaus!
-
reporter - attached issue-290.zip
The source, the Makefile and the output files
-
reporter Hello Klaus,
It seems that the reason why the issue is not reproduced on your system is that despite the -march=skylake option your compiler tunes the instruction scheduling not for Skylake, but for the CPU architecture it runs on: https://lemire.me/blog/2018/07/25/it-is-more-complicated-than-i-thought-mtune-march-in-gcc/. This can be controlled with the -mtune option. I found out that if I add -mtune=generic, then the issue is not reproduced. So it seems necessary to add -mtune=skylake to reproduce it on a machine with a different CPU. Could you please try it?
I have provided the archive with the source, the Makefile and the output files from my system. Could you also please check what is the output of
g++-8 -march=native -Q --help=target | grep -- '-march=' | cut -f3
?
-
reporter
-mtune option value    reproduced
generic                NO
sandybridge            NO
ivybridge              NO
haswell                YES
broadwell              YES
skylake                YES
cannonlake             YES
-
reporter - attached gdb-haswell.s
- attached gdb-ivybridge.s
- attached gdb-generic.s
Disassembly of the generated code for different -march options
-
reporter The code generated for ivybridge is quite close to the one for generic, whereas the difference between haswell and generic is quite big (see above).
-
- changed status to wontfix
Hi Misha!
Thanks a lot for providing a completely runnable example. With this I was indeed able to reproduce the segmentation fault and could start to properly analyze it on my side. Despite several hours of investigation, I did not find any bug in Blaze. On the other hand, I did find several indications that this is indeed a compiler bug:
- The segfault only occurs in GCC, but not in Clang;
- The error only appears with a special "optimization" flag, but not with the usual flags to enable AVX or SSE;
- The Blaze function that would trigger an aligned load of a double value is never called, i.e. there is no explanation for the movapd instruction;
- Commenting out branches that are not taken can make the segfault disappear;
- Slightly rewriting the apparently failing load()/store() functions without changing the meaning makes the segfault disappear:
left.store( right.load() ); // Segfault
auto xmm( right.load() ); std::cout << "Output"; left.store( xmm ); // No segfault
Of course I don't have any proof that this is a compiler error, but since I could not detect anything wrong in Blaze and since the Blazetest has not detected any bug in years, even with random submatrix assignments, I am bound to believe that this is not an issue in Blaze itself.
Still, many thanks to you for being persistent and for investing so much time to provide the information.
Best regards,
Klaus!
-
reporter Hello Klaus,
thank you for looking into this issue once again. I am also of the opinion that it is a compiler bug.
NOT reproducible with g++ 7.4.0 (g++-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0)