Nondeterministic segfault in moab::SparseTag::set_data

Issue #71 resolved
Johannes Probst created an issue

In our application, we occasionally see segmentation faults in the method tag_set_data. We are using MOAB 4.9.2 on Linux, compiled with GCC 4.8 (Ubuntu 14.04 LTS). There are meshes which do not produce the segfault at all and there are very few meshes which produce the segfault nondeterministically (in about 50% of the invocations).

[9589f8e2cfe4:00294] *** Process received signal ***
[9589f8e2cfe4:00294] Signal: Segmentation fault (11)
[9589f8e2cfe4:00294] Signal code: Address not mapped (1)
[9589f8e2cfe4:00294] Failing at address: 0x3dfd038
[9589f8e2cfe4:00294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0) [0x7ff8ed96fcb0]
[9589f8e2cfe4:00294] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x97eee) [0x7ff8ed9d0eee]
[9589f8e2cfe4:00294] [ 2] /opt/moab/lib/x86_64-linux-gnu/libMOAB.so.4(_ZN4moab9SparseTag8set_dataEPNS_15SequenceManagerEPNS_5ErrorEPKmmPKv+0x19f) [0x7ff8eea30cdf]
[9589f8e2cfe4:00294] [ 3] /opt/moab/lib/x86_64-linux-gnu/libMOAB.so.4(_ZN4moab4Core12tag_set_dataEPNS_7TagInfoEPKmiPKv+0x50) [0x7ff8ee8f53c0]
... rest of stack trace is in the application.

Due to our strict data protection regulations I can unfortunately not provide test files or code to reproduce the issue. We are of course willing to provide as much support as we can, if it helps to fix the bug. Please feel free to ask us any questions. We will do our best to provide answers.

Our efforts to find the root cause of the bug have been unsuccessful. We also don't understand which conditions have to be met to reproduce it.

Comments (11)

  1. Vijay M

    @jprobst_simscale Can you configure with --enable-debug so that we can see the line numbers and the call stack for the failures? Also, a description of whether this is a serial or parallel run would help further. If you can also tell us a little about your mesh (type of elements, number of elements, partitioning schemes, etc.), that would be useful too.

    It will be hard to find an exact fix for such a non-deterministic failure without a simplified test case but we will see if something obviously stands out in the code.

  2. Johannes Probst reporter

    @vijaysm Thanks very much for your reply! I will try with --enable-debug and let you know. We are running MOAB in serial. One example of a mesh which triggers this error is purely tetrahedral and first order, with the following element counts:

    • 67568 nodes
    • 26339 edges
    • 132798 triangles
    • 195226 tetrahedrons
    • 14738 element groups (I couldn't quickly figure out how many of each type, but roughly 25 volume groups, 1500 face groups, 4200 edge groups and 9000 node groups)

  3. Iulian Grindeanu

    Also, can you tell us what type of tag it is (integer, double, bit, etc.), on how many entities you apply it, and whether you use the range version or the vector version of setting the tag? Also, how did you obtain the mesh, what format do you use, how do you import/read it in MOAB, and how do you check for correctness (for example, whether you have duplicate edges, triangles, or vertices)? A call stack obtained after compiling with debug would help too, as Vijay suggested. Thanks!

  4. Johannes Probst reporter

    We are invoking the method roughly like this

    std::shared_ptr<moab::Interface> _moabInterface;
    moab::ErrorCode errorCode;
    moab::EntityHandle setHandle;
    moab::Tag tag;
    std::string name;
    [...]
    errorCode = _moabInterface->tag_set_data(tag, &setHandle, 1, name.c_str());
    

    We are using other types of tags, too. The segfault has only been observed for string tags. Meanwhile I'm rebuilding the application with a debug version of MOAB. Is the flag --enable-debug equivalent to -DCMAKE_BUILD_TYPE=Debug with cmake? If not, how can it be done with cmake?

  5. Vijay M

    In the above example, what is your tag type? Dense or sparse? I assume sparse. Also, what is the data type? int or opaque?

    Is the flag --enable-debug equivalent to -DCMAKE_BUILD_TYPE=Debug with cmake? If not, how can it be done with cmake?

    Yes this is correct. We just want to have debug symbols so that the stack trace shows the line numbers etc. You should also be able to run in gdb/ddd to see why the segfault happened.

  6. Iulian Grindeanu

    yes, -DCMAKE_BUILD_TYPE=Debug should do it

    There is no "string" type, you must be using MB_TYPE_OPAQUE tag

    The size of opaque tags is in bytes; how do you create the tag?

    Maybe it is a "locale" issue. Do you use Unicode characters in your names?

    How long are the names? My guess is that there is some overflow, although the name should be truncated if there is not enough space.

    We use the opaque type even for arbitrary structures; it does a byte-by-byte copy.

  7. Johannes Probst reporter

    Thanks very much for your responses! You are absolutely correct, we are using the opaque type. The tag is created with

    constexpr unsigned int BUFFER_SIZE = 128;
    [...]
    errorCode = _moabInterface->tag_get_handle("EXTERNAL_NAME", BUFFER_SIZE, MB_TYPE_OPAQUE, _setExternalNameTag, MB_TAG_SPARSE | MB_TAG_CREAT, 0);
    

    The strings are shorter than 30 characters. They are generated automatically as pure ASCII strings, all characters are within [0-9a-zA-Z_]. The method std::string::c_str adds a \0 terminator at the end.

    As an experiment I have once tried with a dense tag, with similar outcome (non-deterministic segfaults and similar tracebacks).

    Meanwhile I have rebuilt MOAB as Debug version and linked our application against it. Now I don't seem to be able to reproduce the bug. I have run the test 6 times, all of which were successful (one run takes about 5 minutes, so it is quite slow).

  8. Iulian Grindeanu

    Hmm, this will be hard to track down then. My only suggestion is to use a smaller tag size. What MOAB does is copy 128 bytes (it treats the value as an array of bytes and does not look for the terminating \0), and maybe the compiler optimizer tries to do something more that crashes the code.

    Or can you force your string to be 128 characters long, to be sure that nothing after the string can interfere? (Something like name.resize(BUFFER_SIZE)?)
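
To illustrate the byte-wise copy described in this comment, here is a minimal sketch. It does not use MOAB: `copy_fixed` and `demo_copy` are hypothetical stand-ins, and the only assumption taken from the thread is that a fixed-size opaque value is copied as BUFFER_SIZE raw bytes without looking for a terminator. The source buffer here really is BUFFER_SIZE bytes long, so the copy is well defined. A shorter source, such as a 30-character string, would be read past its end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t BUFFER_SIZE = 128;  // tag size in bytes, as in the thread

// Hypothetical stand-in for a fixed-size opaque tag write: copies exactly
// BUFFER_SIZE bytes from src and never looks for a '\0' terminator.
inline void copy_fixed(std::array<char, BUFFER_SIZE>& dst, const void* src) {
    assert(src != nullptr);
    std::memcpy(dst.data(), src, BUFFER_SIZE);
}

// Builds a full-size source holding "abc", its terminator, and then filler
// bytes, and copies it the same way a fixed-size tag value would be copied.
inline std::array<char, BUFFER_SIZE> demo_copy() {
    std::array<char, BUFFER_SIZE> src;
    src.fill('x');                      // bytes that happen to follow the string
    std::memcpy(src.data(), "abc", 4);  // 'a', 'b', 'c', '\0'
    std::array<char, BUFFER_SIZE> dst{};
    copy_fixed(dst, src.data());
    return dst;                         // the filler after '\0' was copied as well
}
```

The bytes after the terminator end up in the destination too, which is why the caller has to own all BUFFER_SIZE bytes it passes in.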

  9. Johannes Probst reporter

    Thanks for the suggestion! I'll try resizing the string, this is a very good suggestion which we haven't tried yet.

    In order to find out if the Debug option on MOAB had an actual effect, I recompiled the whole application again in Release mode and the segfault appeared right away.

  10. Johannes Probst reporter

    Meanwhile I tested 6 runs and the bug didn't show. The only change was to do name.resize(BUFFER_SIZE) before calling tag_set_data, as suggested by @iulian07. We will run this change through our tests and QA to see if it is really fixed. Thanks everyone for your help, despite the very incomplete issue description, it is much appreciated!
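
The workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `padded_tag_value` is a hypothetical helper that is not part of MOAB or of the application, and BUFFER_SIZE is the 128-byte tag size quoted earlier in the thread. It builds a zero-filled buffer of exactly BUFFER_SIZE bytes, so a fixed-size copy never reads beyond memory the caller owns.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr std::size_t BUFFER_SIZE = 128;  // must match the size the tag was created with

// Hypothetical helper: returns a BUFFER_SIZE-byte, zero-filled buffer that
// starts with the characters of name.
inline std::array<char, BUFFER_SIZE> padded_tag_value(const std::string& name) {
    assert(name.size() < BUFFER_SIZE);    // leave room for at least one '\0'
    std::array<char, BUFFER_SIZE> buf{};  // value-initialized: every byte is zero
    std::memcpy(buf.data(), name.data(), name.size());
    return buf;
}
```

The resulting `buf.data()` could then be passed as the data pointer in place of `name.c_str()` in a call like the `tag_set_data` one shown earlier in this thread. Calling `name.resize(BUFFER_SIZE)` achieves the same padding in place, because `resize` fills the added characters with `'\0'`.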

  11. Johannes Probst reporter

    We have been watching our logs and the error hasn't been observed anymore, so we consider it resolved. Thanks again everybody for your help!
