Segfault in relion_reconstruct with large box size and intel compiler

Issue #30 resolved
Thomas Klose created an issue

Technically this may not be a RELION bug, but I thought I would still report it. I was testing relion_reconstruct with a 1500px box size. Because our cluster is based on Intel Xeons and has the Intel compilers available, everything was compiled using the suggestions here, with optimization for AVX2. The reconstruction runs fine up to a box size of about 1400px, but with box sizes of 1500px or larger, relion_reconstruct crashes before writing out the map. Back projection runs fine (as far as I can tell), and the reconstruction starts normally and runs for several hours. I assume the error occurs when the map is written to disk, because a file with the right name is created but has zero bytes. Memory should not be an issue because the nodes have 1TB available and the peak usage is only around 110GB. I tested this with both the MPI and non-MPI versions. Compiling on the same hardware with gcc 6.3 lets me run the reconstruction successfully, although everything is slightly slower.

Comments (8)

  1. Thomas Klose reporter

    I did not. When I switch on ALTCPU with GCC I get compilation errors related to the TBB libraries. I did not look at them in much detail, but I could if it helps.

  2. CharlesCongdon Account Deactivated

    Hmm, I thought we had fixed those TBB link issues. @dkimanius?

    With ALTCPU off, the one thing introduced by the new code that I could think of as a likely cause isn't enabled. The code that does run in fftw.h doesn't have any obvious integer overflows. I can't tell about the output code, but it looks clean. This is one case where running under gdb and posting the crash stack (type "where" when the program blows up) would really help.
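    As a quick sanity check on the integer-overflow question, the sketch below compares the voxel count of a cubic map against the 32-bit index limits. It assumes a plain cubic box with no padding, and the helper names (`voxel_count`, `index_limits`) are hypothetical, not part of RELION. Both 1400^3 and 1500^3 already exceed the signed 32-bit range while still fitting in an unsigned 32-bit value, so this check on its own does not explain a threshold between those two box sizes.

```python
# Illustrative only: assumes an unpadded cubic box; names are hypothetical.
INT32_MAX = 2**31 - 1    # largest signed 32-bit integer
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit integer


def voxel_count(box_size):
    """Number of voxels in a cubic map with the given edge length."""
    return box_size ** 3


def index_limits(box_size):
    """Report whether a linear voxel index fits in 32-bit integer types."""
    n = voxel_count(box_size)
    return {
        "voxels": n,
        "fits_int32": n <= INT32_MAX,
        "fits_uint32": n <= UINT32_MAX,
    }


if __name__ == "__main__":
    for box in (1000, 1400, 1500):
        print(box, index_limits(box))
```

    If a loop counter or page size in the write path were a 32-bit type, a check like this is one way to see which box sizes would be affected; the gdb stack is still needed to confirm where the crash actually happens.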

    I've found peak memory tricky to measure when the code crashes, as it can spike much more quickly than default memory sampling by top or vmstat. However, you can get the trends if you do the following:

       vmstat -n -w [sampling_interval] > [log_file] &
       [Run what you want to track]
       kill $(jobs -p)
    

    And then graph the result in your favorite spreadsheet after stripping off the leading spaces: sed 's/^ *//g' < [log_file] > [stripped_log_file]
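    As an alternative to a spreadsheet, the log can be scanned with a short script. This is a minimal sketch, assuming the usual `vmstat -n -w` layout (one banner line, one header row starting with `r`, then one row of integers per sample). The function names and the sample log are made up for illustration. vmstat reports memory in KiB by default.

```python
# Minimal sketch: find the sample with the least free memory in a vmstat log.
# SAMPLE_LOG is invented data in the usual `vmstat -n -w` layout.
SAMPLE_LOG = """\
procs ---memory--- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd free buff cache si so bi bo in cs us sy id wa st
 1  0 0 900000 100 2000 0 0 1 2 10 20 1 0 99 0 0
 4  0 0 400000 100 2500 0 0 1 2 10 20 90 5 5 0 0
 2  0 0 650000 100 2200 0 0 1 2 10 20 50 5 45 0 0
"""


def parse_vmstat_log(text):
    """Return one dict per sample row, keyed by the vmstat column names."""
    columns = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            columns = fields  # header row that names the columns
            continue
        if columns is None or len(fields) != len(columns):
            continue  # banner line or a truncated row
        try:
            samples.append(dict(zip(columns, map(int, fields))))
        except ValueError:
            continue  # non-numeric row, e.g. a repeated banner
    return samples


def lowest_free_sample(samples):
    """Return (index, sample) for the row with the smallest 'free' value."""
    index = min(range(len(samples)), key=lambda i: samples[i]["free"])
    return index, samples[index]


if __name__ == "__main__":
    idx, row = lowest_free_sample(parse_vmstat_log(SAMPLE_LOG))
    print(f"lowest free memory at sample {idx}: {row['free']} KiB")
```

    To use it on a real run, read the log file into a string and pass it to `parse_vmstat_log`. The sed step is not needed here because `split()` ignores leading spaces.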

  3. Thomas Klose reporter

    I ran the memory sampling, but nothing out of the ordinary happened.

    Running the process with gdb shows the following:

     + Assuming zero beamtilt
     + Back-projecting all images ...
    000/??? sec ~~(,_,">                                                          [oo][New Thread 0x2ab43298e700 (LWP 30773)]
    [New Thread 0x2ab432d8f700 (LWP 30775)]
    [New Thread 0x2ab43ad8f700 (LWP 30774)]
    [New Thread 0x2ab433591700 (LWP 30777)]
    [New Thread 0x2ab433190700 (LWP 30776)]
    [New Thread 0x2ab433992700 (LWP 30778)]
    [New Thread 0x2ab438400700 (LWP 30780)]
    [New Thread 0x2ab433d93700 (LWP 30779)]
    [New Thread 0x2ab438801700 (LWP 30782)]
    [New Thread 0x2ab439404700 (LWP 30788)]
    [New Thread 0x2ab439003700 (LWP 30786)]
    [New Thread 0x2ab438c02700 (LWP 30787)]
    [New Thread 0x2ab43a408700 (LWP 30791)]
    [New Thread 0x2ab439805700 (LWP 30790)]
    [New Thread 0x2ab439c06700 (LWP 30789)]
    [New Thread 0x2ab43a007700 (LWP 30792)]
    [New Thread 0x2ab43a809700 (LWP 30793)]
    [New Thread 0x2ab43b591700 (LWP 30795)]
    [New Thread 0x2ab43b190700 (LWP 30794)]
       5/   5 sec ............................................................~~(,_,">
     + Starting the reconstruction ...
    
    Program received signal SIGSEGV, Segmentation fault.
    castPage2Datatype (this=<optimized out>, srcPtr=0x2abad1a27010,
        page=<optimized out>, datatype=<optimized out>, pageSize=<optimized out>)
        at /depot/mr/apps/relion-3.0_beta/src/image.h:614
    614                             ptr[i] = (float)srcPtr[i];
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 libgcc-4.8.5-16.el7_4.2.x86_64 libstdc++-4.8.5-16.el7_4.2.x86_64
    (gdb) where
    #0  castPage2Datatype (this=<optimized out>, srcPtr=0x2abad1a27010,
        page=<optimized out>, datatype=<optimized out>, pageSize=<optimized out>)
        at /depot/mr/apps/relion-3.0_beta/src/image.h:614
    #1  Image<double>::writeMRC (this=0x2aaab209f9d9 <typeinfo name for double>,
        img_select=46912619805136, isStack=false, mode=-777883632)
        at /depot/mr/apps/relion-3.0_beta/src/rwMRC.h:463
    #2  0x00000000004455fa in Image<double>::_write (
        this=0x2aaab209f9d9 <typeinfo name for double>, name=..., hFile=...,
        select_img=46981869367312, isStack=176, mode=-1392982304)
        at /depot/mr/apps/relion-3.0_beta/src/image.h:1439
    #3  0x000000000042d233 in Reconstructor::reconstruct (
        this=0x2aaab209f9d9 <typeinfo name for double>)
        at /depot/mr/apps/relion-3.0_beta/src/image.h:450
    #4  0x0000000000416fdd in main (argc=-1307969063,
        argv=0x2aaab209f9d0 <typeinfo name for float>)
        at /depot/mr/apps/relion-3.0_beta/src/apps/reconstruct.cpp:30
    

    HTH, Thomas
