AutoPicker Bad Alloc in PDF Logfile Creation

Issue #1 resolved
craigyk created an issue

I've been getting intermittent crashes running the AutoPicker.

It seems to happen when writing the final results:

 Autopicking ...
17.37/17.37 min ............................................................~~(,_,">
 Generating logfile.pdf ... 
  10/  10 sec ............................................................~~(,_,">
 Total number of particles from 2241 micrographs is 88220
 i.e. on average there were 39 particles per micrograph
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------

This is the stack trace:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[henry5:410975] *** Process received signal ***
[henry5:410975] Signal: Aborted (6)
[henry5:410975] Signal code:  (-6)
[henry5:410975] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f25a78406d0]
[henry5:410975] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f25a749a277]
[henry5:410975] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f25a749b968]
[henry5:410975] [ 3] /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x7f25a81ebac5]
[henry5:410975] [ 4] /lib64/libstdc++.so.6(+0x5ea36)[0x7f25a81e9a36]
[henry5:410975] [ 5] /lib64/libstdc++.so.6(+0x5ea63)[0x7f25a81e9a63]
[henry5:410975] [ 6] /lib64/libstdc++.so.6(+0x5ec83)[0x7f25a81e9c83]
[henry5:410975] [ 7] /lib64/libstdc++.so.6(+0xb3782)[0x7f25a823e782]
[henry5:410975] [ 8] /eppec/storage/sw/relion/3.0b/bin/relion_autopick_mpi(_ZN13MetaDataTable15columnHistogramE8EMDLabelRSt6vectorIdSaIdEES4_iP7CPlot2Dlddbb+0x10d1)[0x4b8fe1]
[henry5:410975] [ 9] /eppec/storage/sw/relion/3.0b/bin/relion_autopick_mpi(_ZN10AutoPicker18generatePDFLogfileEv+0xcad)[0x437f6d]
[henry5:410975] [10] /eppec/storage/sw/relion/3.0b/bin/relion_autopick_mpi(main+0x149)[0x42c939]
[henry5:410975] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f25a7486445]
[henry5:410975] [12] /eppec/storage/sw/relion/3.0b/bin/relion_autopick_mpi[0x42fd00]
[henry5:410975] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node henry5 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Comments (8)

  1. craigyk reporter

    I can confirm the problem is in the code that generates the PDF report. I commented out the call in the app source files, recompiled, and these jobs finished.
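
    Roughly, the change was to comment out the call that generates the PDF report (per the stack trace above, that is AutoPicker::generatePDFLogfile()). A sketch of the idea is below; the header path and the surrounding calls are illustrative assumptions, not the exact RELION 3.0b source.

    #!

    // Sketch only: skip the PDF report in the autopicker program's main().
    // The header path and the read()/run() calls are assumptions for
    // illustration; generatePDFLogfile() is the routine seen in the trace.
    #include "src/autopicker.h"

    int main(int argc, char *argv[])
    {
        AutoPicker prm;
        prm.read(argc, argv);          // parse command-line options
        prm.run();                     // do the actual picking
        // prm.generatePDFLogfile();   // disabled: this call throws std::bad_alloc
        return 0;
    }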

  2. Takanori Nakane

    Thank you very much for the feedback.

    Could you please change the fourth argument of the two columnHistogram calls in autopicker.cpp from 0 to 1, i.e. change MDresult.columnHistogram(EMDL_MLMODEL_GROUP_NR_PARTICLES, histX, histY, 0, plot2D); and MDresult.columnHistogram(EMDL_PARTICLE_AUTOPICK_FOM, histX, histY, 0, plot2Dd); to MDresult.columnHistogram(EMDL_MLMODEL_GROUP_NR_PARTICLES, histX, histY, 1, plot2D); and MDresult.columnHistogram(EMDL_PARTICLE_AUTOPICK_FOM, histX, histY, 1, plot2Dd);, and then try again?

    This will print some information useful for debugging.
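
    In other words, only the fourth argument changes (presumably a verbosity flag, since setting it to 1 prints extra information). Assuming the calls appear exactly as quoted above, the edit in autopicker.cpp looks like this:

    #!

    // before:
    MDresult.columnHistogram(EMDL_MLMODEL_GROUP_NR_PARTICLES, histX, histY, 0, plot2D);
    MDresult.columnHistogram(EMDL_PARTICLE_AUTOPICK_FOM, histX, histY, 0, plot2Dd);

    // after: fourth argument set to 1 to print debugging information
    MDresult.columnHistogram(EMDL_MLMODEL_GROUP_NR_PARTICLES, histX, histY, 1, plot2D);
    MDresult.columnHistogram(EMDL_PARTICLE_AUTOPICK_FOM, histX, histY, 1, plot2Dd);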

  3. Shaun Rawson

    Hi Takanori,

    I appear to be encountering the same bug with the PDF generation in a CtfFind job:

    #!
    
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    [em-drift1:221095] *** Process received signal ***
    [em-drift1:221095] Signal: Aborted (6)
    [em-drift1:221095] Signal code:  (-6)
    [em-drift1:221095] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b1c66b836d0]
    [em-drift1:221095] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b1c66dc6277]
    [em-drift1:221095] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b1c66dc7968]
    [em-drift1:221095] [ 3] /programs/x86_64-linux/relion/3.0_beta_cu8.0/extlib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x12d)[0x33144bea8d]
    [em-drift1:221095] [ 4] /programs/x86_64-linux/relion/3.0_beta_cu8.0/extlib/libstdc++.so.6[0x33144bcbe6]
    [em-drift1:221095] [ 5] /programs/x86_64-linux/relion/3.0_beta_cu8.0/extlib/libstdc++.so.6[0x33144bcc13]
    [em-drift1:221095] [ 6] /programs/x86_64-linux/relion/3.0_beta_cu8.0/extlib/libstdc++.so.6[0x33144bcd32]
    [em-drift1:221095] [ 7] /programs/x86_64-linux/relion/3.0_beta_cu8.0/extlib/libstdc++.so.6[0x33144615c2]
    [em-drift1:221095] [ 8] /programs/x86_64-linux/relion/3.0_beta_cu8.0/bin/relion_run_ctffind_mpi(_ZN13MetaDataTable15columnHistogramE8EMDLabelRSt6vectorIdSaIdEES4_iP7CPlot2Dlddbb+0x1210)[0x498100]
    [em-drift1:221095] [ 9] /programs/x86_64-linux/relion/3.0_beta_cu8.0/bin/relion_run_ctffind_mpi(_ZN13CtffindRunner18joinCtffindResultsEv+0xf5a)[0x427eba]
    [em-drift1:221095] [10] /programs/x86_64-linux/relion/3.0_beta_cu8.0/bin/relion_run_ctffind_mpi(_ZN16CtffindRunnerMpi3runEv+0x20c)[0x478ffc]
    [em-drift1:221095] [11] /programs/x86_64-linux/relion/3.0_beta_cu8.0/bin/relion_run_ctffind_mpi(main+0x3b)[0x41cddb]
    [em-drift1:221095] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1c66db2445]
    [em-drift1:221095] [13] /programs/x86_64-linux/relion/3.0_beta_cu8.0/bin/relion_run_ctffind_mpi[0x41d401]
    [em-drift1:221095] *** End of error message ***
    [warn] Epoll ADD(4) on fd 42 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
    [warn] Epoll ADD(4) on fd 39 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
    

    The CtfFind job was running as part of the relion_it.py pipeline and had completed 450 iterations of that loop without issue before the error occurred. The STAR file still appears to have been written correctly at the end of the run.

    This RELION build was made on 01 Aug 2018.

  4. Takanori Nakane

    Thank you very much for your report. Today I made two changes to the histogram generation code. Could you please update your local installation and try again?

  5. Shaun Rawson

    We've updated our installation, but I doubt I'll be able to reproduce the error, as it happened long into a scheduled loop. I'll let you know if it happens again.
