Lazylogger Segmentation fault

Issue #40 resolved
Ryan Loots created an issue

Hi

The lazylogger from the latest git pull still seems to cause a segmentation fault when trying to do a disk -> disk copy.

I found this "bug" a while back, but since we've had our own fork of midas for some time, I fixed it there and then forgot about it. I am now busy merging our fork of midas into the latest upstream (git) version, so I ran into this "bug" again.

Below is a breakdown of the error.

I'll post the gdb output and the offending code segment, as well as my impromptu fix.

#gdb segmentation fault output

Reading symbols from /home/xxx/new_midas/midas/linux/bin/lazylogger...done.
(gdb) run -c DISK
Starting program: /home/xxx/new_midas/midas/linux/bin/lazylogger -c DISK
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Lazy,ERROR] [odb.c:1229:db_open_database,ERROR] Removed ODB client 'Lazy_Disk', index 3 because process pid 6121 does not exists
[Lazy,INFO] Updated notify_count of "/Experiment/Security/RPC hosts/Allowed hosts" from 4 to 3
[Lazy,INFO] Removed open record flag from "/Runinfo/State"
[Lazy,INFO] Removed exclusive access mode from "/Runinfo/State"
[Lazy,INFO] Removed open record flag from "/Lazy/Disk/Settings"
[Lazy,INFO] Removed exclusive access mode from "/Lazy/Disk/Settings"
[Lazy,INFO] Removed open record flag from "/Lazy/Disk/Statistics"
[Lazy,INFO] Removed exclusive access mode from "/Lazy/Disk/Statistics"
[Lazy,INFO] Corrected 4 ODB entries
[Lazy,INFO] Deleted entry '/System/Clients/6121' for client 'Lazy_Disk' because it is not connected to ODB
Lazy_Disk starting... ! to exit 
output disk size 694119.8 MiB, free 63950.3 MiB
[Lazy_Disk,INFO] Output file '/home/xxx/new_midas/online/data/run00001.mid.gz' already exists, removing
[Lazy_Disk,INFO] Starting lazy_disk_copy '/home/xxx/online/data/run00001.mid.gz' to '/home/xxx/new_midas/online/data/run00001.mid.gz'
[Lazy_Disk,INFO] Starting lazy_disk_copy '/home/xxx/online/data/run00001.mid.gz' to '/home/xxx/new_midas/online/data/run00001.mid.gz'

Program received signal SIGSEGV, Segmentation fault.
lazy_disk_copy_loop (outfile=0x7fffffffdbc0 "/home/xxx/new_midas/online/data/run00001.mid.gz", infile=0x7fffffffdb40 "/home/xxx/online/data/run00001.mid.gz", fpout=0x6a9ff0, 
    fpin=0x6a9940) at src/lazylogger.cxx:1326
1326    {
(gdb) bt
#0  lazy_disk_copy_loop (outfile=0x7fffffffdbc0 "/home/xxx/new_midas/online/data/run00001.mid.gz", infile=0x7fffffffdb40 "/home/xxx/online/data/run00001.mid.gz", fpout=0x6a9ff0, 
    fpin=0x6a9940) at src/lazylogger.cxx:1326
#1  0x0000000000406ff9 in lazy_disk_copy (outfile=0x7fffffffdbc0 "/home/xxx/new_midas/online/data/run00001.mid.gz", infile=0x7fffffffdb40 "/home/xxx/online/data/run00001.mid.gz")
    at src/lazylogger.cxx:1468
#2  0x000000000040a3bf in lazy_main (channel=<optimized out>, pLall=<optimized out>) at src/lazylogger.cxx:2080
#3  0x00000000004047f9 in main (argc=1, argv=0x662d00 <lazyinfo>) at src/lazylogger.cxx:2506

#Setting a breakpoint

(gdb) run -c Disk
Starting program: /home/xxx/new_midas/midas/linux/bin/lazylogger -c Disk
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Lazy_Disk starting... ! to exit 
output disk size 694119.8 MiB, free 63771.5 MiB
[Lazy_Disk,INFO] Output file '/home/xxx/new_midas/online/data/run00001.mid.gz' already exists, removing
[Lazy_Disk,INFO] Starting lazy_disk_copy '/home/xxx/online/data/run00001.mid.gz' to '/home/xxx/new_midas/online/data/run00001.mid.gz'
[Lazy_Disk,INFO] Starting lazy_disk_copy '/home/xxx/online/data/run00001.mid.gz' to '/home/xxx/new_midas/online/data/run00001.mid.gz'

Breakpoint 1, lazy_disk_copy_loop (outfile=0x7fffffffdbc0 "/home/xxx/new_midas/online/data/run00001.mid.gz", infile=0x7fffffffdb40 "/home/xxx/online/data/run00001.mid.gz", 
    fpout=0x6a9ff0, fpin=0x7fffffffd8a0) at src/lazylogger.cxx:1326
1326    {

(gdb) next
1338             int rd = fread(buf, 1, kBufSize, fpin);
(gdb) print kBufSize
$1 = <optimized out>
(gdb) print buf
Cannot access memory at address 0x7fffff5fd840

#Offending code snippet

int lazy_disk_copy_loop(const char *outfile, const char *infile, FILE* fpout, FILE* fpin)
{
.
.
.
 /* infinite loop while copying */
   while (1) {
      if (copy_continue) {
         const int kBufSize = 10*1024*1024;
         char buf[kBufSize];
         int rd = fread(buf, 1, kBufSize, fpin);

#Compiler

gcc -v: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)

I also tested this on Debian with gcc version 4.9.2 (Debian 4.9.2-10); the exact same error was produced.

I've included the diff file comparing our older forked version of midas with the latest git pull.

The last commit we have from this git pull is:

commit e173b03742e7186288d425606ca831fa7401e958
Author: Stefan Ritt <stefan.ritt@psi.ch>
Date:   Tue Sep 27 13:13:16 2016 +0200

#ODB Settings

Period                          10
Maintain free space (%)         0
Stay behind                     0
Alarm Class                     
Running condition               ALWAYS
Data dir                        /home/xxx/online/data
Data format                     MIDAS
Filename format                 run%05d.mid.gz
Backup type                     Disk
Execute after rewind            
Path                            /home/xxx/new_midas/online/data
Capacity (Bytes)                5e+09
List label                      Disk
Execute before writing file     
Execute after writing file      
Modulo.Position                 
Tape Data Append                n

Maybe I overlooked something that caused the seg fault?

Comments (13)

  1. dd1

    The lazylogger code has not been touched for a long time, but there is a possibility of a bug. We mostly use the "SCRIPT" method for writing to CASTOR and EOS (at CERN) and DCACHE (at TRIUMF). The current implementation of the "DISK" method was done to test writing to HADOOP HDFS (it worked just fine), but it was never used in production, AFAIK (maybe in DEAP at SNOLAB?). K.O.

  2. dd1

    This rings a bell, this rings a bell. I just fixed the same problem in mhttpd. The problem is the large array being allocated on the stack (local variables inside functions are allocated on the stack). This used to work, but "suddenly" started to fail with segfaults. It was never clear how much stack space is actually available (there is contention for stack space between different threads), and when stack space runs out the only possible result is a segfault (whereas malloc() can at least return NULL). So maybe there was a recent change in the stack layouts, resulting in reduced available stack size. K.O.
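
    For reference, the stack limit the process actually got can be queried at runtime with getrlimit() - a minimal check, assuming Linux (illustration only, not midas code):

    #include <cstdio>
    #include <sys/resource.h>

    int main()
    {
       struct rlimit rl;
       // RLIMIT_STACK is the same per-process stack size limit that "ulimit -s" reports
       if (getrlimit(RLIMIT_STACK, &rl) == 0)
          std::printf("stack limit: soft %lu bytes, hard %lu bytes\n",
                      (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);
       return 0;
    }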

  3. dd1

    Ok, check this out. Run "ulimit -a" to check the user limits and observe "stack size": my CentOS 7 machine says 8192 kbytes, which is around 8 Mbytes (not sure if those are kbytes or kibytes, 1000 vs 1024). lazylogger demands 10 Mbytes, so it will crash for sure. Ouch. Your fix is correct: allocate the buffer on the heap using malloc(). K.O.
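
    For illustration, a heap-based copy loop could look roughly like this (a sketch only, not the attached diff and not the actual lazylogger code paths; copy_stream is a made-up name):

    #include <cstdio>
    #include <cstdlib>

    /* hypothetical stripped-down copy loop: the 10 MiB buffer lives on the
       heap, so it no longer competes with the ~8 MiB default stack limit */
    static int copy_stream(FILE* fpin, FILE* fpout)
    {
       const int kBufSize = 10*1024*1024;
       char *buf = (char*) malloc(kBufSize);
       if (buf == NULL)
          return -1;       /* malloc() can at least fail gracefully; a stack overflow cannot */
       int status = 0;
       while (1) {
          int rd = fread(buf, 1, kBufSize, fpin);
          if (rd <= 0)
             break;
          int wr = fwrite(buf, 1, rd, fpout);
          if (wr != rd) {
             status = -1;
             break;
          }
       }
       free(buf);          /* the catch: every exit path now has to release the buffer */
       return status;
    }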

  4. dd1

    It is impossible to replace the stack allocation with malloc() without rewriting the function completely - there are too many "return" paths everywhere, and I do not want to replace a crash with a memory leak. So: reduce the buffer size to be smaller than the stack size limit. Kludge, kludge, kludge. K.O.
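
    In code terms the kludge amounts to nothing more than this (sketch; the exact value chosen in the actual commit may differ):

       const int kBufSize = 1024*1024;   /* 1 MiB: safely below the 8 MiB default stack limit */
       char buf[kBufSize];
       int rd = fread(buf, 1, kBufSize, fpin);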

  5. Stefan Ritt
    auto buf = std::unique_ptr<char, decltype(free)*>{
                reinterpret_cast<char *>(malloc(10*1024*1024)),
                free };
    

    :-D

  6. Stefan Ritt

    Well, you were the first to come up with std::string. This is just a bit more advanced C++. A unique_ptr calls free() via its destructor when it goes out of scope, so no explicit free() is needed. Cool, eh? Shall I bring a C++11 book with me?
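
    A tiny standalone demo of the idea (illustration only, not midas code):

    #include <cstdio>
    #include <cstdlib>
    #include <memory>

    int main()
    {
       {
          auto buf = std::unique_ptr<char, decltype(free)*>{
                      reinterpret_cast<char *>(malloc(10*1024*1024)),
                      free };
          std::printf("buffer at %p\n", static_cast<void *>(buf.get()));
       }   // buf goes out of scope here and its deleter (free) runs automatically
       return 0;
    }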

  7. dd1

    It may be C++11, but it is unreadable gibberish; might as well program in C#. Behind the coolness and verbosity hides a performance bug: instead of allocating the buffer once, it is allocated and freed on each read/write cycle. The on-stack implementation did not have this bug because on-stack allocation is "free". The fix is to move the allocation to the outermost scope of the function. Also, one might as well use the C++ allocator "new" instead of malloc/free. BTW, use of C++11 makes the code unbuildable on SL6, where the standard compiler cannot do C++11. K.O.
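
    I.e. something along these lines, allocating once before the while(1) loop instead of once per read/write cycle (sketch only; the real function has more going on):

       const int kBufSize = 10*1024*1024;
       char *buf = new char[kBufSize];   /* or std::vector<char> buf(kBufSize), which frees itself */
       while (1) {
          if (copy_continue) {
             int rd = fread(buf, 1, kBufSize, fpin);
             /* ... fwrite to fpout, bookkeeping, break on EOF or error ... */
          }
       }
       delete[] buf;   /* still has to happen on every return path, which is the annoying part */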

  8. Stefan Ritt

    Well, when you put this into midas for the first time:

    std::cout << "Filename ="+fn << std::endl;
    

    it also looked like gibberish Klingon to me. And you just did that without asking anybody. I was a bit confused, but finally bit the bullet: I read some C++ books and even took a C++ course, and finally I got convinced that C++ is the way to go. My favourite C++ program is this one, which counts occurrences of words in a sentence:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <map>
    
    using namespace std;
    
    int main()
    {
       vector<string> w = { "one", "two", "two", "three", "one" };
       map <string, int> words;
    
       for (const auto &i : w)
          words[i]++;
    
       for (const auto &i : words)
          cout << i.first << " : " << i.second << endl;
    
       return 0;
    }
    

    Try it! And then tell me how many lines of code it would take to program this in C!

    There are many more nice features in C++ 11 and later, like standard threads, mutexes with timeouts (I know you like that!), etc. There is even a fantastic JSON parser at

    https://github.com/nlohmann/json

    which allows you things like

    json j = {
      {"pi", 3.141},
      {"happy", true},
      {"answer", {
        {"everything", 42}
      }},
      {"list", {1, 0, 2}},
      {"object", {
        {"currency", "USD"},
        {"value", 42.99}
      }}
    };
    

    right in your C++ code!

    One should not neglect this because learning all this might take a bit of time. I finally went through this, driven by your C++ additions, now I expect you to do the same.

    Concerning the buffer allocation, I agree there is some overhead, but it's small. Try the following program:

    #include <iostream>
    #include <chrono>
    #include <cstring>
    #include <memory>
    
    #define SIZE 100*1024 // 100kB
    
    void mem_stack()
    {
       char buf[SIZE];
       std::memset(buf, 0, SIZE);
    }
    
    void mem_malloc()
    {
       std::unique_ptr<char[]> buf(new char[SIZE]);
       std::memset(&buf[0], 0, SIZE);
    }
    
    int main(int argc, const char * argv[]) {
    
       auto start = std::chrono::system_clock::now();
       std::chrono::duration<double> diff;
       int i = 0;
       do {
          for (int j=0 ; j<100 ; j++)
             mem_stack();
          auto now = std::chrono::system_clock::now();
          diff = now-start;
          i += 100;
       } while (diff.count() < 1);
    
       std::cout << "Stack allocation: " << i << " calls/s" << std::endl;
    
       start = std::chrono::system_clock::now();
       i = 0;
       do {
          for (int j=0 ; j<100 ; j++)
             mem_malloc();
          auto now = std::chrono::system_clock::now();
          diff = now-start;
          i += 100;
       } while (diff.count() < 1);
    
       std::cout << "Heap allocation:  " << i << " calls/s" << std::endl;
    
    
       return 0;
    }
    

    It compares stack allocation with dynamic heap allocation, using std::unique_ptr, which automatically de-allocates the array. I use a typical event size of 100k, and to do something with the buffer I fill it with zeros. On my laptop I get:

    Stack allocation: 325700 calls/s
    Heap allocation:  322300 calls/s
    

    telling me that dynamic memory allocation is "cheap": you can still do >> 100'000 events per second. It might have been different some years ago, but that has changed.

    And BTW, this program above compiles fine under SL5.1 with gcc 4.6.3 (just use the -std=c++0x flag).

    Stefan
