SpColByCol and other column-by-column SpGEMM ops

Issue #2 new
Aydin Buluc created an issue

Saliya discovered a bug in the column-by-column SpGEMM implementations in CombBLAS. I can verify that both the old SpColByCol in Friends.h and LocalSpGEMM in mtSpGEMM.h reproducibly seg fault. Valgrind wasn't helpful. The matrices are small and skewed: the local matrices are 3x100 and 100x2, for example. But nnz>0, so it isn't about a missing simple check such as m=0 or nnz=0.

It is possible (but unlikely) that the full "hash-based" versions are fine and this only affects the heap codes. However, I doubt that is the case and haven't tested it.

For the moment, I reverted MultAnXBn_Synch in ParFriends.h to use the old outer-product implementation (change log here: https://bitbucket.org/berkeleylab/combinatorial-blas-2.0/commits/53156800699573c512589e813ff744b2b226eb73#chg-CombBLAS/include/CombBLAS/ParFriends.h), and this sidesteps the issue for now.

After we are done w/ deadlines, my plan is to dump the local matrices right before multiplication on the ranks where the seg fault happens, so that I can reproduce the bug w/out MPI (valgrind output gets really crowded w/ MPI).
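A minimal sketch of what such a per-rank dump could look like. Everything here is hypothetical: the `Triple` type, the function names, and the file naming scheme are illustrations, not CombBLAS APIs, and the rank is passed in as a plain integer so the sketch itself needs no MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical triple type: (row, column, value). Not a CombBLAS type.
using Triple = std::tuple<std::int64_t, std::int64_t, double>;

// Builds a per-rank file name such as "A_rank3.txt" (the scheme is an assumption).
std::string DumpFileName(const std::string& label, int rank) {
    assert(rank >= 0);
    return label + "_rank" + std::to_string(rank) + ".txt";
}

// Writes "nrows ncols nnz" on the first line, then one "row col value"
// triple per line. Returns false if the file could not be opened or written.
bool DumpLocalTriples(const std::string& label, int rank,
                      std::int64_t nrows, std::int64_t ncols,
                      const std::vector<Triple>& triples) {
    std::ofstream out(DumpFileName(label, rank));
    if (!out) return false;
    out << nrows << ' ' << ncols << ' ' << triples.size() << '\n';
    for (const Triple& t : triples) {
        out << std::get<0>(t) << ' ' << std::get<1>(t) << ' '
            << std::get<2>(t) << '\n';
    }
    out.flush();
    return static_cast<bool>(out);
}
```

Calling something like this for both operands on the crashing rank, right before the local multiplication, would leave two small files that a serial driver can read back and multiply under valgrind or a sanitizer.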

What we know so far:

1- It isn't about multithreading (because both the serial SpColByCol and the threaded ones crash).

2- It isn't about the use of the "aux" array (because I forced the system to always use the "scanning based" if/else branch of the dcsc->FillColInds() function and it still seg faults).

For the moment, just know that MultAnXBn_Synch is using the old implementation.

The code that triggers the bug is inside Applications/SegTestApp.

Comments (2)

  1. Aydin Buluc reporter

    Use AddressSanitizer to identify the issue:

    Described here: https://clang.llvm.org/docs/AddressSanitizer.html

    I had to install clang (brew install llvm) on my laptop, add it to the path, and then call MPI with "mpicxx -cxx=clang++" but in the end it is WAY faster than valgrind and way cleaner output [no false alarms].

    There is a makefile-clang in the repo’s CASpGEMM-tfq branch that compiles with -fsanitize=address….

    ==80897==ERROR: AddressSanitizer: attempting double-free on 0x000113052800 in thread T0:

    #0 0x10cd87192 in wrap__ZdlPv (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x6e192)

    #1 0x10c852274 in std::__1::_DeallocateCaller::__do_call(void*) new:320

    #2 0x10c852258 in std::__1::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned long) new:278

    #3 0x10c852230 in std::__1::_DeallocateCaller::__do_deallocate_handle_size_align(void*, unsigned long, unsigned long) new:248

  2. Aydin Buluc reporter

    Two separate issues:

    (1) There was a double delete in the AUX handling. The AUX that was created inside LocalSpGEMM was deleted inside the EstimateNNZ function, but then it was deleted again right before LocalSpGEMM returns. I added logic that makes it possible to not delete AUX inside EstimateNNZ.

    (2) The input was incorrect. This was why the code still crashed despite disallowing any AUX creation/handling.

    mat.2.txt included the following line at the end

    3,448,333

    However, the matrix is supposed to have only 400 columns, so this is clearly off.

    I deleted the last line and it works now.
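A bounds check at load time would have flagged the stray line before it reached the multiplication. The sketch below is hypothetical: it assumes each line is a comma-separated `row,col,value` entry with 1-based indices, which is a guess based on the line quoted above, and `FindBadEntries` is an illustrative name rather than an existing CombBLAS function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the 1-based line numbers of entries that cannot be parsed or whose
// indices fall outside an nrows x ncols matrix.
// Assumption: lines look like "row,col,value" with 1-based indices.
std::vector<std::size_t> FindBadEntries(std::istream& in,
                                        std::int64_t nrows,
                                        std::int64_t ncols) {
    assert(nrows >= 0 && ncols >= 0);
    std::vector<std::size_t> bad;
    std::string line;
    std::size_t lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        if (line.empty()) continue;  // tolerate blank lines
        std::istringstream fields(line);
        std::int64_t r = 0, c = 0;
        double v = 0.0;
        char sep1 = 0, sep2 = 0;
        const bool parsed =
            static_cast<bool>(fields >> r >> sep1 >> c >> sep2 >> v) &&
            sep1 == ',' && sep2 == ',';
        const bool inRange =
            parsed && r >= 1 && r <= nrows && c >= 1 && c <= ncols;
        if (!inRange) bad.push_back(lineno);
    }
    return bad;
}
```

With a 400-column matrix, an entry whose column index is 448 is reported by line number instead of causing an out-of-bounds write later on.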

    Consequently, I revived the LocalSpGEMM function (which calls the faster column-by-column implementation) within ParFriends.h.
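The double delete from (1) and the kind of fix described there can be sketched as follows. This is an illustration under assumptions, not the actual CombBLAS code: `Aux`, `EstimateNnzSketch`, `LocalMultiplySketch`, and the `deleteAux` flag are hypothetical names, and the "estimate" is just a sum standing in for the real computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the auxiliary index array; not the CombBLAS type.
// The live counter lets the sketch check that every Aux is freed exactly once.
struct Aux {
    static int live;
    std::vector<std::size_t> offsets;
    explicit Aux(std::size_t n) : offsets(n, 0) { ++live; }
    ~Aux() { --live; }
};
int Aux::live = 0;

// The callee frees aux only when the caller explicitly hands over ownership.
std::size_t EstimateNnzSketch(const std::vector<std::size_t>& colCounts,
                              Aux* aux, bool deleteAux) {
    std::size_t total = 0;
    for (std::size_t n : colCounts) total += n;
    if (deleteAux) delete aux;
    return total;
}

// The caller creates aux, keeps ownership across the estimate, and frees it
// once at the end. Passing deleteAux=true here would recreate the double delete.
std::size_t LocalMultiplySketch(const std::vector<std::size_t>& colCounts) {
    Aux* aux = new Aux(colCounts.size());
    const std::size_t estimate =
        EstimateNnzSketch(colCounts, aux, /*deleteAux=*/false);
    assert(aux->offsets.size() == colCounts.size());  // aux is still valid
    delete aux;
    return estimate;
}
```

A flag keeps the change small. Holding the array in a `std::unique_ptr` and passing a raw pointer or reference to the callee would be another way to make the ownership explicit.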
