A bug with mzd_ple()

Issue #74 resolved
yao sun created an issue

Recently, I suffered from a bug of the function mzd_ple() when I eliminated large matrices over GF(2).

For a matrix M, after calling mzd_ple(M, P, Q, 0), the matrix M is updated. And for each row i in M, there should be no nonzero bits between the column i (inclusive) and Q->values[i] (exclusive) in this row. This fact is true for most matrices I meet. However, when I deal with a matrix with 124 920 rows and 932 298 columns (using about 14 GB memory), the fact does not hold. Interestingly, these nonzero bits tend to appear in the same columns, so I guess they are related to the sub-matrices used in the PLE decomposition. This bug makes the decomposition of the whole matrix incorrect.

I think this bug only appears when the matrix is large. For more testing, I decrease the row number of the same matrix M by discarding some rows directly. The bug first appears in the matrix with 60 000 rows, and it still there in matrices with 80 000 rows and 100 000 rows. But the PLE decomposition is correct when the number of rows is no more than 50 000. The cases are the same when I use my macbook pro and two desktop computers running ubuntu 18.04.

I guess this bug will appear in randomly generated matrices, if not, I can upload my matrices through some ways.

My M4RI package has been updated to version 20200125, and the bug is still there.

Comments (9)

  1. Martin Albrecht repo owner

    Hi, just to say that we received your bug report. I have to admit I’m a bit rusty wrt this code, I haven’t touched it in 10 years or so. That said, we obviously want to get to the bottom of this.

    Sorry, if the following should be bloody obvious to the maintainer of a linear algebra package: “And for each row i in M, there should be no nonzero bits between the column i (inclusive) and Q->values[i] (exclusive) in this row.” I take this to mean that we expect all entries M[i,j]== 1 for i <= j < Q->values[i]. I can see that this tends to hold but is that guaranteed for PLE decomposition?

    Here’s some naive experiments with this

    sage: from sage.matrix.matrix_mod2_dense import ple   
    sage: k=14; A = random_matrix(GF(2), 2^k, 2^k); LU, P, Q = ple(A, algorithm="naive"); all(LU[i,j] == 1 for i in range(LU.nrows()) for j in range(i, Q[i]))
    True
    sage: k=15; A = random_matrix(GF(2), 2^k, 2^k); LU, P, Q = ple(A, algorithm="naive"); all(LU[i,j] == 1 for i in range(LU.nrows()) for j in range(i, Q[i]))
    False
    

    I’m using the “naive” algorithm which doesn’t use a lot of the advanced machinery.

  2. Clément Pernet

    Hi,

    I think it means something different: in each row i, there should be only 0’s between the main diagonal (column i) and the pivot of the echelon form E (column Q[i]).

    The only reason your call with k=14 returns true, is probably that your matrix has full rank, and therefore Q=Identity and range(i,Q[i])=emptyset.

    Now, regarding this problem:

    • I first thought it was not a bug but an unspecified behaviour: the “interesting” part of E is located to the right of its pivots, which positions are given by the Q[i]'s. In other implementation of the algorithm (e.g. in fflas-ffpack), the coefficients to the left of the pivots are intentionnaly not zeroed out, to save the cost of doing so.
    • Now after reviewing the code and the specs in m4ri, it seems indeed a bug, as the code actually clears out the the coefficients in _mzd_compress_l and the specs state that the S matrix stored above the main diagonal equals UQ from a PLUQ decomposition.

    The fact that the bug only occurs for matrices above dimension 60000 is clearly because of the threshold for running the PLE algorithm (defaulted at 524288). I guess we could reproduce smaller instances by forcing a smaller cutoff.

    I’ll try to have a look at where the bug comes from (likely in mzd_col_swap_in_rows ) but it looks tedious and I won’t have time for this in the next coming days. Meanwhile, can’t you avoid this problem buy reading only the “interesting part” of E in your application.

    Best

    Clément

  3. yao sun reporter

    Hi, Martin and Clément, thanks for your responses. I think there may be something unclear in my first report. I am using M4RI for solving polynomials over boolean polynomial rings. So my matrices always come from polynomial systems. One characteristic of my matrices is that the pivots of the rows usually do not appear in the diagonal positions, i.e., Q[i] is often larger than i. Since a PLE decomposition do not swap columns, there should be all zeros in the range(i,Q[i]). But this is not true when I use mzd_ple() for large matrices, even though the matrices have full ranks.

    I produce a simple example by the following codes, and I hope it will be helpful.

          mzd_t *M = mzd_init(60000, 1000000);
          mzd_randomize(M);
          mzp_t *P = mzp_init(M->nrows);
          mzp_t *Q = mzp_init(M->ncols);
    
          printf("doing ple...\n");
          rci_t rank = mzd_ple(M, P, Q, 0);
          printf("ple is done. rank %d\n", rank);
    
          for (rci_t row = 0; row < rank; row++) {
              bool flag = false;
              for (rci_t col = row + 1; col < Q->values[row]; col++) {
                  if (ROW_HAS_BIT(M->rows[row], col)) {
                      if (!flag) {
                          printf("row %d (Q->values[%d]: %d) has nonzero bit at col %d", row, row, Q->values[row], col);
                          flag = true;
                      } else {
                          printf(", %d", col);
                      }
                  }
              }
              if (flag) {
                  printf("\n");
              }
          }
    

    And there is a macro:

    #define ROW_HAS_BIT(prow, col) ((((prow)[(col) >> 6]) >> ((col) & 63)) & ((word)(1)))
    

    After about 1500 seconds, it outputs:

    doing ple...
    ple is done. rank 60000
    row 40960 (Q->values[40960]: 41075) has nonzero bit at col 40961, 40963, 40964, 40965, 40966, 40967, 40968, 40975, 40976, 40977, 40982, 40987, 40992, 40994, 40998, 41000, 41004, 41005, 41007, 41009, 41012, 41014, 41015, 41017, 41018, 41020, 41023, 41027, 41032, 41034, 41035, 41036, 41037, 41040, 41042, 41043
    row 40961 (Q->values[40961]: 41076) has nonzero bit at col 40962, 40966, 40967, 40968, 40969, 40978, 40979, 40980, 40981, 40985, 40987, 40989, 40992, 40995, 40997, 40998, 41004, 41006, 41009, 41010, 41012, 41013, 41015, 41017, 41021, 41022, 41023, 41024, 41026, 41028, 41030, 41033, 41036, 41038, 41039, 41040, 41042, 41044
    row 40962 (Q->values[40962]: 41077) has nonzero bit at col 40963, 40964, 40965, 40967, 40968, 40969, 40970, 40971, 40973, 40974, 40976, 40977, 40978, 40982, 40983, 40986, 40988, 40989, 40991, 40992, 40993, 40996, 40998, 40999, 41001, 41002, 41003, 41005, 41007, 41009, 41011, 41012, 41013, 41015, 41016, 41017, 41020, 41021, 41022, 41023, 41025, 41027, 41029, 41032, 41033, 41037, 41041, 41044
    row 40963 (Q->values[40963]: 41078) has nonzero bit at col 40965, 40967, 40970, 40971, 40973, 40976, 40981, 40982, 40983, 40984, 40985, 40986, 40987, 40991, 40993, 40995, 40996, 40998, 40999, 41002, 41004, 41011, 41013, 41015, 41021, 41022, 41023, 41025, 41028, 41029, 41030, 41032, 41040, 41041, 41043, 41044, 41045, 41046
    row 40964 (Q->values[40964]: 41079) has nonzero bit at col 40965, 40967, 40968, 40970, 40976, 40978, 40980, 40981, 40982, 40983, 40984, 40986, 40988, 40991, 40993, 40995, 40998, 40999, 41001, 41006, 41007, 41009, 41010, 41012, 41014, 41015, 41016, 41018, 41019, 41020, 41023, 41028, 41031, 41033, 41034, 41036, 41037, 41040, 41042, 41043, 41046, 41047
    row 40965 (Q->values[40965]: 41080) has nonzero bit at col 40966, 40967, 40969, 40970, 40973, 40977, 40979, 40980, 40983, 40990, 40991, 40994, 40995, 40996, 40999, 41003, 41004, 41007, 41008, 41009, 41011, 41013, 41014, 41018, 41019, 41022, 41023, 41028, 41030, 41031, 41032, 41034, 41044, 41045, 41048
    row 40966 (Q->values[40966]: 41081) has nonzero bit at col 40969, 40971, 40972, 40974, 40975, 40976, 40978, 40986, 40987, 40989, 40990, 40991, 40992, 40993, 40996, 40997, 40999, 41000, 41001, 41002, 41003, 41006, 41013, 41015, 41017, 41022, 41024, 41026, 41029, 41030, 41032, 41037, 41042, 41043, 41044
    row 40967 (Q->values[40967]: 41082) has nonzero bit at col 40969, 40972, 40973, 40974, 40980, 40982, 40985, 40987, 40989, 40991, 40992, 40994, 40995, 40996, 40997, 40999, 41000, 41002, 41003, 41004, 41005, 41006, 41007, 41008, 41009, 41013, 41014, 41015, 41016, 41017, 41019, 41026, 41027, 41035, 41039, 41040, 41041, 41042, 41043, 41044, 41045, 41046, 41048
    row 40968 (Q->values[40968]: 41083) has nonzero bit at col 40974, 40977, 40979, 40980, 40981, 40987, 40990, 40991, 40992, 40993, 40994, 41002, 41003, 41009, 41010, 41012, 41013, 41014, 41016, 41017, 41018, 41020, 41021, 41022, 41024, 41027, 41028, 41029, 41030, 41031, 41032, 41033, 41034, 41038, 41040, 41042, 41048
    row 40969 (Q->values[40969]: 41084) has nonzero bit at col 40971, 40972, 40973, ...
    ...
    ...
    ...
    row 59998 (Q->values[59998]: 60415) has nonzero bit at col 59999, 60003, 60005, 60007, 60010, 60011, 60012, 60014, 60016, 60017, 60018, 60019, 60023, 60024, 60028, 60031, 60032, 60034, 60036, 60038, 60039, 60042, 60044, 60045, 60049, 60050, 60054, 60056, 60057, 60061, 60062, 60063, 60065, 60066, 60077, 60080, 60082, 60084, 60085, 60086, 60087, 60088, 60089, 60095, 60097, 60100, 60102, 60106, 60108, 60109, 60111, 60113, 60114, 60120, 60121, 60123, 60125, 60128, 60130, 60132, 60133, 60135, 60137, 60139, 60144, 60145, 60146, 60147, 60149, 60152, 60153, 60154, 60158, 60160, 60161, 60163, 60165, 60167, 60168, 60169, 60170, 60171, 60172, 60174, 60175, 60177, 60181, 60182, 60183, 60184, 60185, 60187, 60188, 60190, 60192, 60193, 60195, 60201, 60203, 60209, 60211, 60213, 60217, 60219, 60222, 60224, 60226, 60228, 60231, 60232, 60234, 60235, 60236, 60237, 60238, 60239, 60243, 60244, 60245, 60250, 60251, 60255, 60258, 60259, 60260, 60262, 60263, 60264, 60265, 60266, 60267, 60268, 60270, 60272, 60273, 60274, 60275, 60276, 60278, 60281, 60283, 60286, 60290, 60291, 60299, 60300, 60303, 60304, 60305, 60306, 60309, 60313, 60314, 60316, 60317, 60318, 60319, 60321
    row 59999 (Q->values[59999]: 60419) has nonzero bit at col 60003, 60004, 60005, 60007, 60008, 60009, 60013, 60014, 60015, 60017, 60018, 60022, 60024, 60026, 60027, 60029, 60030, 60031, 60034, 60035, 60036, 60043, 60045, 60046, 60048, 60049, 60052, 60053, 60054, 60056, 60059, 60061, 60062, 60065, 60066, 60068, 60071, 60072, 60073, 60074, 60078, 60079, 60081, 60082, 60085, 60086, 60087, 60088, 60092, 60093, 60094, 60096, 60097, 60098, 60105, 60108, 60109, 60111, 60112, 60113, 60114, 60115, 60116, 60117, 60124, 60125, 60126, 60130, 60131, 60133, 60135, 60136, 60138, 60140, 60141, 60143, 60146, 60147, 60151, 60155, 60156, 60158, 60160, 60163, 60164, 60166, 60167, 60168, 60171, 60177, 60178, 60179, 60184, 60186, 60187, 60189, 60190, 60191, 60193, 60194, 60196, 60197, 60198, 60201, 60202, 60203, 60205, 60206, 60208, 60213, 60215, 60216, 60218, 60219, 60224, 60226, 60227, 60228, 60229, 60230, 60231, 60232, 60234, 60235, 60236, 60237, 60240, 60241, 60242, 60243, 60244, 60246, 60247, 60248, 60250, 60251, 60254, 60257, 60258, 60259, 60263, 60266, 60267, 60268, 60269, 60271, 60274, 60277, 60279, 60281, 60283, 60286, 60288, 60289, 60290, 60291, 60292, 60297, 60298, 60299, 60300, 60302, 60307, 60308, 60311, 60313, 60314, 60315, 60317, 60318, 60319, 60321, 60323, 60324
    Time used: 1463.440 sec.
    

    Since my matrices come from polynomials, each column corresponds to a monomial and each row is actually a boolean polynomial. With the help of a solution to this system, I can easily check whether the echelonized form (matrix E) is correct, because all polynomials in the echelonized form should vanish on this solution. Unfortunately, when this bug appears, the matrix E is not correct, even only the “interesting part” of E is considered.

    M4RI has helped me a lot. I hope my report will help to improve it.

    Best regards,

    Yao

  4. Martin Albrecht repo owner

    I can reproduce the behaviour, using your code. But I cannot reproduce it from within Sage for some reason:

    sage: A = random_matrix(GF(2), 60000, 1000000)
    sage: from sage.matrix.matrix_mod2_dense import ple
    sage: %time LU, P, Q = ple(A)
    CPU times: user 20min 49s, sys: 15.3 s, total: 21min 5s
    Wall time: 21min
    sage: all(LU[i,j] == 0 for i in range(LU.nrows()) for j in range(i+1, Q[i]))
    True
    

  5. Martin Albrecht repo owner

    FWIW I set

    -#define __M4RI_PLE_CUTOFF MIN(524288, __M4RI_CPU_L3_CACHE >> 3)
    +#define __M4RI_PLE_CUTOFF MIN(1024, __M4RI_CPU_L3_CACHE >> 3)
    

    and it still wouldn’t fail for mzd_t *M = mzd_init(30000, 1000000); so I’m not sure that cutoff is to blame.

  6. Martin Albrecht repo owner

    Some progress, I narrowed it down to mzd_col_swap_in_rows which is called by _mzd_compress_l. Replacing the former by the appropriate slow mzd_read_bit and mzd_write_bit calls the bug disappears. Furthermore, compiling/running with -fsanitize=address indicates an invalid read.

  7. yao sun reporter

    Dear Martin,

    Thanks very much for resolving this bug. I really appreciate it. Thanks.

    With best regards,

    Yao Sun

  8. Log in to comment