A bug with mzd_ple()
Recently, I suffered from a bug of the function mzd_ple() when I eliminated large matrices over GF(2).
For a matrix M, after calling mzd_ple(M, P, Q, 0), the matrix M is updated. And for each row i in M, there should be no nonzero bits between the column i (inclusive) and Q->values[i] (exclusive) in this row. This fact is true for most matrices I meet. However, when I deal with a matrix with 124 920 rows and 932 298 columns (using about 14 GB memory), the fact does not hold. Interestingly, these nonzero bits tend to appear in the same columns, so I guess they are related to the sub-matrices used in the PLE decomposition. This bug makes the decomposition of the whole matrix incorrect.
I think this bug only appears when the matrix is large. For more testing, I decrease the row number of the same matrix M by discarding some rows directly. The bug first appears in the matrix with 60 000 rows, and it still there in matrices with 80 000 rows and 100 000 rows. But the PLE decomposition is correct when the number of rows is no more than 50 000. The cases are the same when I use my macbook pro and two desktop computers running ubuntu 18.04.
I guess this bug will appear in randomly generated matrices, if not, I can upload my matrices through some ways.
My M4RI package has been updated to version 20200125, and the bug is still there.
Comments (9)
-
repo owner -
Hi,
I think it means something different: in each row i, there should be only 0’s between the main diagonal (column i) and the pivot of the echelon form E (column Q[i]).
The only reason your call with k=14 returns true, is probably that your matrix has full rank, and therefore Q=Identity and range(i,Q[i])=emptyset.
Now, regarding this problem:
- I first thought it was not a bug but an unspecified behaviour: the “interesting” part of E is located to the right of its pivots, which positions are given by the Q[i]'s. In other implementation of the algorithm (e.g. in fflas-ffpack), the coefficients to the left of the pivots are intentionnaly not zeroed out, to save the cost of doing so.
- Now after reviewing the code and the specs in m4ri, it seems indeed a bug, as the code actually clears out the the coefficients in
_mzd_compress_l
and the specs state that the S matrix stored above the main diagonal equals UQ from a PLUQ decomposition.
The fact that the bug only occurs for matrices above dimension 60000 is clearly because of the threshold for running the PLE algorithm (defaulted at 524288). I guess we could reproduce smaller instances by forcing a smaller cutoff.
I’ll try to have a look at where the bug comes from (likely in
mzd_col_swap_in_rows
) but it looks tedious and I won’t have time for this in the next coming days. Meanwhile, can’t you avoid this problem buy reading only the “interesting part” of E in your application.Best
Clément
-
reporter Hi, Martin and Clément, thanks for your responses. I think there may be something unclear in my first report. I am using M4RI for solving polynomials over boolean polynomial rings. So my matrices always come from polynomial systems. One characteristic of my matrices is that the pivots of the rows usually do not appear in the diagonal positions, i.e., Q[i] is often larger than i. Since a PLE decomposition do not swap columns, there should be all zeros in the range(i,Q[i]). But this is not true when I use mzd_ple() for large matrices, even though the matrices have full ranks.
I produce a simple example by the following codes, and I hope it will be helpful.
mzd_t *M = mzd_init(60000, 1000000); mzd_randomize(M); mzp_t *P = mzp_init(M->nrows); mzp_t *Q = mzp_init(M->ncols); printf("doing ple...\n"); rci_t rank = mzd_ple(M, P, Q, 0); printf("ple is done. rank %d\n", rank); for (rci_t row = 0; row < rank; row++) { bool flag = false; for (rci_t col = row + 1; col < Q->values[row]; col++) { if (ROW_HAS_BIT(M->rows[row], col)) { if (!flag) { printf("row %d (Q->values[%d]: %d) has nonzero bit at col %d", row, row, Q->values[row], col); flag = true; } else { printf(", %d", col); } } } if (flag) { printf("\n"); } }
And there is a macro:
#define ROW_HAS_BIT(prow, col) ((((prow)[(col) >> 6]) >> ((col) & 63)) & ((word)(1)))
After about 1500 seconds, it outputs:
doing ple... ple is done. rank 60000 row 40960 (Q->values[40960]: 41075) has nonzero bit at col 40961, 40963, 40964, 40965, 40966, 40967, 40968, 40975, 40976, 40977, 40982, 40987, 40992, 40994, 40998, 41000, 41004, 41005, 41007, 41009, 41012, 41014, 41015, 41017, 41018, 41020, 41023, 41027, 41032, 41034, 41035, 41036, 41037, 41040, 41042, 41043 row 40961 (Q->values[40961]: 41076) has nonzero bit at col 40962, 40966, 40967, 40968, 40969, 40978, 40979, 40980, 40981, 40985, 40987, 40989, 40992, 40995, 40997, 40998, 41004, 41006, 41009, 41010, 41012, 41013, 41015, 41017, 41021, 41022, 41023, 41024, 41026, 41028, 41030, 41033, 41036, 41038, 41039, 41040, 41042, 41044 row 40962 (Q->values[40962]: 41077) has nonzero bit at col 40963, 40964, 40965, 40967, 40968, 40969, 40970, 40971, 40973, 40974, 40976, 40977, 40978, 40982, 40983, 40986, 40988, 40989, 40991, 40992, 40993, 40996, 40998, 40999, 41001, 41002, 41003, 41005, 41007, 41009, 41011, 41012, 41013, 41015, 41016, 41017, 41020, 41021, 41022, 41023, 41025, 41027, 41029, 41032, 41033, 41037, 41041, 41044 row 40963 (Q->values[40963]: 41078) has nonzero bit at col 40965, 40967, 40970, 40971, 40973, 40976, 40981, 40982, 40983, 40984, 40985, 40986, 40987, 40991, 40993, 40995, 40996, 40998, 40999, 41002, 41004, 41011, 41013, 41015, 41021, 41022, 41023, 41025, 41028, 41029, 41030, 41032, 41040, 41041, 41043, 41044, 41045, 41046 row 40964 (Q->values[40964]: 41079) has nonzero bit at col 40965, 40967, 40968, 40970, 40976, 40978, 40980, 40981, 40982, 40983, 40984, 40986, 40988, 40991, 40993, 40995, 40998, 40999, 41001, 41006, 41007, 41009, 41010, 41012, 41014, 41015, 41016, 41018, 41019, 41020, 41023, 41028, 41031, 41033, 41034, 41036, 41037, 41040, 41042, 41043, 41046, 41047 row 40965 (Q->values[40965]: 41080) has nonzero bit at col 40966, 40967, 40969, 40970, 40973, 40977, 40979, 40980, 40983, 40990, 40991, 40994, 40995, 40996, 40999, 41003, 41004, 41007, 41008, 41009, 41011, 41013, 41014, 41018, 41019, 41022, 41023, 41028, 41030, 41031, 41032, 41034, 41044, 41045, 41048 row 40966 (Q->values[40966]: 41081) has nonzero bit at col 40969, 40971, 40972, 40974, 40975, 40976, 40978, 40986, 40987, 40989, 40990, 40991, 40992, 40993, 40996, 40997, 40999, 41000, 41001, 41002, 41003, 41006, 41013, 41015, 41017, 41022, 41024, 41026, 41029, 41030, 41032, 41037, 41042, 41043, 41044 row 40967 (Q->values[40967]: 41082) has nonzero bit at col 40969, 40972, 40973, 40974, 40980, 40982, 40985, 40987, 40989, 40991, 40992, 40994, 40995, 40996, 40997, 40999, 41000, 41002, 41003, 41004, 41005, 41006, 41007, 41008, 41009, 41013, 41014, 41015, 41016, 41017, 41019, 41026, 41027, 41035, 41039, 41040, 41041, 41042, 41043, 41044, 41045, 41046, 41048 row 40968 (Q->values[40968]: 41083) has nonzero bit at col 40974, 40977, 40979, 40980, 40981, 40987, 40990, 40991, 40992, 40993, 40994, 41002, 41003, 41009, 41010, 41012, 41013, 41014, 41016, 41017, 41018, 41020, 41021, 41022, 41024, 41027, 41028, 41029, 41030, 41031, 41032, 41033, 41034, 41038, 41040, 41042, 41048 row 40969 (Q->values[40969]: 41084) has nonzero bit at col 40971, 40972, 40973, ... ... ... ... row 59998 (Q->values[59998]: 60415) has nonzero bit at col 59999, 60003, 60005, 60007, 60010, 60011, 60012, 60014, 60016, 60017, 60018, 60019, 60023, 60024, 60028, 60031, 60032, 60034, 60036, 60038, 60039, 60042, 60044, 60045, 60049, 60050, 60054, 60056, 60057, 60061, 60062, 60063, 60065, 60066, 60077, 60080, 60082, 60084, 60085, 60086, 60087, 60088, 60089, 60095, 60097, 60100, 60102, 60106, 60108, 60109, 60111, 60113, 60114, 60120, 60121, 60123, 60125, 60128, 60130, 60132, 60133, 60135, 60137, 60139, 60144, 60145, 60146, 60147, 60149, 60152, 60153, 60154, 60158, 60160, 60161, 60163, 60165, 60167, 60168, 60169, 60170, 60171, 60172, 60174, 60175, 60177, 60181, 60182, 60183, 60184, 60185, 60187, 60188, 60190, 60192, 60193, 60195, 60201, 60203, 60209, 60211, 60213, 60217, 60219, 60222, 60224, 60226, 60228, 60231, 60232, 60234, 60235, 60236, 60237, 60238, 60239, 60243, 60244, 60245, 60250, 60251, 60255, 60258, 60259, 60260, 60262, 60263, 60264, 60265, 60266, 60267, 60268, 60270, 60272, 60273, 60274, 60275, 60276, 60278, 60281, 60283, 60286, 60290, 60291, 60299, 60300, 60303, 60304, 60305, 60306, 60309, 60313, 60314, 60316, 60317, 60318, 60319, 60321 row 59999 (Q->values[59999]: 60419) has nonzero bit at col 60003, 60004, 60005, 60007, 60008, 60009, 60013, 60014, 60015, 60017, 60018, 60022, 60024, 60026, 60027, 60029, 60030, 60031, 60034, 60035, 60036, 60043, 60045, 60046, 60048, 60049, 60052, 60053, 60054, 60056, 60059, 60061, 60062, 60065, 60066, 60068, 60071, 60072, 60073, 60074, 60078, 60079, 60081, 60082, 60085, 60086, 60087, 60088, 60092, 60093, 60094, 60096, 60097, 60098, 60105, 60108, 60109, 60111, 60112, 60113, 60114, 60115, 60116, 60117, 60124, 60125, 60126, 60130, 60131, 60133, 60135, 60136, 60138, 60140, 60141, 60143, 60146, 60147, 60151, 60155, 60156, 60158, 60160, 60163, 60164, 60166, 60167, 60168, 60171, 60177, 60178, 60179, 60184, 60186, 60187, 60189, 60190, 60191, 60193, 60194, 60196, 60197, 60198, 60201, 60202, 60203, 60205, 60206, 60208, 60213, 60215, 60216, 60218, 60219, 60224, 60226, 60227, 60228, 60229, 60230, 60231, 60232, 60234, 60235, 60236, 60237, 60240, 60241, 60242, 60243, 60244, 60246, 60247, 60248, 60250, 60251, 60254, 60257, 60258, 60259, 60263, 60266, 60267, 60268, 60269, 60271, 60274, 60277, 60279, 60281, 60283, 60286, 60288, 60289, 60290, 60291, 60292, 60297, 60298, 60299, 60300, 60302, 60307, 60308, 60311, 60313, 60314, 60315, 60317, 60318, 60319, 60321, 60323, 60324 Time used: 1463.440 sec.
Since my matrices come from polynomials, each column corresponds to a monomial and each row is actually a boolean polynomial. With the help of a solution to this system, I can easily check whether the echelonized form (matrix E) is correct, because all polynomials in the echelonized form should vanish on this solution. Unfortunately, when this bug appears, the matrix E is not correct, even only the “interesting part” of E is considered.
M4RI has helped me a lot. I hope my report will help to improve it.
Best regards,
Yao
-
reporter - attached outputs_of_the_example.rtf
full outputs of the example
-
repo owner I can reproduce the behaviour, using your code. But I cannot reproduce it from within Sage for some reason:
sage: A = random_matrix(GF(2), 60000, 1000000) sage: from sage.matrix.matrix_mod2_dense import ple sage: %time LU, P, Q = ple(A) CPU times: user 20min 49s, sys: 15.3 s, total: 21min 5s Wall time: 21min sage: all(LU[i,j] == 0 for i in range(LU.nrows()) for j in range(i+1, Q[i])) True
-
repo owner FWIW I set
-#define __M4RI_PLE_CUTOFF MIN(524288, __M4RI_CPU_L3_CACHE >> 3) +#define __M4RI_PLE_CUTOFF MIN(1024, __M4RI_CPU_L3_CACHE >> 3)
and it still wouldn’t fail for
mzd_t *M = mzd_init(30000, 1000000);
so I’m not sure that cutoff is to blame. -
repo owner Some progress, I narrowed it down to
mzd_col_swap_in_rows
which is called by_mzd_compress_l
. Replacing the former by the appropriate slowmzd_read_bit
andmzd_write_bit
calls the bug disappears. Furthermore, compiling/running with-fsanitize=address
indicates an invalid read. -
repo owner - changed status to resolved
-
reporter Dear Martin,
Thanks very much for resolving this bug. I really appreciate it. Thanks.
With best regards,
Yao Sun
- Log in to comment
Hi, just to say that we received your bug report. I have to admit I’m a bit rusty wrt this code, I haven’t touched it in 10 years or so. That said, we obviously want to get to the bottom of this.
Sorry, if the following should be bloody obvious to the maintainer of a linear algebra package: “And for each row
i
inM
, there should be no nonzero bits between the columni
(inclusive) andQ->values[i]
(exclusive) in this row.” I take this to mean that we expect all entriesM[i,j]== 1
fori <= j < Q->values[i]
. I can see that this tends to hold but is that guaranteed for PLE decomposition?Here’s some naive experiments with this
I’m using the “naive” algorithm which doesn’t use a lot of the advanced machinery.