- marked as
- changed component to Cactus
- removed comment
Some tests may be non-deterministic.
For the ET_2018_09 release, the test runs on Sep 6 and Sep 22 differ, and these logs show a different number of failed tests:
- comet__1_24.log
- cori__1_16.log
- osx-homebrew__2_2.log
- osx-macports__2_2.log
Given that there should not have been a code change between those days, this seems suspicious.
The particular tests that changed are:
+++ b/results/comet__1_24.log
@@ -1434,7 +1434,7 @@
AHFinderDirect: misner1.2-025
Success: 55 files compared, 9 differ in the last digits
AHFinderDirect: recoverML-EE
- Success: 5 files compared, 5 differ in the last digits
+ Failure: 2 files missing, 3 files compared, 3 differ, 3 differ significantly
Carpet: 64k2
Success: 0 files identical
Carpet: test_restrict_sync
+++ b/results/cori__1_16.log
@@ -1446,7 +1446,7 @@
CarpetIOHDF5: CarpetWaveToyNewRecover_test_1proc
Success: 12 files compared, 7 differ in the last digits
CarpetIOHDF5: CarpetWaveToyRecover_test_1proc
- Failure: 12 files missing, 0 files compared, 0 differ
+ Success: 12 files compared, 7 differ in the last digits
CarpetIOHDF5: CarpetWaveToyRecover_test_newcp_1proc
Success: 12 files compared, 7 differ in the last digits
CarpetIOHDF5: newsep
+++ b/results/osx-homebrew__2_2.log
@@ -1631,7 +1631,7 @@
PeriodicCarpet: testperiodicinterp
Success: 1 files identical
QuasiLocalMeasures: qlm-bl
- Failure: 54 files missing, 115 files compared, 115 differ
+ Success: 169 files compared, 143 differ in the last digits
QuasiLocalMeasures: qlm-ks
Success: 169 files compared, 138 differ in the last digits
QuasiLocalMeasures: qlm-ks-EE
+++ b/results/osx-macports__2_2.log
@@ -1631,7 +1631,7 @@
PeriodicCarpet: testperiodicinterp
Success: 1 files identical
QuasiLocalMeasures: qlm-bl
- Failure: 54 files missing, 115 files compared, 115 differ
+ Success: 169 files compared, 144 differ in the last digits
QuasiLocalMeasures: qlm-ks
Success: 169 files compared, 136 differ in the last digits
QuasiLocalMeasures: qlm-ks-EE
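A quick way to surface exactly these status flips is to diff two log files and keep only the Success/Failure lines. A minimal sketch (the file contents below are made-up stand-ins, not the actual release logs):

```shell
# Compare test statuses between two testsuite logs (sample data, not the real logs).
old=$(mktemp)
new=$(mktemp)

# stand-in for the earlier run's log
cat > "$old" <<'EOF'
AHFinderDirect: recoverML-EE
  Success: 5 files compared, 5 differ in the last digits
EOF

# stand-in for the later run's log
cat > "$new" <<'EOF'
AHFinderDirect: recoverML-EE
  Failure: 2 files missing, 3 files compared, 3 differ
EOF

# keep only the changed Success/Failure lines from the diff
status_diff=$(diff "$old" "$new" | grep -E '^[<>] +(Success|Failure)')
echo "$status_diff"

rm -f "$old" "$new"
```

Running the same comparison over the full logs from both dates would list every test whose status changed between the runs.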
Keyword: None
Comments (7)
-
reporter - removed comment
I'm confused. I thought all these tests worked on master right before the release?
-
reporter - removed comment
Correct. It may be that there is actual variation from run to run (e.g. due to a race condition), or there may have been an OS update in between (this happened on golub [not listed], where one cannot even compile anymore).
I have not looked at the failures since I am swamped with other issues.
-
reporter - removed comment
On comet the log file for the failed test contains:
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 0
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 0:
  #000: H5Dio.c line 173 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 550 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #002: H5Dchunk.c line 1872 in H5D__chunk_read(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #003: H5Dchunk.c line 2902 in H5D__chunk_lock(): data pipeline read failed
    major: Data filters
    minor: Filter operation failed
  #004: H5Z.c line 1382 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #005: H5Zdeflate.c line 136 in H5Z_filter_deflate(): memory allocation failed for deflate uncompression
    major: Resource unavailable
    minor: No space available for allocation
WARNING[L1,P0] (CarpetIOHDF5): HDF5 call 'H5Dread(dataset, datatype, memspace, filespace, xfer, cctkGH->data[patch->vindex][timelevel])' returned error code -1
indicating an issue with the HDF5 library. There certainly should be no issue with running out of memory, since the total dataset size in the file in question (checkpointML-EE/checkpoint.chkpt.it_1.h5) is only about 20 MB:
h5ls -v checkpointML-EE/checkpoint.chkpt.it_1.h5 | gawk '/logical bytes/{sum += $2} END{print sum/1e6}'
19.9976
It seems more likely that there is (again) a bug in HDF5's gzip code (see #1878).
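The gawk one-liner just sums the "logical bytes" figures that h5ls -v reports; it can be sanity-checked offline against a few sample lines (the numbers below are fabricated for illustration, chosen so they add up to roughly the 20 MB seen on comet):

```shell
# Made-up h5ls -v output lines; the real file reported about 19.9976 MB in total.
sample='    Storage:   9999880 logical bytes, 123456 allocated bytes
    Storage:   9997720 logical bytes, 123456 allocated bytes'

# same awk program as in the comment above: field 2 is the logical-bytes count
total_mb=$(printf '%s\n' "$sample" | awk '/logical bytes/{sum += $2} END{print sum/1e6}')
echo "$total_mb"   # → 19.9976
```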
We had issues with Comet's file system in the past (#2073) related to writing and immediately reading files, which is more or less what happens with the testsuite data, though this does not seem related.
-
reporter - removed comment
On cori the error during the "vanilla" run was:
+ srun -n 1 -c 32 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/SIMFACTORY/exe/cactus_sim -L 3 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/output-0000/arrangements/Carpet/CarpetIOHDF5/test/CarpetWaveToyRecover_test_1proc.par
srun: error: task 0 launch failed: Error configuring interconnect
which looks like a cluster error to me.
-
- removed comment
I've seen that kind of output from HDF5 when MPI is misconfigured, i.e. the mpirun doesn't match the mpicxx.
-
reporter - removed comment
Replying to [comment:6 Steven R. Brandt]:
I've seen that kind of output from HDF5 when MPI is misconfigured, i.e. the mpirun doesn't match the mpicxx.
OK, so I should check whether the MPI stack changed between the tests.
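One way to do that is to snapshot which MPI launcher and compiler wrapper a run picks up, so snapshots from the two dates can be diffed afterwards (a generic sketch using the usual MPI tool names, not anything Cori-specific):

```shell
# Record which MPI tools are on PATH; save this per run and diff the snapshots.
out=$(
  for tool in mpirun mpicxx; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s: %s\n' "$tool" "$(command -v "$tool")"
    else
      printf '%s: not found\n' "$tool"
    fi
  done
)
echo "$out"
```

On a Cray system like cori the launcher is srun rather than mpirun, so the tool list would need adjusting; the point is only to capture the toolchain alongside each testsuite run.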