IO corruption on SDSC oasis file systems
I am experiencing file corruption in ASCII output produced by the Cactus code on Comet's (and Gordon's as far as I remember) scratch file systems. This manifests as lines of output being mashed together in the output file.
I added strace calls to my job script to capture all arguments to the OS's write() function and re-created the write() calls from that log. When replayed on a login node, those write calls produce a correct file (no mashed lines).
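For reference, the capture step can be sketched roughly like this (a sketch only: the executable name, parameter file, and log path are placeholders, not the actual RunScript contents; `-s 8192` just keeps the logged write() payloads from being truncated):

```shell
# Sketch of wrapping the MPI launch in strace (names are placeholders):
#   -f              follow child processes
#   -e trace=write  log only write() calls and their arguments
#   -s 8192         log up to 8192 bytes of each string argument
strace -f -e trace=write -s 8192 -o strace.$SLURM_JOB_ID.log \
    ibrun ./cactus_sim parfile.par
```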
All output to the file in question was from rank 0 only even though the code used MPI and ran on two MPI ranks.
The same code and number of MPI ranks produces a correct output file when run on the $HOME file system.
Thus it seems to me as if there may be an issue with the file system. I can try to reduce the test case to a more minimal example (right now it is a full simulation, even though it runs for less than 1 minute).
You can find the job script (for account, SLURM options, etc.) here:
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/SIMFACTORY/SubmitScript
the script that launches the MPI executable here:
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/SIMFACTORY/RunScript
the strace output here:
/home/rhaas/strace/strace.1882[67].log
and the awk script to recreate the write calls is:
gawk -vFS='"' '/write.*\/grid-coordinates.xy.asc/{print "printf \""$2"\""}' ~/strace.18826.log >recreate.sh
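To illustrate the replay step on a synthetic log line (the strace-style line below is fabricated for illustration; the real data is in the strace.1882[67].log files above, and plain awk is used here in place of gawk):

```shell
# Fabricated strace-style log line; real input comes from strace.1882[67].log.
cat > fake_strace.log <<'EOF'
write(17, "1\t4\t3\t4\t1\t0.166666666666667\t-0.0714285714285714\n", 48) = 48
EOF

# Same idea as the gawk one-liner above: splitting on double quotes puts the
# write() payload in field 2; turn each logged write into a printf command.
awk -v FS='"' '/write/{print "printf \""$2"\""}' fake_strace.log > recreate.sh

# Replaying the writes expands the \t and \n escapes via printf.
sh recreate.sh > replayed.asc
cat replayed.asc
```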
The corrupted line is, e.g., line 161 of
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/TEST/sim/CarpetIOASCII/newsep/grid-coordinates.xy.asc
which reads
1 4 3 4 1 0.1666666666660.505076272276105285714 etc
but should read
1 4 3 4 1 0.166666666666667 -0.0714285714285714
I can avoid the file corruption by flushing the output file after each line.
I am wondering if anything is known about this, or if there is a workaround that does not boil down to first writing all data to a file system local to the compute node and copying it to /oasis/scratch after the job has finished (and how much local space would be available, since I would also have to do the same for e.g. checkpoint files and 3D HDF5 output).
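The stage-out workaround I was hoping to avoid would look roughly like this (a sketch only: temp directories stand in for the node-local disk and the /oasis/scratch target, and a real job script would also copy checkpoints and HDF5 output):

```shell
# Temp dirs stand in for the real locations (illustrative only):
LOCAL=$(mktemp -d)   # node-local scratch, e.g. /scratch/$USER/$SLURM_JOB_ID on Comet
DEST=$(mktemp -d)    # stands in for /oasis/scratch/comet/$USER/temp_project

# Compute phase: all ASCII output goes to the node-local directory.
printf '1\t4\t3\t4\t1\t0.166666666666667\t-0.0714285714285714\n' \
    > "$LOCAL/grid-coordinates.xy.asc"

# Stage-out phase, after the simulation finishes: copy everything over.
cp -r "$LOCAL"/. "$DEST"/
ls "$DEST"
```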
Keyword: Comet
Keyword: Gordon
Keyword: SDSC
Comments (5)
-
reporter -
reporter - removed comment
Update from SDSC: the issue has been identified, and a new version of Lustre fixes the problem. However, deploying the new Lustre client caused problems due to an incompatibility with the Lustre server version in use. So this is being worked on, but not yet fixed.
-
reporter - removed comment
The fix does indeed resolve the issue for Cactus. I will close this ticket once we have confirmation that the fix has been applied cluster-wide.
-
reporter - changed status to resolved
- removed comment
According to the SDSC support team (ticket #63522, in a message from Wed, 13 Dec 2017 20:39:59 -0600), the change has now been pushed to all nodes. Closing this ticket.
-
reporter - edited description
- changed status to closed
SDSC's support team said (02/27/2017 22:30 in XSEDE ticket 63522)