PITTNullCode results depend on number of processors
As part of trying to add SphericalHarmonicReconGen to the ET I noticed that PITTNullCode produces different results depending on the number of cores used. This already happens with the ''regression_test'' test of SphericalHarmonicRecon that is already in the ET. The issue is hidden right now since the test uses very loose tolerances of
ABSTOL 1.5e-5
RELTOL 1.5e-4
Changing them to the standard Cactus accuracies and regenerating the tests shows that the test that was generated using 2 processors fails when run with a single MPI process.
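For reference, the standard Cactus test tolerances are (to my knowledge) 1e-12 for both, i.e. the settings above in test.ccl become
ABSTOL 1e-12
RELTOL 1e-12
before regenerating the test data.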
Test data is in the branch rhaas/procdependency.
This needs to be fixed, or the arrangement must be marked as broken, unless this is the intended behaviour, in which case it needs to be documented prominently in the documentation.
Keyword: PITTNullCode
Comments (13)
- removed comment
The issue seems to be the number of angular ghostzones.
reporter - changed status to open
- removed comment
- changed watchers to reisswig@aei.mpg.de
reporter - removed comment
Christian Reisswig suggested changing the NullInterp options to the default ones like so:
#Interpolation--------------------------
#NullInterp::interpolation_order = 4
#NullInterp::stereo_patch_type = "circle"
#NullInterp::deriv_accuracy = 4
which indeed reduces the differences between the 2-process and the 1-process test runs, but does not make them go away.
- removed comment
Setting the ghost_size to 4, which is correct, makes the results independent of the domain decomposition to the level of roundoff. I can submit a new, more thorough testsuite.
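Presumably this corresponds to a parfile setting along the following lines; the parameter name is taken from the ParamCheck patch further down this ticket, so treat it as a guess:
NullGrid::N_ang_ghost_pts = 4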
reporter - removed comment
Thank you for the updated test, it certainly seems to move in the right direction. I still get differences, though, if I run it with 2 processors or 1 (one has to change test.ccl to remove the lenient ABSTOL and RELTOL settings); they are relative errors of order 1e-11.
Test SphericalHarmonicRecon: regression_test
   "PITTNullCode/SphericalHarmonicRecon/test/regression_test.par"
Issuing mpirun -np 1 /data/rhaas/postdoc/gr/ET_trunk/exe/cactus_sim /data/rhaas/postdoc/gr/ET_trunk/arrangements/PITTNullCode/SphericalHarmonicRecon/test/regression_test.par
  Jn_l[0]_2D.asc: substantial differences
    significant differences on 259 (out of 6627) lines
    maximum absolute difference in column 3 is 1.17932885679295e-11
    maximum absolute difference in column 4 is 7.48709427789152e-12
    maximum relative difference in column 3 is 6.80798181095467e-11
    maximum relative difference in column 4 is 3.29043498091389e-11
    (insignificant differences on 1702 lines)
  NewsB[0]_2D.asc: differences below tolerance on 256 lines
  jcn[5]_2D.asc: differences below tolerance on 405 lines
In principle I would have expected multiple processors to give exactly the same answer, though if there really are differences in the coordinates then I could believe that roundoff-level differences can grow. The test only does 2 steps though (cctk_final_time is 0.1 and the timestep is 0.05), so 4 orders of magnitude of growth (from 1e-15 roundoff to 1e-11) seem large. Would it be possible to output the coordinates as an HDF5 file (using regular HDF5 output) to check whether they actually differ?
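A sketch of parfile settings that could produce such output, using the standard IOHDF5 parameters (the group names are the ones tried later in this ticket):
IOHDF5::out_every = 1
IOHDF5::out_vars  = "NullGrid::StCrd NullGrid::NullGrParArrR"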
Finally, if 4 ghost zones really are required then this should be enforced, either by changing the allowed range in param.ccl (if 4 are always required) or through a PARAMCHECK routine (in case the required number of ghost zones depends on, say, the finite difference order).
reporter - removed comment
- removed comment
The remaining 'issue' is that the interpolator is sensitive to the domain decomposition. This is (I think) because the grid spacing is calculated locally via something like dspace = (upper - lower)/nzones. I'll attach a parfile where dspace is a power of 2, which should give nearly identical answers (to within the default tolerance) on any number of processors.
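To make the suspected mechanism concrete, here is a small illustrative sketch (plain Python, not the actual NullInterp code) of how a spacing recomputed from local bounds can disagree at roundoff level, and why a power-of-2 spacing avoids the problem:

# Illustrative only: names are made up, not taken from NullInterp.

def spacing(lower, upper, nzones):
    return (upper - lower) / nzones

# Global grid [0,1] with 10 zones: dx = 0.1 is not exactly representable.
dx_global = spacing(0.0, 1.0, 10)
lower_p1 = 0.0 + 3 * dx_global        # local lower bound of the second process
dx_local = spacing(lower_p1, 1.0, 7)  # spacing recomputed from local bounds
print(dx_global - dx_local)           # ~1.4e-17, not zero

# With a power-of-2 spacing every step above is exact:
dx_global = spacing(0.0, 1.0, 8)      # dx = 0.125
lower_p1 = 0.0 + 3 * dx_global
dx_local = spacing(lower_p1, 1.0, 5)
print(dx_global - dx_local)           # exactly 0.0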
reporter - removed comment
I used the parfile and there are still differences. To try and see if the coordinates or the angular spacing were to blame, I added HDF5 output for "NullGrid::StCrd" and "NullGrid::NullGrParArrR" and used the attached python script to recombine the 2-processor data into a single chunk. There is no difference in either the coordinates or the angular grid spacing.
Are there other coordinates to consider? E.g. the complex ones (though if the real ones agree I am not sure why the complex ones would not).
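For reference, a minimal sketch of the kind of recombination such a script would do, assuming CarpetIOHDF5-style chunked output with a single iteration per file; the file names, dataset prefix, and the 2-component iorigin layout are assumptions, not taken from the actual attached script:

import numpy as np
import h5py

def recombine(filename, varname):
    # Reassemble the per-process chunks of one 2D variable into a single array.
    with h5py.File(filename, "r") as f:
        names = [n for n in f if n.startswith(varname)]
        # Global extent: per direction, the max over chunks of origin + size.
        extent = np.zeros(2, dtype=int)
        for n in names:
            origin = np.asarray(f[n].attrs["iorigin"])  # grid indices (x, y)
            size = np.asarray(f[n].shape[::-1])         # HDF5 datasets are C order
            extent = np.maximum(extent, origin + size)
        full = np.empty(extent[::-1])                   # shape (ny, nx)
        for n in names:
            ox, oy = f[n].attrs["iorigin"]
            ny, nx = f[n].shape
            full[oy:oy+ny, ox:ox+nx] = f[n][()]
        return full

one = recombine("stcrd_1proc.h5", "NULLGRID::stcrd")
two = recombine("stcrd_2proc.h5", "NULLGRID::stcrd")
print(np.abs(one - two).max())  # expect 0.0 if the grids really agree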
- removed comment
Here's a patch to include the check for the correct number of ghostzones:
diff --git a/NullInterp/src/NullInterp_ParamCheck.F90 b/NullInterp/src/NullInterp_ParamCheck.F90
index deee049..84bbbf4 100644
--- a/NullInterp/src/NullInterp_ParamCheck.F90
+++ b/NullInterp/src/NullInterp_ParamCheck.F90
@@ -21,6 +21,15 @@ subroutine NullInterp_ParamCheck(CCTK_ARGUMENTS)
     call CCTK_PARAMWARN("Must set NullGrid::N_ang_ghost_pts >= NullGrid::N_ang_stencil_size")
   end if
 
+  if (deriv_accuracy.eq.4 .and. N_ang_ghost_pts.lt.4) then
+    call CCTK_PARAMWARN("If deriv_accuracy==4, then N_ang_ghost_pts >=4 is required")
+  end if
+
+  if (deriv_accuracy.eq.4 .and. N_ang_stencil_size.lt.4) then
+    call CCTK_PARAMWARN("If deriv_accuracy==4, then N_stencil_size >=4 is required")
+  end if
+
+  ! derivative_accuracy = deriv_accuracy
 ! if (deriv_accuracy.eq.4) then
- removed comment
I pushed the changes to the testsuite and parameter checking routine.
- changed status to resolved
- removed comment
- removed comment
AEILocalInterp (is that the one being used?) by default adjusts the interpolation stencil if there are insufficient ghost zones and the interpolation point is next to an interprocessor boundary. To avoid this, you should use
boundary_off_centering_tolerance={0.0 0.0 0.0 0.0 0.0 0.0}
boundary_extrapolation_tolerance={0.0 0.0 0.0 0.0 0.0 0.0}
in the interpolator parameter string.
See https://bitbucket.org/einsteintoolkit/einsteinanalysis/src/69f49fffa270/Multipole/param.ccl.
See also https://trac.einsteintoolkit.org/ticket/181, in which it was decided to make AEILocalInterp abort if this condition was detected, rather than changing the interpolation operator, but as far as I know this was not implemented.
- removed comment
I think the issue is that the interpolation is not independent of the domain decomposition. The spacing between points in a given coordinate system is given by (local_max - local_min)/num_cells. I can set up a testsuite that does not evolve the data, but only tests the metric on the inner worldtube. This won't test the main null evolution, nor the Bondi news calculation.