cyclone_itg test fails due to cray compiler bug on JFRS

Issue #9 resolved
Ryusuke Numata created an issue

A segmentation fault occurs in subroutine c_redist_36_mpi_copy_nonblock. When an array is passed to nbsend/nbrecv with its bounds specified, the compiler treats it as an array section and creates a temporary object on the call stack, which causes the error. According to JFRS support, this is a Cray compiler bug. Attached is a minimal test program for reproducing this error.

Comments (18)

  1. Ryusuke Numata reporter

    On JFRS-1, Cray Fortran Version 8.7.0 and the cray-mpich/7.7.0 module are currently available.

  2. Ryusuke Numata reporter

    I’ve got an update from JFRS support saying that this does not seem to be a Cray bug.

    The problem occurs when a temporary object is passed to the non-blocking communication routines. When a temporary object is created as exemplified below and passed to the non-blocking routines, a segfault may occur because tmp may already have been deallocated by the time the communication accesses it.

    allocate(tmp(7,4))
    tmp(1:7,1:4)=a(2:8,1:4,3)   ! copy-in: the section is packed into a contiguous temporary
    call sub(tmp)               ! e.g. posts a non-blocking send and returns immediately
    a(2:8,1:4,3)=tmp(1:7,1:4)   ! copy-out
    deallocate(tmp)             ! tmp is freed while the communication may still be in flight
    

    The reason a temporary object is created is the following: the Fortran 90 standard allows an array section to be passed to a subroutine as an actual argument. However, for backward compatibility, a dummy argument is assumed to occupy contiguous storage if the procedure has no explicit interface. To handle both requirements, compilers may create a temporary object; exactly when one is created depends on the compiler’s implementation. Note that we can write a sample program that fails with the same problem even for other compilers.
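
    To make the failure mode concrete, here is a minimal sketch (not the attached reproducer) of the same hazard with the pre-F08 `use mpi` bindings, sending a section to self on one rank; whether a temporary is actually created is compiler dependent:

    program isend_slice_hazard
      use mpi
      implicit none
      real :: a(10,4,5), b(7,4)
      integer :: sreq, rreq, ierr
      call MPI_Init(ierr)
      a = 1.0
      call MPI_Irecv(b, 28, MPI_REAL, 0, 0, MPI_COMM_SELF, rreq, ierr)
      ! a(2:8,1:4,3) is non-contiguous: the compiler may pack it into a
      ! hidden temporary and free that temporary as soon as MPI_Isend
      ! returns, before the message has actually been sent.
      call MPI_Isend(a(2:8,1:4,3), 28, MPI_REAL, 0, 0, MPI_COMM_SELF, sreq, ierr)
      call MPI_Wait(sreq, MPI_STATUS_IGNORE, ierr)
      call MPI_Wait(rreq, MPI_STATUS_IGNORE, ierr)
      call MPI_Finalize(ierr)
    end program isend_slice_hazard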

    So this seems to be a potential GS2 bug. We must think about this problem more seriously. Any suggestions?

  3. Ryusuke Numata reporter

    JFRS support suggests using the mpi_f08 module, in which all MPI routines have an explicit interface.
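
    For reference, a minimal sketch of the same communication under mpi_f08 (assuming an MPI library whose mpi_f08 module supports the TS 29113 assumed-rank buffers):

    program isend_slice_f08
      use mpi_f08   ! explicit interfaces for every MPI routine
      implicit none
      real, asynchronous :: a(10,4,5)
      real :: b(7,4)
      type(MPI_Request) :: sreq, rreq
      call MPI_Init()
      a = 1.0
      call MPI_Irecv(b, 28, MPI_REAL, 0, 0, MPI_COMM_SELF, rreq)
      ! The buffer dummy is TYPE(*), DIMENSION(..) and ASYNCHRONOUS, so the
      ! section descriptor is passed directly and no transient copy is made.
      call MPI_Isend(a(2:8,1:4,3), 28, MPI_REAL, 0, 0, MPI_COMM_SELF, sreq)
      call MPI_Wait(sreq, MPI_STATUS_IGNORE)
      call MPI_Wait(rreq, MPI_STATUS_IGNORE)
      call MPI_Finalize()
    end program isend_slice_f08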

  4. David Dickinson

    I agree that using mpi_f08 is sensible, but I’m not sure we can rely on this being available for all systems, so we’d have to add some more configuration/options to the makefile and preprocessor setup.

    My understanding is that we did originally use something like array(:, :, :, count), but @josephparker changed this recently. I don’t recall the exact reason, but I think it was to avoid some other bug, so we could do with his input here.

    I thought the non-blocking mpi routines should make a private copy of the data before releasing the process, but I may be wrong here (or it may be implementation dependent). If this really isn’t the case then I agree this is probably a significant problem for GS2, though it is surprising we haven’t seen the problems more often; I guess we don’t pass array slices very often.

    The use of persistent non-blocking sends is presumably one way to avoid this: we essentially provide our own buffer that we copy data into for the send/recv, so there should be no lifetime issues.
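
    For what it’s worth, a minimal sketch of that persistent-send idea (names are illustrative, pre-F08 bindings assumed):

    module persistent_send_sketch
      use mpi
      implicit none
      real, allocatable :: sendbuf(:)   ! long-lived buffer that we own
      integer :: sreq
    contains
      subroutine init_send(n, dest, tag)
        integer, intent(in) :: n, dest, tag
        integer :: ierr
        allocate(sendbuf(n))
        ! Bind the persistent request to our buffer once, up front.
        call MPI_Send_init(sendbuf, n, MPI_REAL, dest, tag, MPI_COMM_WORLD, sreq, ierr)
      end subroutine init_send

      subroutine start_send(a)
        real, intent(in) :: a(:,:,:)
        integer :: ierr
        ! Pack the section into storage whose lifetime we control, so no
        ! compiler temporary is ever handed to the communication. The
        ! caller must MPI_Wait on sreq before the next start_send.
        sendbuf = reshape(a(2:8,1:4,3), shape(sendbuf))
        call MPI_Start(sreq, ierr)
      end subroutine start_send
    end module persistent_send_sketch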

  5. David Dickinson

    The adoption of mpi_f08 looks like a long term project as it will require changing a large number of types throughout the code whilst seeking a way to still support the old approach. As such I think we have three options for the impending release 8.0.2:

    1. Adopt the submitted fix – this has the downside of potentially requiring more communication than necessary (but slower is better than wrong). It also requires an additional change to avoid the original bug that motivated this change (we’ve identified and tested that fix already).
    2. Release “as is” with a known bug but ensure this is documented – this avoids the communication penalty of 1 for those compilers where it’s not needed, but doesn’t help those that need/want to use the cray compiler (unless we provide a patch file to be used by those with cray).
    3. Rename redistribute.f90 to redistribute.fpp so that it gets preprocessed, and then conditionally introduce the fix from 1 only for builds with cray (see the sketch after this list).
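
    Roughly, 3 would look something like the following (the CRAY_BUFFER_WORKAROUND macro is hypothetical; the real flag would come from the makefile/preprocessor setup):

    subroutine send_slice(a, buf, dest, tag, req)
      use mpi
      implicit none
      real, intent(in) :: a(:,:,:)
      real, intent(inout) :: buf(:,:)   ! caller-owned, outlives the request
      integer, intent(in) :: dest, tag
      integer, intent(out) :: req
      integer :: ierr
    #ifdef CRAY_BUFFER_WORKAROUND
      ! Safe path (the fix from 1): copy into contiguous caller-owned
      ! storage so no compiler temporary reaches the non-blocking send.
      buf = a(2:8,1:4,3)
      call MPI_Isend(buf, size(buf), MPI_REAL, dest, tag, MPI_COMM_WORLD, req, ierr)
    #else
      ! Fast path: pass the section directly, as the current code does.
      call MPI_Isend(a(2:8,1:4,3), 28, MPI_REAL, dest, tag, MPI_COMM_WORLD, req, ierr)
    #endif
    end subroutine send_slice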

    Are there any thoughts on these options? I’ll tag @Joseph Parker @Peter Hill and @Colin Malcolm Roach for their thoughts as well.

  6. Joseph Parker

    I’m generally not in favour of compiler-specific source code, but I’d prefer 3 here so that we don’t slow down the code for other compilers unnecessarily.

    In the end, are we thinking of this as “a Cray bug” that the Cray-specific source code is a workaround for?

  7. Peter Hill

    Am I right in thinking this is actually Cray MPI specific, and not necessarily the Cray compiler? Not that I imagine many people are using one without the other.

    Personally, I would lean more towards 1 given that a) (IIUC) this is not likely a very common redistribute to be doing, and b) it’s plausible other MPI implementations/compiler combinations have the same problem.

  8. David Dickinson

    I don’t think it’s a bug with Cray – more a potentially unsafe bit of code in gs2 that in practice is fine for most compilers except cray. See https://stackoverflow.com/a/19464663 for some relevant discussion about non-blocking mpi in fortran.

    I’m not a fan of compiler specific code either. If we went with 3 I’d be keen to ensure we get a better fix by the next release (perhaps moving to `mpi_f08` is the solution – but this isn’t universally available, so we’d have to require that cray users use mpi_f08 or similar).

    There are other possible approaches that may fix this, such as ensuring we pack into a 1D buffer rather than sending a slice of a 3D array or using a custom mpi_type, but I think these are longer term approaches.
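
    For instance, a custom type built with MPI_Type_create_subarray can describe the section in place, so that neither we nor the compiler has to copy anything. A sketch (pre-F08 bindings, not code from any existing branch):

    subroutine send_subarray(a, dest, tag, req)
      use mpi
      implicit none
      real, intent(in) :: a(10,4,5)
      integer, intent(in) :: dest, tag
      integer, intent(out) :: req
      integer :: slice_type, ierr
      ! Select the 7x4x1 block of the 10x4x5 array starting at (2,1,3);
      ! the starts are zero-based in MPI_Type_create_subarray.
      call MPI_Type_create_subarray(3, [10,4,5], [7,4,1], [1,0,2], &
           MPI_ORDER_FORTRAN, MPI_REAL, slice_type, ierr)
      call MPI_Type_commit(slice_type, ierr)
      ! Pass the whole (contiguous) array; the datatype picks out the
      ! section in place, so no compiler temporary is created.
      call MPI_Isend(a, 1, slice_type, dest, tag, MPI_COMM_WORLD, req, ierr)
      call MPI_Type_free(slice_type, ierr)   ! freed once the send completes
    end subroutine send_subarray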

  9. David Dickinson

    It’s not just cray-mpi (although perhaps exacerbated by it) as the problem goes away on Archer if we use the intel compiler but stick with the cray-mpich module.

  10. David Dickinson

    I notice this fix doesn't change the inv version of the routine, which has essentially the same code. It is curious that this doesn’t lead to exactly the same issue that we are fixing here.

    Any thoughts on why this is the case or what the differences are?

  11. David Dickinson

    Digging into the differences has allowed me to come up with another possible bug fix that doesn’t have the downsides of the other options. I’ve tested this with cray on Archer and all tests and benchmarks now pass. @Ryusuke Numata, are you able to check whether the branch bugfix/alternative_fix_for_bug_9_cray also resolves your observation of the issue? See PR #26.

  12. Ryusuke Numata reporter

    @David Dickinson I have confirmed that the branch bugfix/alternative_fix_for_bug_9_cray resolves the problem. Both tests and benchmarks pass for both cray and intel.
