Some tests fail on JFRS-1 Cray XC50 systems with the Cray compiler

Issue #46 resolved
Ryusuke Numata created an issue

The following tests fail.

• asym_geo_fourier
• asym_geo_global
• job_manage (avail_cpu_time)
• gs2_gryfx_zonal
• nonlinear_terms
• asym_geo_miller
• gs2_diagnostics_new (all 4 cases)
• theta_grid
• le_grids
• gs2_optimization (2nd one)
• gs2_init
• asym_geo_genElong
• wstar_units
• cyclone_itg (3rd one)

I attach full output messages. If I change the compiler to Intel or GNU, everything works fine.

Comments (16)

  1. Ryusuke Numata reporter

    No. These test failures are runtime problems. PRs #59 and #62 fix compile-time errors.

    I've looked at some of the tests and found that there are certainly some problems which most compilers ignore. For example, in gs2_diagnostics, write_omega is called with istep=-1, causing an out-of-bounds error for omegahist_woutunits. These are probably not problems in the main code, but only in the drivers of the unit tests.

    I think these should be fixed, but I'm getting tired of checking all of them, as most users (and compilers) do not care...

  2. David Dickinson

    I'll have a look at fixing these as I have access to a Cray compiler -- I find this is often the main challenge of maintaining support for a large range of compilers!

  3. Ryusuke Numata reporter

    I'm asking JFRS support to help investigate this issue. Some failures can be avoided by setting the stack size to unlimited and by changing the optimization options.

  4. Ryusuke Numata reporter

    With help from JFRS support, I’ve figured out all the problems and solutions for JFRS with the Cray compiler.

    • increase stack size: On JFRS, the stack size is limited to 8192 kB by default, so users must set it to unlimited by hand. (This is not a GS2 problem.)
    $ ulimit -s unlimited
    
    • reduce optimization level: The Cray compilers try to do aggressive optimization, which may cause runtime errors. PR #25 of Makefiles reduces the default optimization level.
    • Cray compiler or MPICH bug: Due to a bug, the cyclone_itg test fails. I will create a PR to sidestep this bug for the moment. This problem has been reported to Cray by JFRS support, so it will be fixed. See utils' Issue #9 and PR #22.

    There’s another Cray compiler bug, which prevents the next branch from being compiled. With the Cray compiler, when the -J option is not given, the module files (.mod) are placed in the object file location, and the compilation then fails because the module files cannot be found. This behavior is inconsistent with the online manual. I will create another PR to sidestep this problem. (See Makefiles' PR #27)
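
    As a rough illustration of the stack size step above, here is a minimal sketch of how a job script might raise the limit before launching the tests. The helper name raise_stack_limit is hypothetical and not part of GS2 or the JFRS environment; the fallback to the hard limit is an assumption for systems where "unlimited" is not permitted.

```shell
#!/bin/sh
# Sketch only: raise the soft stack limit as far as the system allows.
# raise_stack_limit is a hypothetical helper name, not part of GS2.
raise_stack_limit() {
    echo "stack limit before: $(ulimit -S -s)"
    # Try "unlimited" first; this fails if the hard limit is lower.
    if ! ulimit -S -s unlimited 2>/dev/null; then
        # Fall back to the hard limit, the largest value we may set.
        ulimit -S -s "$(ulimit -H -s)"
    fi
    echo "stack limit after: $(ulimit -S -s)"
}

# Call the function directly (not inside $(...)) so the new limit
# applies to this shell and to every command started from it.
raise_stack_limit
```

    The limit only affects the current shell and its child processes, so in a batch job this would have to run in the job script itself, before the tests are started.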

  5. David Dickinson

    That’s brilliant, thanks for digging into all of these issues and finding solutions for them. We’ll try to make sure these fixes all get into 8.0.2.

  6. Ryusuke Numata reporter
    • changed status to open

    It turns out that one of the problems remains unresolved.

    The gs2_diagnostics_new test fails because an out-of-bounds error occurs in write_omega. This happens when run_diagnostics is called with istep=-1. For unknown reasons, this out-of-bounds access is caught only by the Cray compiler; GNU and Intel do not catch it.

    This problem looks harmless, but it is clearly a bug, so it should be resolved.

  7. David Dickinson

    So it looks like, during initialisation for builds with new diagnostics, we call run_diagnostics twice: first with istep=-1 and then with istep=0. Internally, it seems new diagnostics uses istep=-1 to indicate that variables should be created but not read or written (see gnostics%create), so the first call ensures the variables are created and the second call is meant to populate them with the initial values. I think this ideally could do with a lot of redesign.

    I think the simplest fix is to use the istep=-1 case to set the gnostics%create and gnostics%write flags as intended, but then to replace istep with 0.

    I’ve pushed a quick attempt at a fix to the branch bugfix/fix_istep_minus_one_issue_46
