CarpetLib internal error with PreSync and analysis thorns

Issue #2639 new
Samuel Cupp created an issue

I’ve found a strange error involving variables from analysis thorns with multiple timelevels (such as WeylScal4 and ML_ADMConstraints). The error is

WARNING level 0 from host r1i6n25.ib0.sawtooth.inl.gov process 0
  in thorn CarpetLib, file /home/cuppsamu/toolkit/Cactus/arrangements/Carpet/CarpetLib/src/gdata.cc:315:
  -> Internal error: extrapolation in time.  variable=ML_ADMCONSTRAINTS::H  time=0.040625000000000001  times=[0.018749999999999999,0,-0.018749999999999999]
cactus_intel19: /home/cuppsamu/toolkit/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275: int Carpet::Abort(const _cGH *, int): Assertion `0' failed.
Rank 0 with PID 185127 received signal 6

This error comes from repos/carpet/CarpetLib/src/gdata.cc:

void gdata::find_source_timelevel(vector<CCTK_REAL> const &times,
                                  CCTK_REAL const time, int const order_time,
                                  operator_type const op, int &timelevel0,
                                  int &ntimelevels) const {
  // Ensure that the times are consistent
  assert(times.size() > 0);
  assert(order_time >= 0);

  CCTK_REAL const eps = 1.0e-12;
  CCTK_REAL const min_time = *min_element(times.begin(), times.end());
  CCTK_REAL const max_time = *max_element(times.begin(), times.end());
  // TODO: Use a real delta-time from somewhere instead of 1.0
  CCTK_REAL const some_time = std::fabs(min_time) + std::fabs(max_time) + 1.0;
  if (op != op_copy) {
    if (time < min_time - eps * some_time or
        time > max_time + eps * some_time) {
      ostringstream buf;
      buf << setprecision(17) << "Internal error: extrapolation in time.";
      if (varindex >= 0) {
        char *const fullname = CCTK_FullName(varindex);
        buf << "  variable=" << fullname;
        ::free(fullname);
      }
      buf << "  time=" << time << "  times=" << times;
      CCTK_ERROR(buf.str().c_str());
    }
  }
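
Plugging the numbers from the warning into this check shows why it fires: the requested time 0.040625 lies outside the range spanned by the available times [-0.01875, 0.01875]. Below is a minimal standalone sketch of just that bounds test, with the values copied from the warning (this is not Carpet code, only the same arithmetic):

// Standalone sketch of the bounds test above, using the values from the
// warning (not Carpet code, just the same arithmetic).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<double> times = {0.018749999999999999, 0.0,
                                     -0.018749999999999999};
  const double time = 0.040625; // cctk_time requested for ML_ADMCONSTRAINTS::H

  const double eps = 1.0e-12;
  const double min_time = *std::min_element(times.begin(), times.end());
  const double max_time = *std::max_element(times.begin(), times.end());
  const double some_time = std::fabs(min_time) + std::fabs(max_time) + 1.0;

  // time > max_time + eps*some_time, so this reproduces the
  // "extrapolation in time" abort
  if (time < min_time - eps * some_time or time > max_time + eps * some_time)
    std::printf("extrapolation in time: time=%g times=[%g,%g]\n", time,
                max_time, min_time);
  return 0;
}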

I attached the backtrace, as well as the thornlist and parfile I am using.

If I comment out ML_ADMConstraints, I get the same error for WeylScal4. If I also comment out WeylScal4, the simulation proceeds until I hit an error related to the read/writes I’m working on. Since this happens in CycleTimeLevels → SyncProlongateGroups, it should be related to how we’re handling the timelevels for these variables. My understanding of why these analysis variables need multiple timelevels is that the IO thorns may try to output at an iteration where the coarsest level isn’t available, so Carpet has to interpolate in time to get the values. However, these variables are only for analysis, so nothing ever reads them except IO. Since the error only happens when I turn on PreSync (Cactus::presync_mode = "mixed-error"), I suspect this is somehow connected to how we automate the ghost zones/outer boundaries (no syncs are automatically triggered since there are no READs for these variables), but I don’t have a clear picture of how that could affect this.

More confusingly, if I just add ML_ADMConstraints to magnetizedTOV-Baikal.par in Baikal, it doesn’t error out. I don’t have a guess for what is causing the behavior, or what triggers it in this parfile.

Comments (12)

  1. Roland Haas

    IO triggers a SYNC, so if variables are output they will be synced. Also, there’s a forced SYNC before time level cycling.

  2. Samuel Cupp reporter

    That's what I thought, which is why I don't have any good guesses as to the source of the problem. Do you have suggestions for what I should look at Monday for this?

  3. Steven R. Brandt

    One thing I sometimes try when nothing is making sense is to run the code through valgrind.

  4. Samuel Cupp reporter

    I’ve made some progress in tracking down the source of this error. The failing check that time lies within the min/max of times traces back to testing whether

    const CCTK_REAL time = cctkGH->cctk_time          (carpet/Carpet/src/Comm.cc:253)
    

    is inside

    times.AT(i) = t.get_time(ml2, rl2, tl2s.AT(i))    (carpet/CarpetLib/src/ggf.cc:540)
    

    where the number of source timelevels is set by

    tl2s: tl2s.resize(prolongation_order_time + 1)    (carpet/CarpetLib/src/ggf.cc:379)
    

    In this parfile, the first several reflevels are not subcycling, so they should all step at the same iterations. My print statements are inside these functions, so a lot gets printed; the first times I see for reflevel 0 are 0.15 and 0.3 (presumably it’s also running at the RK4 half-step, which is why I’m getting two times for each iteration on a single level). If I run with PreSync off, I see these same times appear for all the levels moving in lockstep, and the times then move in smaller increments once it reaches the subcycled reflevels.

    However, this does not happen with PreSync. Instead, when it gets to reflevel 1, cctk_time returns a value of 1 instead of 0.15, which is not inside the bounds of that vector of times, causing the error. So somehow PreSync is affecting what cctkGH->cctk_time is.

    Now that I have concrete behavior to look for, I’m going to try to construct a simpler test case that still reproduces this, to make debugging easier.
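
    In the meantime, the quickest way I have to see the bad value is a debug print right where the time is read (a throwaway sketch, not committed code; it just prints the time Carpet is about to interpolate to, so the jump from 0.15 to 1 on reflevel 1 is visible):

    // Throwaway debug sketch near the line in Comm.cc quoted above: print
    // the time Carpet is about to interpolate to on each refinement level.
    const CCTK_REAL time = cctkGH->cctk_time;
    CCTK_VInfo(CCTK_THORNSTRING,
               "SyncProlongateGroups: reflevel=%d cctk_time=%g",
               Carpet::reflevel, (double)time);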

  5. Samuel Cupp reporter

    So, the heart of the problem comes from (I think) the function

    th::regrid()
    

    in repos/carpet/CarpetLib/src/th.cc:51. When this runs, it sets the

    deltas.AT(ml).AT(rl)
    

    object to 1.0 (possibly divided by the timereffacts), which then causes an incorrect cctk_time (presumably the 1 instead of 0.15 seen above, since the real coarse-level dt here is 0.15). This seems to pollute some functions which call get_time. This also happens without PreSync, but by miracle (or design?) the delta gets reset to the right value before anything tries to use it. With PreSync, that reset does not happen in time.

  6. Samuel Cupp reporter

    To expand on this, the code in regrid() is

    void th::regrid() {
      CCTK_REAL const basetime = 0.0;
      CCTK_REAL const basedelta = 1.0;
    
      const int old_mglevels = times.size();
      times.resize(h.mglevels());
      deltas.resize(h.mglevels());
      for (int ml = 0; ml < h.mglevels(); ++ml) {
        const int old_reflevels = times.AT(ml).size();
        times.AT(ml).resize(h.reflevels());
        deltas.AT(ml).resize(h.reflevels());
        for (int rl = 0; rl < h.reflevels(); ++rl) {
          if (ml == 0) {
            deltas.AT(ml).AT(rl) = basedelta / reffacts.AT(rl);
          } else {
            deltas.AT(ml).AT(rl) = deltas.AT(ml - 1).AT(rl) * h.mgfact;
          }
    

    As seen here, deltas.AT(ml).AT(rl) is set to basedelta / reffacts.AT(rl) (for mglevel 0), and since basedelta is hard-coded to 1.0, the coarser grids get a delta of exactly 1. With PreSync, some synchronization/boundary conditions are applied during the short window in which this is 1, which doesn’t happen without PreSync; I believe this is where the incorrect time slips in. Since the deltas are quickly reset, most of the times afterward are correct. It’s only later, at the next call to CycleTimeLevels, that this inconsistency finally breaks the code.

    I believe that basedelta should be set to the dt of the coarsest level. At the very least, doing this explicitly for my case fixes the error. I haven’t verified that the two runs produce the same results, but it seems like a more reasonable choice than an arbitrary default of 1.
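
    Concretely, the kind of change I have in mind looks something like this (an untested sketch against the excerpt above, not an actual patch; how regrid() should best obtain the physical coarse-level dt is exactly the open question):

    // Untested sketch: instead of hard-coding basedelta = 1.0, reuse the
    // delta that was already stored for the coarsest level before the
    // regrid, falling back to 1.0 only on the very first call when nothing
    // has been stored yet. deltas and reffacts are the members shown above.
    CCTK_REAL basedelta = 1.0;
    if (not deltas.empty() and not deltas.AT(0).empty())
      basedelta = deltas.AT(0).AT(0) * reffacts.AT(0);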

  7. Roland Haas

    I would say @Erik Schnetter will have to weigh in on this one. Carpet does not actually know the timestep itself initially; it squirrels it away when it is first set by thorn Time. Assuming basedelta to be 1.0 does seem wrong, though.

  8. Samuel Cupp reporter

    If I remember correctly, the timesteps for each reflevel have already been set by the time this runs. It’s the regridding that causes it to temporarily forget what the deltas are.

  9. Samuel Cupp reporter

    Following up on this ticket, I know you made a branch with a partial fix, Roland. Do you think that’s ok to push in, or do we need to do something more? Since the behavior is broken now, I don’t see why we shouldn’t at least merge in a partial fix.

  10. Samuel Cupp reporter

    Roland, you asked for a parfile to reproduce the behavior, but it looks like I included one in the original ticket. To get the reads/writes for BaikalVacuum, you will want to check out my scupp/BaikalRDWR branch of wvuthorns.
