CarpetIOHDF5 too verbose while reading from checkpoint

Issue #550 closed
Frank Löffler created an issue

It is not that uncommon that I recover using a different number of processors. Every time I do this I see the error file cluttered with messages like

WARNING level 1 in thorn CarpetIOHDF5 processor 21 host (line 640 of /work/00920/tg459479/Cactus/arrangements/Carpet/CarpetIOHDF5/src/ -> Variable AHFINDERDIRECT::ahmask on rl 0 and tl 0 not read completely. Will have to look for it in other files

I expect this; it is not an error and not really something to warn about. I acknowledge that this warning might have been introduced when recovering using the same number of processors was a problem and caused all files to be read, but I don't think this is an issue anymore.

I propose to change the warnlevel for this message to CCTK_WARN_DEBUG(4). In addition it would be good to have one separate message with level CCTK_WARN_PICKY(3) if any variable/reflevel/timelevel could not be read completely (but not one for each of these), ideally only once for all processors. This would not clutter the output of the default simfactory runs (-L 3) too much, but would indicate that this happened - and in case this is a problem it's easy to enable -L 4.
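
The proposal above can be sketched as follows. This is a minimal, self-contained illustration and not the CarpetIOHDF5 code: `Logger`, `Incomplete`, and `report_incomplete` are hypothetical names, and the logger only mimics how a threshold such as `-L 3` filters messages. Only the level numbers (3 for CCTK_WARN_PICKY, 4 for CCTK_WARN_DEBUG) are taken from the text.

```cpp
#include <string>
#include <vector>

// Numeric levels as quoted in the issue text.
constexpr int WARN_PICKY = 3;
constexpr int WARN_DEBUG = 4;

// Hypothetical stand-in for the warning facility: a message is kept only
// if its level does not exceed the logging threshold (the -L option).
struct Logger {
  int threshold;
  std::vector<std::string> lines;
  void warn(int level, const std::string &msg) {
    if (level <= threshold) {
      lines.push_back("WARNING level " + std::to_string(level) + ": " + msg);
    }
  }
};

// Hypothetical record of one dataset that a file did not fully provide.
struct Incomplete {
  std::string varname;
  int reflevel;
  int timelevel;
};

// One debug-level message per incomplete dataset, plus a single
// picky-level summary if anything at all was incomplete.
void report_incomplete(Logger &log, const std::vector<Incomplete> &items) {
  for (const Incomplete &it : items) {
    log.warn(WARN_DEBUG, "Variable " + it.varname + " on rl " +
                             std::to_string(it.reflevel) + " and tl " +
                             std::to_string(it.timelevel) +
                             " not read completely");
  }
  if (!items.empty()) {
    log.warn(WARN_PICKY, std::to_string(items.size()) +
                             " dataset(s) not read completely; "
                             "looking for them in other files");
  }
}
```

With a threshold of 3 only the summary line is kept; raising the threshold to 4 also keeps the per-dataset lines. Restricting the summary to a single process is not shown here.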


Comments (14)

  1. Erik Schnetter
    • removed comment

    This message indicates a major performance problem; sometimes, recovery can take one or two orders of magnitude longer when this occurs.

    There should be one L1 message for the first variable. The following messages could be L3.

    Maybe we should set the Simfactory default to L2? I expect L3 to be verbose.

  2. Frank Löffler reporter
    • removed comment

    I agree, one warning instead of many would be ok, even in the general case. But we should also state in the warning that this is expected for restarts from different numbers of processes (how difficult would it be to check for that?).

    The remaining messages could then be L4, not L3. They are only useful if you would like to debug a problem, and CCTK_WARN_DEBUG(4) sounds like it was made for this.

  3. Ian Hinder
    • removed milestone
    • removed comment

    There doesn't seem to be progress on this, and the release is in a few days. This was a problem in the last release as well, so there is no regression. Removing milestone.

  4. Ian Hinder
    • removed comment

    If you recover on a different number of processors, then the messages are irrelevant - you already know that the system has to check other files. At most, I would think a single message along the lines of "Warning: recovering on a different number of processes is often slow" would be all you would need. If you are NOT recovering on a different number of processes, then the messages indicate a serious unexpected performance problem. There is the related issue of this happening because of Carpet choosing a grid structure at runtime different to what was in the checkpoint file (see I do not know if this has been fixed.

  5. Erik Schnetter
    • removed comment

    These warning messages were put there because they were explicitly requested as a feature. I do not want to remove them without a discussion that includes the people who requested them, because otherwise we will run in circles. I am asking the person(s) who request the removal of these messages to ensure such a discussion happens. I do not care about trac vs. mailing list vs. chat vs. telecon vs. mind reading -- a few lines of summary in trac will be good enough.

  6. Frank Löffler reporter
    • removed comment

    AFAIR there was an agreement that these warning messages should at least appear only once, not in the numbers they currently do. Even if there is a problem, a cluttered log doesn't improve anything. Thus, I proposed to keep one message at level 3 and change the others to level 4, so that things can still be debugged easily if necessary. We can make the level 3 message level 1 in case the number of processes is the same as the number of files we recover from.

    Do you agree with that?
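
The level selection suggested in this comment could look roughly like the sketch below. It assumes the number of processes and the number of checkpoint files are both known at the point where the summary is issued; `Summary` and `incomplete_read_summary` are hypothetical names, and the message texts are illustrative only.

```cpp
#include <string>

// Numeric levels: 1 is the most severe warning level, 3 is "picky".
constexpr int WARN_ALERT = 1;
constexpr int WARN_PICKY = 3;

// Hypothetical result type: the level to warn at and the text to print.
struct Summary {
  int level;
  std::string text;
};

// If the process count matches the file count, incomplete reads are
// unexpected and get level 1; otherwise they are expected and get level 3.
Summary incomplete_read_summary(int nprocs, int nfiles) {
  if (nprocs == nfiles) {
    return {WARN_ALERT,
            "Some variables were not read completely although the number "
            "of processes matches the number of checkpoint files"};
  }
  return {WARN_PICKY,
          "Some variables were not read completely; this is expected when "
          "recovering on a different number of processes"};
}
```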

  7. Frank Löffler reporter
    • removed comment

    Quote Christian:

    Reducing the many warnings to just one warning is a good idea! (and perhaps having many warnings as an option -- it's sometimes useful for debugging).

  8. Roland Haas
    • changed status to open
    • removed comment

    The attached patch (either one...) reduces the verbosity. It will output the first warning on level 2 (warn_complain) then the further ones as warn_debug. The later ones are also reduced in that it "only" outputs one warning per variable, per refinement level and per timelevel (so still 3000 for a bbh simulation).

    Ok to apply or do we want to fiddle with the verbosity even more?
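
The behaviour described for the patch (first warning at level 2, later ones at debug level, at most one per variable, refinement level, and timelevel) can be sketched as below. This is not the attached patch; `IncompleteReadWarner`, `level_for`, and the `NO_WARNING` sentinel are hypothetical, and how the real patch tracks what it has already reported is an assumption.

```cpp
#include <set>
#include <string>
#include <tuple>

constexpr int WARN_COMPLAIN = 2;
constexpr int WARN_DEBUG = 4;
constexpr int NO_WARNING = -1;  // hypothetical sentinel: print nothing

// Remembers which (variable, reflevel, timelevel) combinations have
// already been reported and picks the level for each new report.
class IncompleteReadWarner {
  std::set<std::tuple<std::string, int, int>> seen_;

public:
  int level_for(const std::string &varname, int reflevel, int timelevel) {
    const bool first_overall = seen_.empty();
    const bool is_new = seen_.emplace(varname, reflevel, timelevel).second;
    if (!is_new) {
      return NO_WARNING;  // this combination was already reported
    }
    // The very first report is visible by default, later ones only
    // when debug-level output is enabled.
    return first_overall ? WARN_COMPLAIN : WARN_DEBUG;
  }
};
```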
