Carpet commit f33caf3aba96 (hg) breaks timer output in CCTK_Terminate

Create issue
Issue #477 closed
Roland Haas created an issue

the commit to make Cactus timers hierarchical breaks eg. the Multipole testsuites with an error:

cactus_devel: /localdata/rhaas3/Cactus_devel/configs/devel/build/Carpet/ void Carpet::TimerNode::stop(): Assertion `this==d_current' failed.

From the comments around the assert (thank you for those, Ian). It seems as if the patch tries to enforce that timer calls are properly nested ie.

start_timer_A start_time_B stop_timer_B stop_timer_A

When I played with timers myself (to write a thorn that traces how the code proceeds through functions and thus wanted to produce a call and timing graph) I found a small number of basic Cactus timers that are not nested like that (Initialization and/or Evolution I think). If this is still the case it would explain the behaviour.

I attach the log file of testsuite (Multipole/test_22.par).

Backing out of f33caf3aba96 (by applying in inverse patch, hg backout actually gave me conflicts...) makes the testsuite work again.


Comments (11)

  1. Erik Schnetter
    • removed comment

    The state "review" means that the problem has been solved in principle, and that there is a patch attached that needs to be reviewed before the ticket can be closed.

    Are you thus suggesting to back out the hierarchical timers? This would remedy the immediate problem.

    To my knowledge, all timers are perfectly nested. Can you try to find a case where a timer is not nested? This may be a problem in the code.

  2. Roland Haas reporter
    • removed comment

    Then it is in the wrong state (is there a state to ask someone who understands to code to look into it?), sorry. I do not want to suggest to back out the hierachical timers. If this really caused by a single pair of non-nested timer (in the flesh) then I would suggest to fix the order of those timers and to also make the assertion into a warning of high level (if the code can actually handle non-nested timers). As far as I undertand it, the documentation does not state that timer calls must be nested. So it might not be a good idea to enforce this too violently.

    I will do some more tests to find out if there is a non-nested timer and report back.

  3. Erik Schnetter
    • removed comment

    Any ticket is a request asking others to look into things. No special state needed for this...

  4. Roland Haas reporter
    • removed comment

    alright, I tracked it down to a return statement in Carpet/Carpet/src/ line 65 which exits OutputGH after start() but without calling stop(). This only triggers if there are no IO methods (so never occured in real runs I assume). Solutions are obviously to add the timer.stop() call right before the return, or to move the timer.start() call after the if statement and return. Best might be though to have helper class that calls start() in its constructor and stop() in its destructor to time blocks (so that the compiler makes sure that stop() is called):

    class timedblock { public: timedblock(Timer& timer_) timer(timer_) {timer.start();}; timedblock() {timer.stop();}; private: Timer& timer; }

    I have attached a patch to minimally call stop() before the return in OutputGH and some sanity checks in as well as replacing the assert in with a CCTK_WARN. All compile and let me run the test_22.par. They produce the proper error message when I undo the fix as well. Ok to apply?

    Looking at the timer classes I noticed that most of the ones dealing with names take and return string objects (rather then references to strings). Since this means that temporary storage for the strings has to be allocated (on the heap via new), it might be a better to use references to string objects to avoid the new/delete pair for timers that are measuring very short intervals.

  5. Ian Hinder
    • removed comment

    There is no requirement that the Cactus timers be properly nested. However, the changes I made require that objects of the Carpet "Timer" class are used in a nested manner. Those objects use the Cactus timers. As far as I know, timer starting and stopping is lexically correct throughout Carpet. i.e. every Timer.start is followed by a Timer.stop before any timers are started without stopping them. It is possible that something escapes out of a Timer.start .... Timer.stop sequence between the start and stop without calling the Timer.stop code. We should figure out what is causing this and see if we can ensure that it doesn't happen. So in summary: it cannot be timers in the flesh that are called out of sequence, it must be Carpet Timer objects. We should add code to that assert statement to print the names of the two timers. The TimerNode class has a d_name member which gives the name of the timer. We could also add debug output to the start and stop methods of TimerNode to print the names of timers being started and stopping. This should reveal the problem. I don't actually have a checkout of the development version of Carpet to test with at the moment so it's not a quick thing for me to look at right now.

    edit: this was written while Roland was typing the previous comment!

  6. Ian Hinder
    • removed comment

    I did think about using some sort of C++ scoping construct as you suggested to ensure that timers were deleted automaticall. I think I wanted to make the minimal change to the Timer API so that all the existing timer usage didn't have to be changed. One thing I'm not sure about: can you be sure of the order in which the destructors are called if you have more than one timedblock in scope? If not, then you would need to have multiple levels of {...} nesting when you have more than one timer, which would be ugly.

    What happens to the reference if you return a reference to the string, and the TimerNode object subsequently gets deleted? The string will have been deleted (it was part of the TimerNode object) but the reference will remain. In general, the Carpet timers are used around large chunks of code so they should not be performance critical. However, there are some parts of CarpetLib which I think use the timers which this might not apply to.

  7. Roland Haas reporter
    • removed comment

    Apparently the order of destruction is the reverse order of construction (which is useful to know): or formally: (stackoverflow has section numbers) so multiple timers are fine it seems.

    As for references: (as I understand it) if getName() returned a reference and a user assigned this to a reference and then deleted the owning TimerNode then the reference would be a dangling pointer (essentially). Something like

    char *a = new char('a'); char &b = *a; delete a; std::cout << b << std::endl; will segfault (or not...).

    If on the other hand you assign to a string then copying takes place (even if a reference is returned) and everything is fine. In the case where you return a string object but the user assigns to a reference I think you end up with an invalid reference right away, and g++ will not even compile it:

    1. include <string>
    2. include <iostream>

    std::string blurp("blah"); std::string blah() { return blurp; }

    int main() { std::string &foo = blah(); return 0; }

    So my approach would be that the segfault can only occur the user assign to a reference in which case the user knows what she is doing and we supply her with sufficient rope to hang herself if she so chooses :-)

    But as you say, it very likely makes no difference at all unless one times very short, local functions.

  8. Erik Schnetter
    • removed comment

    Strings in C++ are reference counted and have (almost) value semantics, similar to integers. It is very rare to have references to strings; one would use these mostly to return values from a routine. Copying a string is a cheap operation, probably no more expensive than copying a pointer (to the string descriptor) and incrementing a counter (in the string descriptor). This incrementing must be atomic in a multi-threaded environment (such as Carpet), but I still expect a string copy operations to be no more expensive than a few machine instructions.

    Regarding the timed-block construct: it is useful to hide these in a macro, similar to loopcontrol or the various BEGIN_REFLEVEL_LOOP constructs. The END_... parts are not strictly necessary -- I added them only for documentation; you could design a timed region without it.

  9. Log in to comment