use valgrind in test suite

john bachan

No sane engineer would disagree with this suggestion :)

This is probably the perfect time to make the push for this to happen since the legacy CI system is being overhauled quite a bit at the moment. Historically, valgrinding upcxx was cumbersome because GASNet was reporting a slew of false positives. But it looks like it has cleaned up its act since and I last checked and should re-engrain the practice of "trust nothing til valground" mantra into my skull.

And I would like to add my personal thanks for reporting ~~#369~~. That little bit of diligence caught a big bug.

2020-05-23T14:15:12+00:00

Dan Bonachea

marked as minor

Sorry to rain on the parade, but Valgrind for PGAS programs is not as straightforward a win as it might seem as first, even if one ignores/avoids portability issues.

Most of GASNet's dynamic allocations are either permanent or memoized into custom-managed freelist queues, so a large amount of reachable memory at exit is "normal" for conduit internal usage. Some of these are annotated in the GASNet valgrind suppression file, ie: valgrind --suppressions=gasnet.supp ..., but that file is not carefully maintained and thus probably out-of-date. FWIW GASNet also has its own internal private heap tracking utility (see "GASNet debug malloc services" in the README) - this is used by the Berkeley UPC runtime (which is C-based) but the UPC++ runtime is not currently hooked into it.

valgrind has some gotchas with respect to fork() (which smp-conduit uses for spawning), so invoking it correctly might require some voodoo. AFAIK it is incapable of tracing reachability across processes in shared memory segments - which could theoretically result in some false positive leaks reported for private heap objects only reachable from the shared heap. It also fails to correctly handle the XPMEM memory sharing mechanism we use on the Cray XC, which results in a LARGE number of invalid read/write or uninitialized value warnings when used there with even simple programs.

Perhaps more importantly, valgrind is not going to do anything at all for leaks of objects allocated in the shared heap (or in memory kind segments). Unfortunately shared heap is often the more precious resource; as in addition to virtual pages, objects in the shared heap also usually consume locked physical page frames, OS memory-sharing resources, and NIC registration resources. Valgrind and other external tools will see the shared heap segment as just one giant opaque memory region. Even if one could somehow plug our shared heap allocation events into valgrind, solving the common-case reachability problem in the shared heap really is a distributed problem that also requires understanding the global_ptr representation, so I'm pretty sure that's a non-starter.

Nevertheless I agree in principle that some (manual or automated) valgrind monitoring on some carefully chosen tests/configs could help discover a class of private memory leaks in the runtime (issue ~~#369~~ being one example).

2020-05-23T19:05:58+00:00

john bachan

I agree with those technical points but totally disagree that this constitutes rain on the parade intentioned here, and would also like to see this issue promoted back to major.

The issues you raise mostly support why valgrind can't see in the shared heap and therefore is if limited use to idiomatic pgas apps. But the upcxx runtime is not itself like a pgas app. 99% of it's allocations are private heap and the pointers are never communicated, or if they are, then they are expected to be reaped before finalization. I think getting some machine checked confidence there would help me sleep a lot easier at night. Isue 369 clearly demonstrates that letting me fly blind brings hazards

2020-05-23T19:59:29+00:00

Dan Bonachea

changed status to resolved

issue ~~#372~~: Add valgrind leak check wrapper support to run-tests

Make targets that run multiple tests now have the option to wrap the runs in a valgrind leak check. This greatly slows execution time thus is disabled by default, but can be enabled using a command like:

 make check RUN_WRAPPER='$(VALGRIND_WRAPPER)'

Where check may instead be run-tests, dev-check, etc.

Resolves issue ~~#372~~

→ <<cset ab83874cd3e2>>

2020-07-09T04:39:12+00:00

Comments (4)