Show who tests what

Issue #170 new
andrea crotti
created an issue

I was just using the awesome HTML report to see my test coverage and I had the following thought.

Wouldn't it be nice to be able to see easily what parts of the test suites are actually testing my code?

I guess this information is already collected while doing the annotation, right?

This way we could easily see whether the tests are actually good, which is especially important when working with other people's code.

Comments (43)

  1. Ned Batchelder repo owner

    This is a very interesting idea, one that figleaf pioneered with "sections". Right now we don't collect this information. The trace function would have to be modified to walk up the stack to identify "the test", then that information would have to be stored somehow. Then the reporting would have to be changed to somehow display the information.

    That's three significant problems, but only three! Do you have ideas how to do them?

  2. andrea crotti reporter
    • changed status to open

    Well I need to dive more into the internals to suggest something that makes sense, however I see that the stats struct is this:

    struct {
        unsigned int calls;
        unsigned int lines;
        unsigned int returns;
        unsigned int exceptions;
        unsigned int others;
        unsigned int new_files;
        unsigned int missed_returns;
        unsigned int stack_reallocs;
        unsigned int errors;
    } stats;

    which probably doesn't help my idea much, because I think we would need to associate every *line* with a list of the lines that are testing it.

    So for example:

    silly_module.py:

        def silly_func():
            foobar()

    silly_test.py:

        assert silly_func()

    silly_test2.py:

        assert silly_func()

    I should end up with:

        silly_func:0 = [silly_test.py:0, silly_test2.py:0]
        silly_func:1 = [silly_test.py:0, silly_test2.py:0]

    I'm afraid that it would be an awful lot of information to store if the project gets really big, though.

    For the reporting I imagine just adding a clickable button near every line that opens up a page collecting the different tests that run that line, with some context around them.

    That should probably be the easier part, even if I'm not really a good web-developer at the moment..
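    A minimal sketch of the mapping described above in plain Python; the names and line numbers come from the silly_func example, not from anything coverage.py actually provides:

```python
from collections import defaultdict

# Hypothetical store: (source_file, line_number) -> set of tests that hit it.
who_tests_what = defaultdict(set)

def record(source_file, line_number, current_test):
    """Record that `current_test` executed the given source line."""
    who_tests_what[(source_file, line_number)].add(current_test)

# Simulating the example: both test files execute both lines of silly_func.
for test in ("silly_test.py::test_silly", "silly_test2.py::test_silly"):
    record("silly_module.py", 1, test)
    record("silly_module.py", 2, test)

print(sorted(who_tests_what[("silly_module.py", 1)]))
```

    Even in this toy form it is clear how fast the data grows: every executed line carries its own set of tests.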

  3. Ned Batchelder repo owner

    I don't think we need to collect all the *lines* that test product lines, we need to collect the *tests* that test product lines, which reduces the data collection a bit, but it will still be a challenge.

  4. andrea crotti reporter

    By "the tests" do you mean the code object of the test function?

    In that case I agree, because it should keep track of the original file/line where it's defined, if I remember correctly.

    Anyway, another possible use case for this feature is checking whether unit tests are really unit tests.

    If I see for example that module a.py is tested by test_a.py but also by test_z.py, which has almost nothing to do with a.py, then something is wrong, and a system to visualize it would be nice..

  5. Thomas Güttler

    I guess you need this data structure to implement this:

    I use the Django ORM since it is what I know best, but SQLAlchemy might be a better fit. Storing to sqlite would be enough.

    class Line(models.Model):
        file_name = models.CharField(max_length=255)
        line_number = models.IntegerField()

    class StackFrame(models.Model):
        executed_line = models.ForeignKey(Line, on_delete=models.CASCADE,
                                          related_name='executions')
        lines_of_stack = models.ManyToManyField(Line, related_name='stacks')
    

    This structure needs to be filled with every line that gets executed by coverage.

    An HTML report could be created from this data.

    I guess this is really slow... but who cares? For me it would be enough to run this once every week in a cron/batch job.

  6. Thomas Güttler

    About the ORM: Linus Torvalds once said: good programmers care about data structures. That's why I would implement this first. And the db structure in this case is quite simple. The implementation could use pure sqlite without an ORM.

    Yes, the execution time would increase a lot. But I don't think an alternative to sqlite would be much faster.

    And: This is not intended to be run every time. We can optimize later.
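    A pure-sqlite version of the structure sketched earlier might look like this; the table and column names mirror the ORM sketch and are otherwise made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use a file on disk
conn.executescript("""
    CREATE TABLE line (
        id INTEGER PRIMARY KEY,
        file_name TEXT NOT NULL,
        line_number INTEGER NOT NULL,
        UNIQUE (file_name, line_number)
    );
    -- One row per executed line, linked to the stack lines that led to it.
    CREATE TABLE stack_frame (
        id INTEGER PRIMARY KEY,
        executed_line_id INTEGER REFERENCES line(id)
    );
    CREATE TABLE stack_frame_lines (
        stack_frame_id INTEGER REFERENCES stack_frame(id),
        line_id INTEGER REFERENCES line(id)
    );
""")
conn.execute("INSERT INTO line (file_name, line_number) VALUES (?, ?)",
             ("a.py", 3))
row = conn.execute("SELECT file_name, line_number FROM line").fetchone()
print(row)
```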

  7. xcombelle

    Instead of inspecting the stack at each call to the trace function, I thought of something which could be faster:

    at the start of a test, record which test we are in; during each call to the trace function, check what the current test is; at the end of the test, forget the current test.
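    The bookkeeping described above could be sketched like this (all names are illustrative; this is not coverage.py's trace interface):

```python
# The runner tells us the current test once, at test start; the per-line
# trace hook only reads a module-level variable, with no stack inspection.
current_test = None
lines_by_test = {}

def start_test(name):
    global current_test
    current_test = name
    lines_by_test[name] = set()

def end_test():
    global current_test
    current_test = None

def trace_line(filename, lineno):
    # Called for every executed line; cheap because it just reads a global.
    if current_test is not None:
        lines_by_test[current_test].add((filename, lineno))

start_test("test_example")
trace_line("prod.py", 10)
trace_line("prod.py", 11)
end_test()
print(lines_by_test["test_example"])
```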

  8. Tibor

    I think the conceptual problem here is that coverage.py has avoided the concepts of "test case" and "test". It's the job of a test runner to define, discover, instantiate, and execute them. And each test runner has a slightly different definition of what is a test and what is not..

    e.g. unittest has this definition of a test: methods of a unittest.TestCase subclass whose names begin with the letters "test"

    Other test runners have different definitions... E.g. pytest is very flexible and you can configure almost anything to be a test..

    The practical solution might be that coverage.py:
    a) does detection of tests during runtime if executed under unittest and
    b) provides an API for annotating phases (sections/tests) for the rest.

    @Ned Batchelder do you see b) as a challenge also, or were you referring to a) as not being easy? :)
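    Option b) could be as small as a context manager that brackets each phase; this sketch is purely hypothetical and not an actual coverage.py API:

```python
from contextlib import contextmanager

# Hypothetical annotation API: the test runner (or the user) brackets each
# "section"; the tracer would read `_current` while recording lines.
class SectionRecorder:
    def __init__(self):
        self.sections = []
        self._current = None

    @contextmanager
    def section(self, name):
        self._current = name
        try:
            yield
        finally:
            self.sections.append(name)
            self._current = None

rec = SectionRecorder()
with rec.section("test_login"):
    pass  # the code under test would run here
print(rec.sections)
```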

  9. Paul Sargent

    I do this kind of analysis in my day job all the time with other tools. What we normally do is store separate coverage results files for each test, and then we can do various bits of analysis like:

    • Calculate the union of all the coverage to get a total (obvious).
    • Find the top test which gets the most coverage
      • ...then rank the rest of the tests by how much additional coverage they each contribute.
    • Construct sets of tests, which are quick to run, but give high coverage because they're the top 10 (for example) ranked tests. Useful for commit checks when your full test suite takes an hour.
    • Identify tests which don't contribute any unique coverage.

    It all starts with having identifiable coverage for each test.
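    The ranking steps above amount to a greedy set cover over per-test coverage sets; a toy sketch with made-up data:

```python
# Greedy ranking: repeatedly pick the test that adds the most
# not-yet-covered lines (toy data; real input would be one result
# file per test).
coverage_by_test = {
    "test_a": {1, 2, 3, 4},
    "test_b": {3, 4, 5},
    "test_c": {1, 2},      # adds nothing test_a doesn't already cover
    "test_d": {6},
}

def rank_by_additional_coverage(cov):
    covered, ranking = set(), []
    remaining = dict(cov)
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # everything left contributes no unique coverage
        ranking.append((best, gain))
        covered |= remaining.pop(best)
    return ranking, covered

ranking, total = rank_by_additional_coverage(coverage_by_test)
print(ranking)
```

    Tests that never appear in the ranking (test_c here) are exactly the ones contributing no unique coverage.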

  10. Laurens Timmermans

    @Tibor : A while back I made a small proof of concept which basically does what you described under 'b'.

    I've uploaded the (extended) htmlcov of this proof of concept here. It basically provides a count ('covered by how many unique test-cases') and a heatmap kind of visualization to get an idea of which part of your code is touched most. The dropdown (called 'label') at the top and mouse-over in the column on the right allow selection/highlighting of test-cases.

    The test-suite and unittest.TestCase derived class which produced these results can be found here. The changes I made in coverage.py to support this are not there since they are really hacky and incomplete, but if anyone is interested; let me know.

  11. Ned Batchelder repo owner

    I think my preference would be to provide a plugin interface that would let a plugin define the boundaries between tests. In fact, it need not be "tests" at all. Perhaps someone wants to distinguish not between specific tests, but between directories of tests, or between unit and integration tests. Figleaf implemented a feature like this and called it generically, "sections".

    So the plugin could demarcate the tests (runners? callers? regions? sections? what's a good name?) any way it liked. Coverage.py can ship with a simple one that looks for test_* methods, for the common case.

    Any ideas about how to present the data? I'd like it to scale to 10k tests...

  12. Thomas Güttler

    @Ned Batchelder " Perhaps someone wants to distinguish not between specific tests, but between directories of tests, or between unit and integration tests"

    I think doing "Separation of concerns" here would be nice:

    First, collect the data in as much detail as possible. Second, aggregate it.

    This way both can be done: distinguishing between test methods and distinguishing between directories/sections.
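    The two steps could be sketched like this: detailed per-test data first, aggregation by test directory second (the pytest-style `file::test` naming is an assumption here):

```python
import os
from collections import defaultdict

# Detailed data: lines covered per individual test (toy data).
lines_by_test = {
    "tests/unit/test_a.py::test_x": {("a.py", 1), ("a.py", 2)},
    "tests/integration/test_big.py::test_y": {("a.py", 1), ("b.py", 7)},
}

def aggregate_by_directory(per_test):
    """Roll fine-grained per-test data up into per-directory data."""
    per_dir = defaultdict(set)
    for test, lines in per_test.items():
        test_file = test.split("::")[0]
        per_dir[os.path.dirname(test_file)] |= lines
    return dict(per_dir)

by_dir = aggregate_by_directory(lines_by_test)
print(sorted(by_dir))
```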

  13. Paul Sargent

    @Thomas Güttler So my day job is verification of hardware designs, but really the fact that it's hardware is not important. We have tests and we have code under test. The analysis is done with the commercial hardware design tools we use, but the principles of what's done are relatively straightforward.

    Rather than put a lot of detail here, I've written a snippet.

  14. Ned Batchelder repo owner

    @Thomas Güttler I agree about separation of concerns. That's one of the reasons I'm leaning toward a plugin approach: it isn't even clear to me that "test methods" is always the finest granularity we need. Some people use coverage.py without a test suite at all, and they may have their own idea about where interesting slices begin and end.

    BTW: I like the name "slice" for this concept. It's the same word as "string slicing", but I don't think that collision is a problem. "Segment" is similar, but not as nice.

  15. Thomas Güttler

    @Ned Batchelder coverage.py usage without tests.... good catch. Yes, that was not on my mind.

    You are right, it should be flexible.

    method: stacktrace_to_??? (unsure what to call it)

    Input: a stacktrace (the list of nested method calls). Output: ???, maybe just a string. Example for the use case "store which line was executed in which test": myapp.tests.test_foo.test_foo_with_bar

    The above use case would go down the stacktrace until it sees a method whose name starts with "test_....".
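    A sketch of that walk over a stacktrace, with the stack represented as a plain list of (module, function) pairs rather than real frame objects:

```python
# Walk outward through the stack until a frame's function name starts
# with "test_", and label the executed line with that test. All names
# here are illustrative, not coverage.py's API.
def stacktrace_to_context(frames):
    """frames: innermost-first list of (module, function_name) pairs."""
    for module, func in frames:
        if func.startswith("test_"):
            return "%s.%s" % (module, func)
    return None  # line executed outside any test

stack = [
    ("myapp.models", "save"),
    ("myapp.views", "create_user"),
    ("myapp.tests.test_foo", "test_foo_with_bar"),
]
print(stacktrace_to_context(stack))
```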

  16. Ned Batchelder repo owner

    I'd have to play around with possible plugin semantics. The challenge will be to support it in a way that doesn't require invoking a Python function too often, as that will kill performance.

  17. Thomas Güttler

    I would care about performance later. I like test-driven development: red, green, refactor.

    This is going to be slow, and that is not a problem. In my case I want to generate this data only once a week. I don't think this can be fast; a lot of data will be created while running the tests.

  18. Laurens Timmermans

    The abstraction I was thinking about is 'context', since that is to me what the plugin would add: the context in which each line was traced (e.g. test-case name, test type, filename, directory, or perhaps some even fancier 'context object'). A plugin would need to provide at least three things:

    1) something like a 'get_current_context' (which coverage.py then records along with the traced line);
    2) a 'post_process' to allow some filtering/aggregation/restructuring or deriving statistics/metrics from the measured data;
    3) an extension to the reporting.
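    The three-part plugin could be skeletoned as follows; the method names mirror the description above but are not a real coverage.py interface:

```python
# Hypothetical three-part 'context' plugin skeleton.
class ContextPlugin:
    def get_current_context(self):
        """Return a label recorded alongside each traced line."""
        raise NotImplementedError

    def post_process(self, data):
        """Filter/aggregate the measured {context: lines} data."""
        return data

    def report(self, data):
        """Extend the report, e.g. one column per context."""
        raise NotImplementedError

class ByTestTypePlugin(ContextPlugin):
    def __init__(self):
        self.context = "unit"

    def get_current_context(self):
        return self.context

    def post_process(self, data):
        # Example aggregation step: drop contexts that covered nothing.
        return {ctx: lines for ctx, lines in data.items() if lines}

plugin = ByTestTypePlugin()
cleaned = plugin.post_process({"unit": {1, 2}, "integration": set()})
print(cleaned)
```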

    As for speed/performance: I'm not that concerned about the time it will take to collect/measure (note that there are likely to be far fewer contexts than measured lines); as @Thomas Güttler noted: you would probably only run this periodically. I'm actually more concerned about the performance of the reporting; more specifically: the HTML report.

  19. Ned Batchelder repo owner

    I appreciate the "make it work, then make it fast" approach. In the case of designing a plugin API, though, the details of the API could have a big effect on the speed. But I hear you: it could be fine for this to be only enabled occasionally, and slow is fine.

    @Laurens Timmermans Hmm, "context" is a good (if boring!) word... :)

  20. Chris Beaumont

    Hey there. I've been thinking about this issue lately, and thought it might be worth leaving some notes here. I've been working on a coverage wrapper called smother (https://github.com/chrisbeaumont/smother) based on the ideas I've seen on this ticket, @Kevin Qiu's nostrils repo, and the experimental WTW code in coverage.py's source. A quick summary of smother's approach:

    • Relies on using a test runner (currently pytest or nose are supported) to hook into the start of every test. Uses coverage to actually collect coverage information, and saves the information separately for each test. This is similar to nostrils except it uses coverage.py for tracing (so it's fast and robust). It's also like the WTW code in coverage.py, but relies on the test runner instead of the tracer to detect test boundaries (so it's fast and robust).
    • It builds a JSON file whose shape is {test_name: {file_name: [list_of_lines]}}
    • The CLI has 4 main commands: lookup (given a function name or line range, list which tests visit this code section), diff (given a git diff, list which tests might be affected), csv (dump a CSV of test/source code pairs for exploration), and to_coverage (build a vanilla coverage report)
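    A toy lookup over that JSON shape (the data below is invented; smother's real files are much larger, and its actual `lookup` command may behave differently):

```python
# The smother data shape: {test_name: {file_name: [list_of_lines]}}.
data = {
    "tests/test_a.py::test_x": {"a.py": [1, 2, 3]},
    "tests/test_b.py::test_y": {"a.py": [3, 4], "b.py": [10]},
}

def lookup(data, file_name, first, last):
    """List tests touching file_name within the line range [first, last]."""
    hits = set()
    for test, files in data.items():
        if any(first <= n <= last for n in files.get(file_name, ())):
            hits.add(test)
    return sorted(hits)

print(lookup(data, "a.py", 3, 3))
```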

    In answer to some of the questions in this thread:

    If you are going to record which line was tested by each test, what will you do as the code shifts around due to insertion and deletion of lines?

    This is primarily relevant for smother diff, where the smother data has been generated for an old version of the code and then queried against a set of modifications. Smother takes the approach of mapping each line of code to a "semantic region" (essentially the smallest function or class block that contains that line). smother diff converts the set of modified lines from a changeset into a set of semantic regions, converts that to a set of line ranges on the old version of the code, and matches those line numbers to what's in the smother report. This conservative approach will match some tests that may not have actually touched a specific modification (if the modification was in an unevaluated branch, say), but has the benefit that these "semantic regions" are much more stable across changesets than line numbers are.

    Any ideas about how to present the data? I'd like it to scale to 10k tests...

    The inspiration for smother was an 11K test suite of a 100K line legacy codebase, and it is reasonably performant (negligible time overhead, a somewhat-ungainly 100MB data file that could easily be optimized for size, and ~5 sec query times). I've experimented with different visualizations of smother's CSV output, but ultimately found that the lookup and diff commands are most useful -- for other exploration coverage's normal HTML report is sufficient. In other words, "who tests what" feels most useful in the context of specific questions (what tests might I have just broken?).

  21. Tibor

    @Chris Beaumont For reference, I'll link here also http://testmon.org . pytest-testmon is a py.test plug-in which automatically selects and re-executes only tests affected by recent changes.

    I haven't had time to look at smother yet. testmon uses a notion of "python code blocks" (probably something similar to smother's "semantic regions"). pytest-testmon also takes into account holes in the blocks, which is described in the second half of: https://github.com/tarpas/pytest-testmon/wiki/Determining-affected-tests

    My answer to the question:

    If you are going to record which line was tested by each test, what will you do as the code shifts around due to insertion and deletion of lines?

    would be that I think coverage.py doesn't need to care, but if it really wants to, it can store checksums of code blocks, as implemented in testmon.
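    The checksum idea can be sketched with hashlib: fingerprint a block's source text so the block is recognizable even after it moves to different line numbers (testmon's real scheme is more involved):

```python
import hashlib

# Fingerprint a code block by its (whitespace-normalized) text, not its
# line numbers, so shifting the block around the file changes nothing.
def block_checksum(block_source):
    normalized = "\n".join(line.rstrip()
                           for line in block_source.splitlines())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

before = "def f(x):\n    return x + 1\n"
# The same block after unrelated insertions elsewhere in the file:
after = "def f(x):\n    return x + 1\n"
changed = "def f(x):\n    return x + 2\n"

print(block_checksum(before) == block_checksum(after))    # unchanged block
print(block_checksum(before) == block_checksum(changed))  # edited block
```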

  22. xcombelle

    I don't get how you got the figure of C/4 more information to store. (I don't know how it is stored now either.) As I understand it, now you have to store all the executed lines. With the new way you would also have to store the contexts in which each line is executed, so O(n) more information, where n is the average number of simultaneous contexts.

  23. Ned Batchelder repo owner

    The way I've implemented the contexts so far, there is a separate data structure for each context. So I don't store a list of contexts for each line. Instead, each context has a subset of the line data. So the question is, what fraction of the full product's coverage will a single context be. I took a crude guess at 25%. Hence, C/4.

  24. xcombelle

    I realize both ways of storing the data are equivalent, and that the full-project coverage of a single test depends heavily on the granularity of a test. A unit test exercises only a small part of the codebase, but an integration test covers a much bigger part. So you are totally right that two orders of magnitude more data might be necessary.
