#308 Merged at dd8b7e8

Source: MatthewTurk (branch yt) -> Destination: yt_analysis (branch yt)
Author: MattT
Reviewers: (none listed)

Description

This pull request includes an answer testing plugin for Nose, as well as a mechanism by which this plugin can be used to upload new results and to compare locally generated results against a gold standard stored on Amazon S3.

How does Answer Testing work now?

Currently, Answer Testing in yt works by running a completely home-grown test runner, discoverer, and storage system. This works on a single parameter file at a time, and there is little flexibility in how the parameter files are tested. For instance, you cannot select fields based on the code that generated the pf. This catches many but not all errors, and can only test Enzo and FLASH.

When a new set of "reliable" tests has been identified, it is tarred up and uploaded. In practice, no one ever really used these archives, and it's difficult to run the tests unless you're on Matt's machine.

What does this do?

There are two ways in which this can function:

  • Pull down results for a given parameter file and compare the locally-created results against them
  • Run new results and upload those to S3

These two modes are not meant to be used together. In fact, the ideal method of operation is that when the answer tests are changed intentionally, new gold standards are generated and pushed to S3 by one of a trusted set of users. (New users can be added, with the privileges necessary to push a new set of tests.)

This adds a new config option, test_data_dir, to the [yt] section of ~/.yt/config; it points at the directory where parameter files (such as "IsolatedGalaxy" and "DD0010" from yt's distribution) can be found. When the nosetests are run, any parameter files found in that directory will be used as answer testing input. The file yt/frontends/enzo/tests/test_outputs.py contains the Enzo frontend tests that rely on parameter files. Note that right now the standard AMR tests are quite extensive and generate a lot of data; I am still in the process of creating new tests to replicate the old answer tests, and of slimming them down for big datasets.
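
For reference, a minimal ~/.yt/config entry would look something like this (the directory path is just a placeholder):

[yt]
test_data_dir = /path/to/yt_test_data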

To run a comparison, you must first run "develop" so that the new Nose plugin becomes available (see the note after the two commands below). Then, in the yt directory:

nosetests --with-answer-testing frontends/enzo/ --answer-compare=gold001

To run a set of tests and store them:

nosetests --with-answer-testing frontends/enzo/ --answer-store --answer-name=gold001
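
For reference, the "develop" step mentioned above is presumably the standard in-place setuptools install run from the yt source tree; this is an assumption on my part, so use whatever you normally do to install yt from source:

python setup.py develop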

Not only can we now run answer tests, we also no longer have to manage the uploads (manually or otherwise): yt will do this for us, using boto. Down the road we can swap Amazon out for any OpenStack-compliant cloud provider, such as SDSC's cloud.
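
For the curious, the flavor of boto call involved is roughly the following. This is only a sketch: the bucket name, key layout, and payload here are made up for illustration and are not necessarily what the plugin actually uses.

import cPickle as pickle
import boto

results = {"some_test": [1.0, 2.0, 3.0]}               # stand-in for serialized answer results
conn = boto.connect_s3()                               # credentials come from the boto config or environment
bucket = conn.get_bucket("yt-answer-tests")            # hypothetical bucket name
key = bucket.new_key("gold001/enzo-IsolatedGalaxy")    # hypothetical key layout
key.set_contents_from_string(pickle.dumps(results))    # push the new gold standard
key.set_acl("public-read")                             # let comparisons download it anonymously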

Additionally, we can now add answer testing of small data to Shining Panda. In the future, we can add answer testing of large data with lower frequency, as well.

What's Next?

Because there's a lot to take in, I'd like to suggest this PR not be accepted as-is. There are a few items that need to be done first:

  • The developer community needs to be brought in on this; I would like to suggest either a hangout or an IRC meeting to discuss how this works. I'd also encourage others to pull this PR, run the nosetests command that compares data, and figure out if they like how it looks.
  • The old tests all need to be replicated. This means things like projections turned into pixel buffers, and field statistics computed without storing the full fields.
  • Tests need to be added for other frontends. I am currently working with other frontend maintainers to get data, but once we've gotten it, we need to add tests as is done for Enzo. This means FLASH, Nyx, Orion, as well as any others that would like to be on the testing suite.

I'd like to encourage specific comments on lines of code to be left here, as well as comments on the actual structure of the code, but I'll be forwarding this PR to yt-dev and asking for broader comments there. I think that having a single, integrated testing system that can test a subset of parameter files (as well as auto-discover them) will be extremely valuable for ensuring maintainability. I'm really excited about this.

UPDATE 1: Added separate bigdata/smalldata tests.

UPDATE 2: Re-enabled answer testing comparisons against gold001, and added a big_data option to the test runner plugin as well as to the requires_pf decorator. Thanks to @Nathan Goldbaum for the heads up on the typo!

UPDATE 3: This now works in Shining Panda. I think it's ready for acceptance at this time.

UPDATE 4: Fixes Nathan's reported bugs with the big data test

  • Issues: YT-2

Comments (9)

  1. Britton Smith

    Matt just walked me through this and I think it works quite well. It is easy to see where the tests are set up and how they are selected to run with different datasets. In my opinion, this will make both running the tests and creating new ones a lot simpler, and should lower the barrier for people getting involved. If we put all the datasets required for testing in the same location, that will also help.

    This new framework does not allow for offline testing, as the results downloaded from Amazon are only stored in memory. I can live with this for this iteration, but it might be nice in the future to have an option to download and store the comparison results (a possible sketch is below).
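
    One possible shape for that future option, purely as a sketch (the cache location and serialization here are my own assumptions, not part of this PR), would be to drop the downloaded payload under test_data_dir and read it back when offline:

        import os
        import cPickle as pickle

        def cache_results(test_data_dir, name, results):
            # write the downloaded gold-standard results next to the test datasets
            with open(os.path.join(test_data_dir, name + ".answers"), "wb") as f:
                pickle.dump(results, f)

        def load_cached_results(test_data_dir, name):
            # reuse a previously downloaded gold standard when offline
            with open(os.path.join(test_data_dir, name + ".answers"), "rb") as f:
                return pickle.load(f)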

    I encourage other people to check this out and see how much nicer it is.

  2. Nathan Goldbaum

    Hi Matt,

    This is really great. I like how this democratizes the testing process.

    Once this is accepted I'd like to add my yt repository on bitbucket to ShiningPanda so I get test results after every commit!

    After fixing the typo I pointed out below, a number of tests fail for me: http://pastebin.com/NzRNVw5V

    It looks like all of the failures are due to shape mismatches in the comparisons. Do you have any idea what's going on?

  3. MattT author

    Hi Nathan,

    Yup, I have a great idea what's wrong -- I updated the tests, but not the test results! I think that means the system works. :) I've now updated the PR and the test results for gold001.

    I think this might mean we're ready to go once the Wiki goes up?

  4. MattT author

    Hi Britton, sure. It's just a toggle, really, and I named it big_data because I wanted it to control whether we run the tests that are likely to take a long time. For instance, we don't want to run the galaxy0030 tests on every single push to the repo, because they take ~5 minutes, but we do want to run the small-data tests every time. This just toggles whether we do or not, and it is supplied on the command line. Does that make sense?
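
    For anyone reading along, a hypothetical test decorated this way might look roughly like the following. The import path, dataset path, and helper names are reconstructions from the description above, not the PR's exact code:

        from yt.utilities.answer_testing.framework import requires_pf, data_dir_load

        g30 = "IsolatedGalaxy/galaxy0030/galaxy0030"   # resolved relative to test_data_dir

        @requires_pf(g30, big_data=True)   # skipped unless the big-data toggle is enabled on the command line
        def test_galaxy0030():
            pf = data_dir_load(g30)
            assert str(pf) == "galaxy0030"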