Updating the test suite to work with yt-3

#304 Merged at 897b6a6
Source repository: ngoldbaum (branch: week-of-code)
Destination repository: enzo (branch: week-of-code)
Author: Nathan Goldbaum
Reviewers
Description

This updates the test suite to work with yt-3.

It passes the push suite on my laptop. I haven't tried running the full suite.

If this is merged, our minimum required yt version moves up to yt-3.2.2 or newer. That version is unreleased, but we need it because I had to update a couple of places in yt's Enzo frontend to get the ShockTubeTest to work; see yt PR 1781.

Comments (33)

  1. Nathan Goldbaum author

    Sam tells me over GChat that he didn't update the test suite to work with yt-3 because of performance concerns. Before this PR is merged, we should probably compare its runtime against the yt-2.x test suite, check whether there is a slowdown, and if there is, either try to fix it or decide whether it's a big enough problem to worry about.

    1. MattT

      The performance concerns were probably reasonable; I suspect things are better now, but not fully remedied.

      1. Brian O'Shea

        @ngoldbaum, would you mind looking into this and reporting back? Even if there's a moderate performance hit, it's probably worthwhile to update!

        1. Brian O'Shea

          Follow-up on this: I'm seeing a tremendous slowdown when I compare the test suite under yt-2.x to the test suite under yt-3.x. Using the changeset in yt PR 1781 and this enzo PR, when I run the test suite with this line:

          python ./test_runner.py --suite=push -o /Users/bwoshea/Desktop/new_gold_standard/ --answer-name=push-suite --local --test-only

          (i.e., just doing the tests, having generated the data previously), the yt-3.x test runner takes anywhere between 100 and 400 times longer for any individual test (which can consist of between 1 and ~30 sub-tests). I haven't done any profiling to drill down to exactly where this is happening, but it is definitely a problem: it's the difference between the push suite taking less than a minute (for the tests only) and taking at least a couple of hours. So, I think we really need to hold off on this PR until this performance issue is resolved. I suspect it's on the yt side of things, though. Do we want to bump this to the yt-dev list?

          1. Nathan Goldbaum author

            Any chance you can upload a profile somewhere? You should be able to get one by putting python -m cProfile -o prof.out before ./test_runner.py in the command-line example you gave.
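
            Concretely, something like this should work (reusing your command line from above; prof.out is just an output filename):

            python -m cProfile -o prof.out ./test_runner.py --suite=push -o /Users/bwoshea/Desktop/new_gold_standard/ --answer-name=push-suite --local --test-only

            You can then inspect the profile with the pstats module, e.g.:

            python -c "import pstats; pstats.Stats('prof.out').sort_stats('cumulative').print_stats(20)"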

            1. Nathan Goldbaum author

              So it looks like basically all of the time is spent in one yt function:

              52300   1823.3686   0.0349  1828.2117   0.0350  /Users/goldbaum/Documents/yt-hg/yt/data_objects/static_output.py:570(_add_object_class)
              

              I'm looking at the implementation of that function now...

                1. Nathan Goldbaum author

                  As far as I can see, there seems to be a regression in the speed of the EnzoSimulation class, which is used heavily in the test suite.

                  I'm now installing yt-2 to see if I can make a usable benchmark, and will open a yt issue about it.
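
                  For timing purposes I have something like this in mind (a minimal sketch using yt-3's yt.simulation entry point for EnzoSimulation; the parameter file name is just a placeholder):

                  # Time how long it takes to construct an EnzoSimulation and walk its
                  # outputs; the test suite does this kind of thing over and over.
                  import time
                  import yt

                  t0 = time.time()
                  es = yt.simulation("AMRCosmology.enzo", "Enzo")  # returns an EnzoSimulation
                  es.get_time_series()
                  print("EnzoSimulation setup took %.2f s" % (time.time() - t0))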

                  1. Brian O'Shea

                    Ouch, that's rough. Sorry for being gone for a while - if you need me to do any additional profiling, please let me know.

                        1. Brian O'Shea

                          I've pulled in these two PRs (in addition to the original PR) and have verified that when using yt-3 the total time (using the ... --test-only option in the test suite, which isolates the tests themselves) dropped from ~8,000 seconds to 120 seconds (compared to ~60 seconds for the yt-2-based test suite). So, big win there, and that's a reasonable amount of time.

                          However, here's something that was also a problem in the original PR: it doesn't seem to run all of the tests, and some of the ones that do run seem to fail in odd ways. There are 69 separate simulations run in the "push suite", with ~30 individual tests run per simulation in the yt-2 version of the test suite, for a total of 1779 tests as reported by test_runner.py. The yt-3-based test_runner, with the exact same command line, only runs a total of 253 tests, of which 5 error and 13 fail. Manual inspection of the failures (where the relative difference in dataset values between the 'gold standard' and this version of enzo exceeds 0.001) shows arrays with utterly different answers, which makes me think that something funny is happening in the comparisons.

                          Pastebin of yt-2 test runner test_results.txt (note: it seems to be too large to paste the entire thing, so this is just the first half or so):

                          http://paste.yt-project.org/show/5928/

                          Pastebin of yt-3 test runner test_results.txt:

                          http://paste.yt-project.org/show/5927/

                          Pastebin of the raw output of the test runner (i.e., dumping stdout and stderr of the test runner into a file):

                          http://paste.yt-project.org/show/5929/

                          I tried running the test runner with --verbose turned on, but no additional useful output was provided. :-/

                          1. Nathan Goldbaum author

                            I don't have time to look at this more closely this week. If someone else wants to take it up, feel free; otherwise I will come back to it later.

                            I think all the tests are running; it's just an issue with how tests are being yielded (or not).
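
                            For context, these answer tests are nose "generator" tests, and nose counts each yielded sub-test separately, so the reported totals depend on what actually gets yielded. A minimal sketch of the pattern (the field names are just placeholders):

                            # Sketch of a nose generator test: every yield becomes its own counted
                            # sub-test, so any skipped yield changes the reported test count.
                            def test_fields():
                                for field in ("Density", "Temperature", "x-velocity"):
                                    yield check_field, field

                            def check_field(field):
                                assert field is not None  # placeholder check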

                            1. Nathan Goldbaum author

                              Also, you'll likely need to generate a new gold standard, since some of the test names have changed.

  2. Brian O'Shea

    Hi @ngoldbaum, I think we're now ready to revisit this PR and get it merged into enzo-dev. I'm seeing a lot of tests failing, though, and hope that you can take a look. I'm generating the gold standard for the "push" suite using the current tip of enzo-dev (i.e., with the yt-2-based test_runner.py file), and I have a second clone of the same repository but with this PR pulled in. When I run the actual test suite I get a wildly different number of tests, which I now think I understand. However, I also get many failed tests: I see a lot of this error:

    ERROR: Failure: AttributeError ('EnzoSimulation' object has no attribute 'length_unit')

    and also variants of this error:

    ERROR: VerifySimulationSame_./AMRZeldovichPancake.enzo_all

    where the key line in the traceback seems to be:

    YTNoOldAnswer: There is no old answer available.

    This latter problem seems particularly odd because I've verified that, for those calculations, identical datasets are present in both the gold standard and the new simulation sets.

    For reference, all of the output from calling the test runner for the PR'd version of the code can be seen here:

    http://galactica.pa.msu.edu/~bwoshea/data/test_runner_all_output.txt

    And, for reproduction purposes, here are the commands I used:

    Gold standard generation (tip of enzo-dev with yt-2):

    python ./test_runner.py --suite=push -o /Users/bwoshea/Desktop/Enzo_PR_Tests/test_suite_enzo-3 --answer-store --answer-name=push_suite --local

    Test of this PR (this PR with yt-3):

    python ./test_runner.py --suite=push -o /Users/bwoshea/Desktop/Enzo_PR_Tests/test_suite_enzo-dev --answer-name=push_suite --local --strict=high

    Would you mind trying to reproduce this error?

    1. Nathan Goldbaum author

      I'm traveling back from Canada today. I will try to look at this in the next couple of days.

    2. MattT

      From what I can tell, EnzoSimulation does have the length_unit attribute, but only when it is a cosmological simulation. @brittonsmith, can you look at that? I'm looking at line 103 of yt/frontends/enzo/simulation_handling.py, where the attribute is set inside that conditional.
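
      A toy illustration of that failure mode (a hypothetical class, not the actual yt source):

      # The attribute only exists on the cosmology code path, so non-cosmological
      # runs raise AttributeError when the test suite asks for length_unit.
      class FakeSimulation(object):
          def __init__(self, cosmological):
              if cosmological:
                  self.length_unit = 1.0  # placeholder value
              # no else branch: non-cosmological runs never get length_unit

      FakeSimulation(cosmological=True).length_unit    # works
      FakeSimulation(cosmological=False).length_unit   # AttributeError, as above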

    3. Nathan Goldbaum author

      So I think the issue is that you're generating the answers on yt-2, then running the tests with yt-3. I doubt that will work, since we didn't do anything in the yt-2 to yt-3 transition to ensure the testing infrastructure could load test answers produced by yt-2.

      It's been a while since I looked at this PR, so bear with me, but this morning I'm trying to generate a set of answers using yt-3 and then verify that the tests all pass that way. I'm also going to take a shot at generating more verbose output from the tests so you can verify that the test coverage is the same as it was under yt-2.

  3. Nathan Goldbaum author

    Found the issue that caused the test counts to not match. Was an easy fix in the end, updates incoming...

    1. Nathan Goldbaum author

      I've fixed this issue and updated the docs. This should be ready for review.

      Let me know if regenerating the gold standard after merging this change is going to be an issue, or if we need to make a note about it somewhere in the docs.

      1. Brian O'Shea

        Hi Nathan,

        I've pulled in all of the changes from all three of your PRs, and have run the test suite using yt-3 (for both the gold standard and the comparison, which is the same codebase but with an extra printf() thrown in so that the changeset hash is different). I get this set of errors for both the gold standard run and the comparison run:

        http://galactica.pa.msu.edu/~bwoshea/data/test_results_yt3_2016_06_28.txt

        which includes a ton of errors like this:

        Toro-1-ShockTube__test_standard.test_standard: ERROR (<type 'exceptions.ValueError'>, ValueError('bad axis2 argument to swapaxes',), <traceback object at 0x10eedfe18>)

        as well as a couple of these:

        PhotonTestAMR__test_amrphotontest.test_amr_photon_test: ERROR (<class 'yt.utilities.exceptions.YTFieldNotFound'>, YTFieldNotFound(), <traceback object at 0x10ac36368>)

        The former set of errors seems to go along with 1D and 2D simulations. I'm not sure what's up with the latter error; I've manually verified that the two different simulations have the same set of fields, though when I look at the associated python testing script (PhotonTestAMR__test_amrphotontest.py) I'm reasonably confident that the field 'H_p2_Fraction' does not in fact exist and that 'H_p1_Fraction' should have a lower-case 'f'. As an additional clue, the equivalent script in the PhotonTest directory doesn't have that set of problems. In that directory, the fields are:

        _fields = ('HI_Density', 'HII_Density', 'HI_kph')

        instead of:

        _fields = ('Density', 'H_p1_Fraction', 'H_p2_Fraction', 'HI_kph')

        HOWEVER, what is more alarming is that the PhotonShadowing test didn't run, and in fact segfaulted for some reason... yet the four tests in that file that involve the PhotonShadowing run all passed. That is worrying; do you have any idea what's going on there?

        1. Nathan Goldbaum author

          I didn't see those yesterday, but then again I only ran the quick suite. Let me try again with the push suite.

        2. Nathan Goldbaum author

          I was able to reproduce some of these, but not all. Here are the test results I see:

          https://bpaste.net/show/bde86b6f7c97

          All of the failures I see are due to RefineBy still being 3 in two tests. That will be fixed when PR 340 is merged.

          For the sake of reproducibility, I generated my answers using yt version 3ca8a739e529, grackle 2.1, and enzo version 1c18d141262e. The latter version has been pushed to this pull request. I compared with 1c18d141262e by making a trivial change to the repo with 1c18d141262e checked out, as you did.

          Let's go through the failures you saw one by one:

          Toro-1-ShockTube

          I can't reproduce this one at all. The test runs fine both when I generate the answers and when I compare with the gold standard.

          PhotonTestAMR

          I see this as well. I've fixed it in 7656f380ba4b.

          PhotonShadowing

          I don't see a seg fault when I run PhotonShadowing; it runs to completion.

          1. Brian O'Shea

            Sorry for the delay in response. I'll look into this later today and report back. By the way, what type of computer/operating system are you running on? Some Linux variant or OS X?

            1. Brian O'Shea

              OK, following up on this... it turns out that the PhotonTest problem (the run not executing) is actually a compiler issue. I recently updated to gcc 5.2.0, and with opt-high this leads to NaN problems in the PhotonTest problem (but in no other test problems, interestingly enough). So, that was easily dealt with by running with opt-debug.

              That said, I still see 52 errors in the test suite when I generate both the gold standard and the "test" version of the test problems with yt-3 using the following arguments to the test runner:

              python ./test_runner.py --suite=push -o /Users/bwoshea/Desktop/Enzo_PR_Tests/test_suite_enzo-dev --answer-store --answer-name=push_suite --local --strict=high --verbose

              The errors that I see are all of this variety:

              ERROR (<type 'exceptions.ValueError'>, ValueError('bad axis2 argument to swapaxes',)

              The full output of the test runner can be found at http://galactica.pa.msu.edu/~bwoshea/data/test_results_2016_06_30.txt . When I turn on the --verbose option when I run the tests, I get a more extensive error trace that looks like this:

              ======================================================================
              ERROR: GridValues_data0011_all_Density
              ----------------------------------------------------------------------
              Traceback (most recent call last):
                File "/Users/bwoshea/Desktop/yt-conda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
                  self.test(*self.arg)
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/utilities/answer_testing/framework.py", line 327, in __call__
                  nv = self.run()
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/utilities/answer_testing/framework.py", line 548, in run
                  hashes[g.id] = hashlib.md5(g[self.field].tostring()).hexdigest()
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/data_objects/grid_patch.py", line 83, in __getitem__
                  tr = super(AMRGridPatch, self).__getitem__(key)
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/data_objects/data_containers.py", line 272, in __getitem__
                  self.get_data(f)
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/data_objects/data_containers.py", line 1178, in get_data
                  fluids, self, self._current_chunk)
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/geometry/geometry_handler.py", line 245, in _read_fluid_fields
                  chunk_size)
                File "/Users/bwoshea/Desktop/yt-conda/src/yt-hg/yt/frontends/enzo/io.py", line 161, in _read_fluid_selection
                  rv[(ftype, fname)] = gds.get(fname).value.swapaxes(0,2)
              ValueError: bad axis2 argument to swapaxes
              

              This happens for a bunch of different test problems, all of which are 1D and 2D test problems. I've tried this with my original installation of yt-3, installed ages ago with the old non-conda install script (changeset 428dd2f17465), as well as a fresh install of yt-3 that I installed this morning using the conda-based install script (also changeset 428dd2f17465). Both have this problem. When I turn on the verbose option in the test runner, I get the contents of this file: http://galactica.pa.msu.edu/~bwoshea/data/all_test_runner_output_2016_06_30.txt . Looking through this in detail, I see that all of the errors have to do with GridValues-type tests.
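
              For what it's worth, the failure is easy to reproduce outside of yt: the swapaxes(0,2) call in the Enzo IO path (see the traceback above) assumes 3D grid data.

              # Minimal reproduction with plain numpy: axis 2 only exists for 3D arrays,
              # so 1D/2D Enzo grids hit the "bad axis2 argument to swapaxes" error.
              import numpy as np

              np.zeros((4, 4, 4)).swapaxes(0, 2)   # fine for 3D data
              np.zeros((4, 4)).swapaxes(0, 2)      # raises ValueError: no axis 2 in a 2D array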

              I suspect that I'm seeing this because I am using the --strict=high argument to the test runner and you aren't. Setting --strict to "high" requires that the bitwise tests get called, and when I dig into the yt codebase I see (in yt/frontends/enzo/answer_testing_support.py) that the GridValuesTest only seems to get called in that circumstance. So, if you're not using --strict=high, that would explain why you're not seeing the error. Would you mind trying to reproduce this?

              1. Brian O'Shea

                Actually, answering my own question: looking at your pasted output, at the top of the paste it has:

                Relative error tolerance: 1e-3
                Bitwise tests not included
                

                So that's what's going on. If you turn on --strict=high I bet you'll see my problems. Once that's resolved, we can accept this PR!

                1. Brian O'Shea

                  Awesome, thank you. I'll give this a try ASAP and let you know how it goes. Just to check, though - if I just merge this PR into my own yt distribution (installed via conda) do I have to do anything other than that? Do I have to do any sort of "python setup.py develop"-type thing?

                  1. Nathan Goldbaum author

                    If you're running out of a repository (and not the conda binary packages or the nightly binary builds), then yes, that's all you need to do, since that PR only modifies python code.
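
                    In other words, pulling the PR into an existing mercurial source install looks something like this (the fork URL and bookmark name are just placeholders):

                    cd /Users/bwoshea/Desktop/yt-conda/src/yt-hg
                    hg pull -r <pr-bookmark> https://bitbucket.org/<fork>/yt
                    hg update <pr-bookmark>
                    # no rebuild step is needed afterwards, since only .py files change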

                    1. Brian O'Shea

                      Awesome, thank you. Also, a followup: when I pull yt PR #2258 into my yt repository, all tests pass with flying colors. Thank you, @ngoldbaum, for your work on this one. I'm going to hit approve on both this PR and the yt one, and we can pull the trigger on this when a couple more people sign off. @MatthewTurk, @brittonsmith?

  4. John Regan

    @bwoshea @ngoldbaum I had a look through the changes to the python scripts and docs, and it all looks OK to me. From the source code, the changes look pretty minor: just a simple update to yt-3. However, from reading back through the comments, it was obviously a lot more work than that! Excellent work!

  5. Brian O'Shea

    That's 3 approves; time to merge. Nice job on this one, @ngoldbaum - this was a tremendous amount of work. Thanks also to @john_regan and @MatthewTurk for reviewing!