Pull request #416 (Merged)
Source: branch yt (e1b7792e6807), from a repository that has since been deleted
Destination: yt_analysis/yt, branch yt

Saving derived fields to disk

Author: Andrew Myers

Description

I refactored the derived-field saving code so that all the changes live in the base classes (StaticOutput, IOHandler, and AMRHierarchy). The only change needed in the individual frontends is renaming the _read_data_set function to _read_data. This has been tested on all the code frontends I have datasets for: FLASH, Orion, Enzo, and Chombo. Since this does touch the other frontends, it would be great if there were a way to make sure I'm not breaking anything there.
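
For illustration, the per-frontend change amounts to something like the hypothetical IOHandler below (the base-class import path is assumed from the yt 2.x layout, and the frontend itself is made up):

from yt.utilities.io_handler import BaseIOHandler

class IOHandlerMyCode(BaseIOHandler):
    # Hypothetical frontend, sketching only the rename described above.
    _data_style = "mycode_hdf5"

    def _read_data(self, grid, field):
        # Previously named _read_data_set; the base classes now call _read_data.
        # A real frontend would open the on-disk file for this grid and return
        # the field as an array.
        raise NotImplementedError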

Other points:

  1. Currently, if you're working on a dataset called "fn", the code will save the field in "fn_backup.gdf", alongside the "fn.yt" file. Is this dumb?

  2. I did have to re-arrange the gdf writer a bit to get this to work. The code passes the test in "test_writer.py", but Casey or someone might want to make sure I'm not messing anything up in there.

EDIT - I accidentally overwrote some changes that read in periodicity information for the Orion, FLASH, and Enzo frontends. This is fixed now.

Comments (30)

  1. Matthew Turk

    Wow, super cool! I'm going to review this and get back to you. It may be that we could also move the backup file into the base data structures, and if it's there, just use it. What do you think?

  2. John ZuHone

    Andrew,

    First off, this is great! I hope to check this out sometime tonight or tomorrow.

    Just a quick question: how does this handle particle fields?

    1. Andrew Myers author

      Hi John,

      At the moment, it doesn't, but the gdf writer does support writing out particle fields, and I think I could make it work by being slightly more sophisticated in the _read_data_set method.

      1. John ZuHone

        Once I get the chance to test it (sorry, access to email but not code), I can accept this as-is and try adding particle support later, as I've been messing around with particles quite a bit lately. But if you want to give it a try, go ahead.

        Typically this has been handled by checking the shape of the field to see whether it is 1-D or 3-D.
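
        A minimal sketch of that shape-based check, assuming the fields come back as NumPy arrays (the helper name is illustrative, not part of this pull request):

        import numpy as np

        def _is_particle_field(arr):
            # Gridded fields are 3-D (nx, ny, nz) blocks, while particle fields
            # are flat 1-D arrays, so the dimensionality tells them apart.
            return np.asarray(arr).ndim == 1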

  3. John ZuHone

    Something odd is happening when I try this. I added a simple field:

    def _Entropy(field, data):
        return data["Temperature"] * data["Density"]**(-2./3.)
    add_field("Entr", function=_Entropy)
    

    I open a pf, then try to save the field:

    save_field(pf, "Entr")
    

    When it tries to save the field to disk, it has to get the field, so it uses the grid's "get_data" method. Since the field is composed of two fields that are actually in the file, yt calls the I/O handler, which now bails: KeyError: "unable to open object (Symbol table: Can't open object)"

    This was on a FLASH dataset. Funky things seem to happen when two file handles are open...

    1. Andrew Myers author

      Hi John,

      Thanks for testing this. It turns out that in my last update, I missed a couple of places where the grid._id_offset fix needed to be made, which ended up breaking the pull request once that fix was made to the gdf writer.

      While I was fixing that, I also found another problem with saving a field to a _backup.gdf file that already existed. That is fixed now too. The scripts here and here now work for me for FLASH, Orion, Chombo, and Enzo datasets, after deleting any corrupted _backup.gdf files.

  4. John ZuHone

    So those scripts work for me now, but making a SlicePlot of FLASH data does not. In this case, it's because the FLASH frontend's implementation of _read_data_slice is completely independent of _read_data. Matthew Turk, would rewriting _read_data_slice here to use _read_data have any undesirable side effects that you can think of?

    1. Andrew Myers author

      Yeah, I see what you mean. I had fooled myself into thinking that slices worked, since some of the frontends do just call _read_data_set and then slice the output. Clearly this needs a bit more work...
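
      The pattern Andrew describes, where the slice read is just the full-grid read plus an index, looks roughly like this (a hypothetical method, not the actual frontend code):

      class SliceViaFullRead:
          # Illustrative sketch: implement the slice read on top of the
          # full-grid read, as some frontends already do with _read_data_set.
          def _read_data_slice(self, grid, field, axis, coord):
              sl = [slice(None), slice(None), slice(None)]
              sl[axis] = slice(coord, coord + 1)
              return self._read_data(grid, field)[tuple(sl)].squeeze()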

  5. Matthew Turk

    Hi all, I hope to provide a review of this soon -- sorry I haven't yet. John ZuHone, to quickly answer your question: I no longer believe that _read_data_slice is efficient, especially for large quantities of data where the amount of data skipped is small. For FLASH it's probably faster not to use _read_data_slice, since what we're doing is trading the cost of constructing a (relatively expensive) Python wrapper around an HDF5 dataspace against reading the other 7 rows of data and discarding them.

      1. Matthew Turk

        I would be fine with that: it is gone in 3.0 anyway. The only place I could see this being a big performance hit is for situations where you have a large grid that is being sliced; say, 128^3 or 512^3. But I'm not sure. I have to think on this. For FLASH, _read_data_slice is almost certainly worse than throwing away the data that's over-read.
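
        The tradeoff can be pictured with plain h5py (the file and dataset names here are placeholders):

        import h5py

        with h5py.File("data.h5", "r") as f:       # placeholder filename
            dset = f["/grid_0000/Density"]         # placeholder dataset path

            # Hyperslab selection: reads only the requested plane, but each
            # selection builds a relatively expensive HDF5 dataspace wrapper.
            plane_a = dset[:, :, 4]

            # Full read plus an in-memory slice: pulls in the extra planes and
            # throws them away, which for small blocks (e.g. 8^3 FLASH blocks)
            # is often cheaper than the per-selection overhead.
            plane_b = dset[...][:, :, 4]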

  6. Matthew Turk

    Options moving forward:

    • Refactor _read_data_slice for those frontends that use it (Enzo, GDF, Athena, Stream).
    • Remove _read_data_slice.
    • Remove FLASH's use of _read_data_slice.

  7. Matthew Turk

    Hi Andrew, I just realized that this might introduce a medium-to-big performance regression. Can we make it so that we don't call _field_in_backup when backup_filename is None, and set backup_filename to None when the backup file doesn't exist? Otherwise we're constantly hitting the disk.

    1. Andrew Myers author

      Hi Matt,

      Good point. This is fixed in the most recent update. It should now only check for the presence of a backup file once, when you load the pf.
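
      A sketch of what that guard might look like (an illustrative stub, not the actual StaticOutput code, with a simplified _field_in_backup signature):

      import os

      class OutputWithBackup:
          def __init__(self, filename):
              # Check for the backup file exactly once, at load time.
              backup = filename + "_backup.gdf"
              self.backup_filename = backup if os.path.isfile(backup) else None

          def _field_in_backup(self, field):
              # With no backup file, bail out immediately instead of hitting disk.
              if self.backup_filename is None:
                  return False
              # Otherwise open self.backup_filename and look the field up...
              return False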