change default of IOUtil::out_save_parameters to true

Issue #1565 closed
Roland Haas created an issue

It would be useful if HDF5 files contained the full set of parameter values. This could be used by postprocessing scripts to find out, e.g., about the symmetry conditions, some of the grid structure parameters, or the atmosphere settings.

IOUtil::out_save_parameters controls whether the full set of parameters or only those that have been steered are written to file (for regular output, checkpoints always write all parameters).

I would like to change the default so that all parameters are always written. If feasible, I would even want to write all parameter changes to file (either as full parameter dumps for each output, or as a full dump when the simulation starts and then deltas afterwards).

Comments (20)

  1. Erik Schnetter

    Would this mean that all HDF5 output files always contain all parameter settings? Would this be useful?

    Would it suffice to store the parameters just once, to one file, at the beginning of the simulation, and then output deltas afterwards?

    Would it help to have some discipline (how to enforce this?) to ensure that these parameter descriptions are always saved when one saves or copies HDF5 files?

    Do you need other information in addition to the parameter values, maybe information not stored in parameters? Would it make sense to output these as well? As HDF5 files are flexible, we could add such information (e.g. performance data, memory consumption, ...)

  2. Roland Haas reporter

    The hdf5 files (checkpoint as well as regular hdf5 output) already all contain a dataset "/Parameters and Global Attributes/All Parameters". For example, the yt reader could use this information to present it to the user, or to construct units or the grid structure. It may also be useful to obtain the equation-of-state information that way, or the minimum temperature or atmosphere density. Right now, by default, only some information is present in the hdf5 files (namely only those parameters that were steered).

    I am not sure if Ian's Mathematica-based postprocessing system parses parfiles or not. If it does, it may benefit from being able to parse the better-formatted "/Parameters and Global Attributes/All Parameters" dataset instead (since it already has all substitutions performed, comments removed, and special characters escaped).
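
    As a sketch of how a postprocessing script might consume this dataset: the dataset path is the one named above, and the one-"name = value"-per-line layout is as described later in this thread; h5py and the idea of returning a plain dict are my assumptions, not part of any existing tool.

```python
# Sketch: read and parse the "All Parameters" dataset from a Cactus HDF5
# file. h5py is assumed to be installed; the dataset path and the
# one-"name = value"-per-line format are as described in this thread.
def parse_all_parameters(text):
    """Turn the dataset's text into a dict mapping parameter -> value string."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip anything that is not a "name = value" line
            params[name] = value
    return params

def read_all_parameters(filename):
    """Read the dataset from an HDF5 file and parse it (requires h5py)."""
    import h5py  # assumption: h5py is available
    with h5py.File(filename, "r") as f:
        raw = f["Parameters and Global Attributes/All Parameters"][()]
    if isinstance(raw, bytes):
        raw = raw.decode("ascii", errors="replace")
    return parse_all_parameters(raw)
```

    A reader would then look up a (hypothetical) parameter with, e.g., `read_all_parameters("output.h5").get("coordbase::dx")`.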

    Right now the parameters dataset is output each time a dataset is written to the file, overwriting the previous "/Parameters and Global Attributes/All Parameters" dataset. It would be better (as indicated in the ticket) to actually save the history of parameter changes.

    I do not know off the top of my head which (if any) of our hdf5 manipulation tools copy this dataset. I would expect that hdf5_extract likely does (since it copies datasets recursively), but e.g. hdf5_slicer may not.

    All the information that characterizes a run should (I think) be in the parameters (since they are the only thing that initially describes the simulation to the code). Tabulated initial data and equation of state tables are separate files but the file names are mentioned in the parfiles.

    The HTTP thorn can steer parameters based on user input which would show up as a regular parameter change.

    It may not be possible to exactly reproduce the state of the parameters at each moment in the simulation if eg a parameter is steered more than once during the same iteration or steered (for identical iteration number) after some refinement levels are already processed. This situation is similar to other Cactus output however.

    We will need some discipline to prevent users from using parameters as evolving quantities since with these changes a parameter that is steered at each iteration effectively becomes a grid scalar.

  3. Barry Wardell

    Replying to [comment:3 rhaas]:

    I am not sure if Ian's Mathematica based postprocessing system parses parfiles or not. If it does it may benefit from being able to parse the better formatted "/Parameters and Global Attributes/All Parameters" dataset instead (since it already has all substitutions performed, comments removed, special characters escaped).

    SimulationTools currently parses the parameter file to infer a lot of required information about a simulation. It doesn't look at the "/Parameters and Global Attributes/All Parameters" dataset at all. It would certainly be nice to have a more robust system, where all metadata related to a simulation is stored in a standard and easily parseable way (such a system was proposed in #1370 - there is a distinction between metadata and parameters, but I assume it is the former that Roland would really like).

    I am not convinced that storing parameters as a dataset in each HDF5 file is the right way to go. Would it not be better to have a single place where all simulation metadata is stored?

  4. Ian Hinder

    One thing I have been considering recently is that we could set the parameter IO::parfile_write = "generate", which I believe causes the parameter file written into the simulation output directory to include all current parameters and their values, not just those explicitly set by the user. This would allow SimulationTools to obtain the actual values of various parameters without having to have knowledge of their defaults. This would be a stop-gap measure until there is a proper metadata file written by Cactus. SimulationTools would still have to interpret the parameters, attempting to make sure that it has the same logic as Cactus, so a proper "low-level" metadata file generated by Cactus would still be superior.

    SimulationTools works with entire simulations, so it doesn't look at the parameters stored in HDF5 files; it looks instead at the parameter file. Probably most analysis tools deal with gridfunctions directly, without needing them to be associated with a simulation. In that case, some of the information which is currently only available in parameters might be useful.

    On the other hand, I think it only makes sense to store information in the HDF5 file which pertains to the variable being stored. Things like atmosphere settings are properties of the simulation, not the variable, so if you need to know about them, then you should accept that you should have been dealing with simulations, not just variables. Symmetry conditions apply directly to the variable, as they tell you how the full physical variable can be reconstructed from the data, so they should be included with the variable in the HDF5 file.

    However, I don't think that parameters are the best way to go about this. Instead, I would add additional metadata as attributes on the datasets, or in a separate dataset just for this information. Parameters are designed for the user to tell Cactus what to do, and there may be several different ways of specifying the same information (see, e.g., CoordBase). You don't want to have to interpret those parameters; you want the symmetry thorns to tell you exactly what they have done.

    As a stop-gap measure however, including all parameters in the HDF5 file would probably solve your immediate problem. I assume it would not make the HDF5 file too large, so I don't see any harm in changing the default. Changing the default for the parameter file written to the output directory would help SimulationTools, but would then mean that generated parameter files would be unrecognisable, so it's probably not a good idea to force that on everyone by changing the default. I will probably start using this in my own simulations though, as I use simfactory, and hence have the original parameter file available anyway.

  5. Wolfgang Kastaun

    I maintain a Python-based postprocessing framework which currently parses the parameter files into a Python object, so I can write something like par.coordbase.dx in my scripts. However, this will become increasingly difficult now that parfiles are slowly being turned into some sort of Turing-incomplete programming language. Also, I cannot access the default values of parameters not in the parfile.

    For those reasons, I would also appreciate an hdf5 file with all parameters in it. This could either be a separate file or in the data files. Preferably, each parameter would be an hdf5 attribute. It seems that currently "/Parameters and Global Attributes/All Parameters" is just one long string in parfile syntax, with the quotation marks around strings removed and $parfile etc. already expanded. I am, however, confused about which parameters are included there. I compared it, for one of my simulations, to the original parfile. Some parameters in the original file are not in the .h5, and others not in the original are in the .h5. What is the rule here?

  6. Erik Schnetter

    Thorn Formaline's design goal is to store sufficient information about simulations to make them reproducible. Most people know that it stores the complete source code by default. It also stores all parameter values, as well as regular updates if parameters are steered. It uses a simple "KEY=VALUE" format for this, but can also generate several other formats (XML, or submission to a web server in real time so that simulation progress can be more easily observed).

    I'd be happy to modify the output format to make it easier to parse.

    Is there particular value in having the parameters stored, as string, in each HDF5 file, or would an external metadata file be sufficient?

  7. Ian Hinder

    Thanks for the pointer Erik! So, it turns out that Formaline by default outputs a lot of metadata to a file called formaline-jar.txt. In #1370, I suggested that Formaline might be used for the metadata framework, but I cautioned that this then requires people to activate it, and it might be better to have such an important feature as simulation metadata handled by the flesh so it was available in absolutely all simulation output. Formaline also, as far as I can tell, does not store "processed" metadata. I would like an API for thorns to be able to register their own metadata with this system. For example, Carpet might provide information about the actual grid structure, rather than the user having to parse this from the parameter file. Erik, would such a thing be possible?

    I would very much like these features to be available for every simulation, and people tend to disable Formaline due to the time it takes during building, and the size of the source tarballs. Having SimulationTools relying on the user having activated Formaline might cause it not to work in many cases. That's why I would prefer a mandatory flesh-based mechanism.

  8. Erik Schnetter

    Of course such an API would be possible.

    Since storing/outputting metadata requires I/O etc., I don't think it should be in the flesh; this may require e.g. XML support in the flesh etc, which can get large and complex. If you want to force people to use this infrastructure, then there are other mechanisms, e.g. the flesh could inherit from a thorn -- this would also allow people to replace the thorn with their own.

    I would very much like to "force" people to use Formaline, because I am convinced it is a good thing. However, people think differently; they may get annoyed by its overhead for a feature set they don't need or understand. I thus refrain from "forcing" people, and instead try to reduce the overhead and convince people in discussions... I've seen people delete core files before asking me for help debugging their crashes (because the files are so large!). One cannot force people; one can only explain how some feature would help them, and try hard to ensure that the first thing people encounter isn't the overhead.

  9. Erik Schnetter

    One way of reducing Formaline's overhead would be to attach the tarballs to individual thorns, not to the executable. This way, only those thorns that have changed need to be touched, and the tarballs are built earlier in the build process and thus can overlap with compiler invocations for other thorns.

  10. Frank Löffler

    Such an API would be a good idea. The API could live in the flesh; the actual implementation would be in a thorn (or thorns), much like any other IO thorn. Whether the metadata is then output as xml, hdf5, ascii or blue cheese would be up to the thorns implementing this.

    For the parameters I would prefer the solution of writing these into a separate file, with a good default name. There is really no need to have these in every hdf5 file.

  11. Ian Hinder

    If there were negligible overhead, I think people would be more likely to use it. Unfortunately, the overhead is sufficiently large that I frequently take it out of a thornlist that is taking a long time to compile on a slow filesystem, or exclude the source tarballs when I sync data, as for small simulations they are much larger than the actual data. I have thought a lot about a way to store the source information using version control IDs etc, but the main problem there is to know what is a "well-known" repository and ID, such that it will be reproducible in 5 years.

    I don't understand your statement about attaching the thorn tarballs to thorns. Do you mean the thorn libraries? As I understand it, these are typically statically linked into the executable. So the source tarballs would end up in the executable anyway, and have to be included at link time. I will open a new ticket about reducing the overhead of Formaline.

  12. Roland Haas reporter

    Wolfgang: the parameters stored in "/Parameters and Global Attributes/All Parameters" are those that Cactus detects as having been set (including, it seems, the "setting" of parameters from checkpoints). The attached short Python script parses the dataset and outputs all information to the screen (the actual parsing is these few lines):

        # dset is the h5py dataset; DecodeString undoes the escaping of
        # special characters in the stored values
        pairs = dset[()].splitlines()
        for pair in pairs:
            m = re.match("^([^ ]*) = (.*)$", pair)
            if not m:
                raise ValueError("invalid parameter setting '%s'" % pair)
            parvals[m.group(1)] = DecodeString(m.group(2))
    

    You can already get the full set of parameters with the current code by setting IOUtil::out_save_parameters = "all" explicitly. Making them all attributes would likely be possible, though I am not sure how much more useful that would be. One would then likely want to use scalar int/double values for int/real parameters, int for booleans, and some string type for strings and keywords. Arrays of parameters should then appear as arrays, which for strings requires variable-length HDF5 strings, which are not very simple to handle in C/C++ (Python hides this). Most of the code to handle this is likely somewhere in CarpetIOHDF5, but it would require some effort to get, e.g., the mapping of CCTK_REAL to double or float based on compile-time options right.
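
    To illustrate the int/real/boolean/string mapping this would need, here is a hedged sketch. The type names are Cactus parameter types, but the conversion rules shown are my assumption of what a writer would do, not existing CarpetIOHDF5 code.

```python
# Sketch of mapping Cactus parameter types to native values before writing
# them as HDF5 attributes. The conversion rules here are assumptions.
def typed_value(ctype, text):
    if ctype == "BOOLEAN":
        # store booleans as ints, as suggested above
        return 1 if text.strip().lower() in ("yes", "true", "1") else 0
    if ctype == "INT":
        return int(text)
    if ctype == "REAL":
        # whether this becomes a float or a double would depend on how
        # CCTK_REAL was chosen at compile time
        return float(text)
    return text  # KEYWORD and STRING stay as strings
```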

    Ian: currently IO::parfile_write = "generate" only writes parameters that have been set, so it is no better than the current information in hdf5 files. It also writes the parameters only at the beginning (i.e. before anything is steered), which is less useful (though no worse than hdf5 files, which only retain the last set of parameters). There is in principle the parameter 'parfile_update_every', which would have Cactus regenerate the parfile every so often; however, that is currently not implemented (see IOUtil's Startup.c).

    Concerning having "cooked", i.e. processed-from-parameters, metadata in files: there is the issue of only storing information once, since otherwise I very much suspect that the different pieces of information will contradict each other. Maybe not much of an issue for postprocessing, but I would try to avoid having the actual simulation rely on this postprocessed data. I.e., if you need to know whether a symmetry is active, you should either consult the parameters or call a function to query the symmetry state. One should not, I believe, query some generic metadata collection for this information, since it will very likely become out of sync with whatever information was initially used to compute it.

    Erik: The idea of including this information in the hdf5 files is to have them as self-contained as possible. Formaline is certainly the best place to collect the metadata, since it is designed for this. Possibly the flesh could define an API to register the metadata, and then let Formaline take care of generating and saving it. This seems to be the general Cactus approach: the flesh only defines an API and the thorns implement it. Parsing formaline-jar.txt to get the information out seems to be non-trivial (more than 6 lines). In order to detect the end-of-value for multi-line string parameter values, I seem to need to keep track of double quotes to remember when I am inside a double-quoted parameter. The format in the hdf5 files, where each parameter value is a single line, seems easier to parse for a machine (and harder for a human).
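
    The double-quote tracking described above could look roughly like this. This is a sketch only: the exact quoting and escaping rules of formaline-jar.txt are an assumption (backslash-escaping is guessed), as are the function names.

```python
# Sketch: merge physical lines of a KEY=VALUE file into logical records,
# treating a line as a continuation while we are inside an unclosed
# double-quoted string. Backslash escaping is an assumed format detail.
def _unescaped_quotes(s):
    n, i = 0, 0
    while i < len(s):
        if s[i] == "\\":
            i += 2  # skip the escaped character
            continue
        if s[i] == '"':
            n += 1
        i += 1
    return n

def join_records(lines):
    """Yield logical KEY=VALUE records from physical lines."""
    record = None
    for line in lines:
        record = line if record is None else record + "\n" + line
        if _unescaped_quotes(record) % 2 == 0:  # all quotes closed
            yield record
            record = None
    if record is not None:  # unterminated quote at end of input
        yield record
```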

    This discussion seems to have gotten a little off-track, though. I was really only asking to change the default value of an existing parameter from its current default ("only set") to another of its currently implemented values ("all"), since the current default is not very useful. I did not want to enter into the more general discussion of what metadata to include and in which format to provide it. What I propose can be implemented in 5 minutes (a change of a default value) and is immediately useful, while the current situation ''still'' writes the dataset to ''all'' hdf5 files, and that dataset is almost never useful.

  13. Ian Hinder

    Replying to [comment:7 eschnett]:

    Thorn Formaline's design goal is to store sufficient information about simulations to make them reproducible. Most people know that it stores the complete source code by default. It also stores all parameter values, as well as regular updates if parameters are steered. It uses a simple "KEY=VALUE" format for this, but can also generate several other formats (XML, or submission to a web server in real time so that simulation progress can be more easily observed).

    I'd be happy to modify the output format to make it easier to parse.

    I propose that you use the "ini" file format. It is easy to parse this in Python, and hence from a shell script:

    #!/bin/bash
    
    function read_key()
    {
        python -c "import ConfigParser; import sys; config = ConfigParser.ConfigParser(); config.read(sys.argv[1]); print config.get(sys.argv[2],sys.argv[3])" "$1" "$2" "$3"
    }
    
    function read_sections()
    {
        python -c "import ConfigParser; import sys; config = ConfigParser.ConfigParser(); config.read(sys.argv[1])
    for s in config.sections():
      print s" "$1"
    }
    
    read_sections "file.ini"
    read_key "file.ini" "sectionname" "keyname"
    

    It is a (relatively) standard format, and there should be similar libraries for whatever analysis framework one might be using. We might have to tweak things like quoting and multi-element values to work with different readers, but I think this is still preferable to using a new format.

  14. Erik Schnetter

    Roland: We tried storing parameters as attributes, and found that this has a large size and time overhead. This was unacceptable, and we thus had to revert this change. We also used to output much more information about the grid structure, but also had to remove this output again, for the same reasons.

    Mapping to the respective HDF5 types is trivial; the respective functionality exists in Carpet, based on C++ templates.

    We can certainly change the format of multi-line strings in Formaline. The current format is similar to the one used in Cactus parameter files.

    Cactus generates parameter files in the output directory. If there is no option to generate all parameter values (including the defaults), then we should add this. I would like to add that having the original parameter file around is also important, since it often contains comments that are useful to know later on, even if they are not usable in visualization scripts.

    Instead of parsing parameters, would it make sense to output more information regarding the grid structure? The grid structure should be easier to parse (or can be modified to be so), and is the only authoritative information anyway. The influence of parameters on run-time decisions is often very indirect, e.g. they may be ignored after recovering. In particular for post-processing or visualization, I think you should be looking at other sources of information, and we should add these to make this possible.

    Ian: The config-format does not easily handle multi-line strings, or strings with beginning/trailing white space. I designed the parameter storage format to be similar to Cactus parameter files. If we want to be able to use the config-file format, then we should probably at the same time design a parameter file format for it. I am not particularly fond of multi-line strings or leading/trailing white space, so we may just do away with them at the same time, and design better mechanisms to handle long lists (of thorns or variables) in parameter files.

  15. Roland Haas reporter

    If we are willing to re-design the parfile syntax, we may want to consider using libconfig http://www.hyperrealm.com/libconfig/ (for an example see http://www.hyperrealm.com/libconfig/test.cfg.txt), which offers well-structured, nested configuration files with built-in support for int, double, string and list-of-type values. It also keeps track of which line and file each setting comes from, and allows the in-memory structure representing the settings to be modified.

    If we want to retain the ability to do simple arithmetic in parfiles (which I like), then we'd have to declare most of our types as "string" as far as the library is concerned. This currently requires double quotes, but that can likely be changed, since the end-of-value is indicated by a semicolon ";" anyway.
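
    For illustration, here is what a Cactus-flavoured parfile fragment might look like in libconfig syntax. The parameter names are made up for the example; the syntax (groups, typed scalars, arrays, `;` terminators, `//` comments) follows libconfig's documented grammar.

```
// hypothetical Cactus-style settings in libconfig syntax
coordbase :
{
  dx     = 0.25;      // double
  ncells = 128;       // int
  domain = "full";    // string
};

active_thorns = [ "CoordBase", "Carpet" ];  // array of strings
```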

  16. Barry Wardell

    Since we're proposing formats, I'd like to throw JSON http://www.json.org/ into the mix. It is lightweight, easy for humans to read and write and easy for machines to parse and generate.

  17. Roland Haas reporter
    • changed status to resolved

    After discussion in today's call: it seems better to provide this information as HDF5 datasets/attributes rather than re-parsing the parameters, which are non-unique.

    Roland will contact the yt developers to ask exactly what data they require and propose a scheme to add them to the files.
