Output for benchmarking data

Issue #128 new
Joseph Parker created an issue

Currently, when we run GS2’s benchmarking suite (make benchmarks), timing data is written in the format

<number of procs> <time for the benchmark>

appended into a file called

<run_name>.timing.<date>.<GK_SYSTEM>.<git_hash>

This format is easy to work with, except:

  • the runs aren’t sorted by processor count
  • having the date in the filename is really inconvenient if you gather data from a single run over multiple days
  • it seems odd to write most of the data into the filename rather than the file

Question Is there a more sensible way of presenting the run data? I think actually a filename like <run_name>.timing.<GK_SYSTEM>.<git_hash> is sensible, as I should never need to plot times for two different hashes or HPC systems as part of the same performance data set. Thoughts? (Possibly related to #94)
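For context, working with the current two-column format might look something like the sketch below. The raw data is inlined here for illustration; in practice it would be read from a <run_name>.timing.<date>.<GK_SYSTEM>.<git_hash> file.

```python
import io

import numpy as np

# Timing data in the current two-column format, "<nprocs> <time>".
# Inlined here for illustration; rows arrive in whatever order the
# jobs happened to finish.
raw = io.StringIO("8 12.5\n2 40.1\n4 21.3\n")

# Sort by processor count before plotting, since the file itself
# is unsorted.
data = np.loadtxt(raw)
data = data[data[:, 0].argsort()]
procs, times = data[:, 0], data[:, 1]
```

This works, but everything other than proc count and time has to be recovered by parsing the filename.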

Comments (3)

  1. David Dickinson

    Yes I think removing date from the file name is sensible if possible. I guess this was done to allow reuse of the directory etc. without clobbering existing data (say if the changes are to do with testing a submission script, machine state, module versions etc.). I guess your current proposal would lead to clobbering in these instances or can the code append instead (and is this what we would want)?

    I think the best format probably depends somewhat on what we want to use this data for etc. I’d be tempted to put all the information into the file itself so that we could imagine just concatenating all the data together and then using something like pandas to pull out the slices we want (say filtering on machine/run_name etc.).

    Depending on how detailed we want to be we might also need to store things like compiler (+version), make flags used (e.g. debug or not) etc.
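The concatenate-then-filter workflow David describes could look roughly like this; the column names and values are purely illustrative, not an agreed format.

```python
import io

import pandas as pd

# A hypothetical CSV layout with the metadata moved from the
# filename into the file itself. Column names and values are
# made up for illustration.
csv = io.StringIO(
    "run_name,system,git_hash,nprocs,time\n"
    "cyclone,machine_a,abc1234,2,40.1\n"
    "cyclone,machine_a,abc1234,4,21.3\n"
    "cyclone,machine_b,abc1234,2,45.0\n"
)

df = pd.read_csv(csv)

# Files concatenated from many runs can then be sliced on any
# column, e.g. one machine's scaling curve:
curve = df[df["system"] == "machine_a"].sort_values("nprocs")
```

Extra metadata (compiler, make flags) would just become more columns, with no change to the reading code.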

  2. Peter Hill

    I agree with David, having all the information in the file itself likely makes it much easier to consume the data. Moving the date from the filename into the file itself makes sense especially.

  3. Joseph Parker reporter

    Thanks both! David, I think the idea was to get different timing files for different machines/commits/days without clobbering existing data. Except, because different processor counts require different jobs, the data is always appended anyway.

    I imagine the typical use case will be wanting to plot time versus proc count. I was thinking of using numpy.loadtxt rather than pandas, but it probably is better to put everything into the file itself in a CSV-like way. The only issues with that are:

    1. the data that gets written will vary from one benchmark to another, so the writing routine will no longer be generic (or would need more thought)
    2. we change variables (like layout) by providing different input files, so at the moment, changing variables means changing the run name, and these get written to separate files anyway.

    On reflection, I think this does do what I want at the moment, except for writing the date into the filename. Might be one to revisit if/when we could pass input variables on the command line, and point 2 above no longer applies.
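    On point 1, one way a writing routine could stay generic is to take each benchmark's data as a dict and let the CSV layer derive the columns. A minimal sketch (the helper name and field values are hypothetical):

```python
import csv
import io


def append_timing(stream, fields, write_header=False):
    """Append one benchmark result; columns come from the dict keys,
    so different benchmarks can write different fields."""
    writer = csv.DictWriter(stream, fieldnames=sorted(fields))
    if write_header:
        writer.writeheader()
    writer.writerow(fields)


out = io.StringIO()
append_timing(out, {"nprocs": 4, "time": 21.3, "layout": "lxyes"},
              write_header=True)
```

    This only works if every row in a given file has the same fields, so it does not remove the need to think about what each benchmark records.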
