Very large .out.nc files are expensive to read or transfer -> restructure the netCDF output

Issue #197 new
Michael Hardman created an issue

When running large simulations, we (myself and Juan Ruiz Ruiz) typically create very large .out.nc files containing a vast range of possibly interesting data, from geometrical data, to fluxes as functions of t and of t, kx, ky, to full fluctuation data as a function of all spatial variables. When we perform very large simulations over multiple scales (resulting in 100GB+ output files), we find that this data takes a long time to transfer between machines for analysis, and is also expensive to read with a simple python script, due to the large amount of data in the file.

It would be convenient to give the user the option to store inexpensive data in a separate netCDF file from the expensive fluctuation data. This is especially pertinent in very large simulations, which we might only be able to perform once or twice, and for which we might be tempted to write out the fluctuation data even if we are not sure that we will need it for analysis. This option would allow rapid analysis of basic diagnostic data, whilst still giving the option to speculatively store the more expensive fluctuation data.

Would it be possible to consider such an option?

@David Dickinson

Comments (4)

  1. Peter Hill

    Which GS2 version are you using, and what python library are you using to read the files? GS2 8.1.2 should use netCDF-4, which is backed by HDF5. This enables “lazy loading”, depending on the python library, so that you only actually read in (the parts of) the variables you actually use. This doesn’t help with transferring large files, of course. xarray is one Python library that definitely supports lazy loading. There are also libraries like dask which can do lazy and/or parallel computation on netCDF files.

    So one option for files you already have is to use xarray to read just the bits you want and write them to a separate file, which you could then transfer much faster.
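    A minimal sketch of that subset-and-copy workflow with xarray's lazy loading. The variable names ("phi2", "phi_t") and file names here are illustrative stand-ins, not guaranteed to match real GS2 output:

    ```python
    import numpy as np
    import xarray as xr

    # Build a small stand-in for a GS2 output file so the sketch is runnable;
    # a real file would be opened directly with xr.open_dataset.
    ds = xr.Dataset(
        {
            "phi2": ("t", np.random.rand(100)),
            "phi_t": (("t", "kx", "ky"), np.random.rand(100, 8, 8)),
        },
        coords={"t": np.linspace(0.0, 10.0, 100)},
    )
    ds.to_netcdf("input.out.nc")

    # open_dataset reads only metadata up front; variable data is loaded
    # lazily, when it is actually accessed.
    with xr.open_dataset("input.out.nc") as full:
        # Keep just the cheap diagnostic (coordinates come along
        # automatically) and write it to a small, fast-to-transfer file.
        full[["phi2"]].to_netcdf("reduced.out.nc")

    with xr.open_dataset("reduced.out.nc") as reduced:
        reduced_vars = sorted(reduced.variables)  # ['phi2', 't']
    ```

    Because the read is lazy, the large "phi_t" array is never pulled into memory when writing the reduced file.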

  2. David Dickinson

    Most systems which provide ncdump should also provide nccopy, which allows one to create a copy of an existing netCDF file whilst changing its properties, including copying only certain variables.

    For example, nccopy -V t,phi2 input.out.nc reduced.out.nc will create reduced.out.nc, which contains all dimensions and attributes from input.out.nc but only the t and phi2 variables. It is also possible to ask nccopy to apply compression to the resulting file with the -d flag, which takes an integer from 0 to 9, with 0 being no compression and 9 being maximal compression.

    In case the font is unclear, that is an upper-case V. A lower-case v has a similar effect but retains the other variable definitions whilst not writing their data (so it still saves space).
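    For completeness, the same select-and-compress copy can be sketched in python via xarray's encoding options, which map onto nccopy's -d level; the file and variable names below are illustrative:

    ```python
    import os

    import numpy as np
    import xarray as xr

    # Stand-in for an existing output file; "phi2" is an illustrative name.
    xr.Dataset({"phi2": ("t", np.zeros(100000))}).to_netcdf("demo.out.nc")

    with xr.open_dataset("demo.out.nc") as full:
        # "complevel" plays the role of nccopy's -d level (0 = none, 9 = max).
        full.to_netcdf(
            "demo_compressed.out.nc",
            encoding={"phi2": {"zlib": True, "complevel": 4}},
        )

    plain = os.path.getsize("demo.out.nc")
    compressed = os.path.getsize("demo_compressed.out.nc")
    ```

    For highly redundant data the compressed copy can be dramatically smaller; for noisy fluctuation data the gain is more modest.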

  3. Michael Hardman reporter

    Thank you for laying out the options above. I have not yet had time to try out these suggestions, but they look promising. I shall have to consider how to modify my (now rather complicated) scripts. I will reply again as soon as I have more information.
