Option to run se_root and et_look in slices

Issue #18 open
Bich Tran created an issue

In case the project is too large (e.g., time: 203 days, ~600x1000 pixels), I get a memory error even when I specify chunks. Even chunks = {'time': 1, 'x': 500, 'y': 500} does not work with 16GB RAM. I also notice that the se_root model has more 'time' steps, since it works on instantaneous values.
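
For a sense of scale, a back-of-the-envelope calculation (assuming float64 values) shows why 16GB fills up quickly once several variables and intermediates are held in memory:

# One float64 variable at the size mentioned above (203 time steps,
# 600 x 1000 pixels) already takes roughly a gigabyte in memory.
n_time, n_y, n_x = 203, 600, 1000
print(n_time * n_y * n_x * 8 / 1024**3)  # ~0.91 GiB per variable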

My workarounds are:

(1) Create and run projects with shorter periods, then merge the results. But this means the user needs some experience with their computational capacity. For example, after many trials, I managed to run a maximum of 3 months at 600x1000 size.

(2) In case the project has already finished the downloading steps and has a complete se_root_in.nc file, I slice the dataset into smaller datasets and run se_root and et_look on each slice. For example:

import os
import shutil

import pywapor

# 'project' and 'project_folder' are defined earlier in the usual pywapor workflow.
se_root_in = project.run_pre_se_root()

### RUN SE_ROOT in smaller files
dims = dict(se_root_in.sizes)
print(dims)

def generate_tuples(X, Y):
    # Split the range [0, Y) into consecutive (start, stop) pairs of length X,
    # including a shorter final block when Y is not a multiple of X.
    tuples_list = []
    a = 0
    while a * X < Y:
        tuples_list.append((a * X, min((a + 1) * X, Y)))
        a += 1
    return tuples_list

X = 70  # size of the time slice
Y = dims['time']
blocks = generate_tuples(X, Y)

destination_folder = os.path.join(project_folder, 'se_root_out')
if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)

# se_root writes its output to se_root_out.nc in the project folder.
source_file = os.path.join(project_folder, 'se_root_out.nc')

for block in blocks:
    # Run se_root on one block of time steps.
    ds = se_root_in.isel(time=slice(block[0], block[1]))
    ds_out = pywapor.se_root.main(ds, se_root_version="v3",
                                  chunks={"time": 1, "x": 632, "y": 550})
    print(f'Finished block {block[0]} {block[1]}')
    # Move the output aside so the next block does not overwrite it.
    destination_file = os.path.join(destination_folder,
                                    f'se_root_out_{block[0]}_{block[1]}.nc')
    shutil.move(source_file, destination_file)
    ds = None
    ds_out = None
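
After all blocks have finished, the per-block files can be combined back into a single dataset, for example with xarray. This is only a sketch; it assumes the per-block outputs concatenate cleanly along the time coordinate, and the merged file name is made up:

import glob
import os

import xarray as xr

# Reopen all per-block outputs lazily and let xarray align them on their
# time coordinates, then write one merged file.
files = glob.glob(os.path.join(destination_folder, 'se_root_out_*.nc'))
se_root_out = xr.open_mfdataset(files, combine='by_coords')
se_root_out.to_netcdf(os.path.join(project_folder, 'se_root_out_merged.nc'))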

Maybe it would be interesting to develop a pywapor module that can estimate the machine's RAM, divide the data, and run se_root and et_look in smaller batches like this.
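
A rough sketch of how such an estimate could work (psutil is assumed to be available; the number of variables per time step and the safety factor are made-up placeholders, not values from pywapor):

import math

import psutil  # assumption: psutil is available to query system memory

def estimate_time_block_size(n_y, n_x, n_vars=20, bytes_per_value=8, safety=0.25):
    # Hypothetical heuristic: how many time steps fit in a fraction of the
    # currently available RAM, given the raster size and a guess at the
    # number of variables held per time step.
    available = psutil.virtual_memory().available
    bytes_per_timestep = n_y * n_x * n_vars * bytes_per_value
    return max(1, math.floor(available * safety / bytes_per_timestep))

print(estimate_time_block_size(600, 1000))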

Comments (1)

  1. bert.coerver

    I don't think it's a good idea to divide your data into blocks like this, because this is exactly what the Dask/Xarray chunks are for (and are supposed to be doing).

    So I would prefer to figure out why you are still getting MemoryErrors, even when using small chunks. I suggest looking at this tutorial, which explains how you can monitor the Dask performance in more depth when running se_root/et_look. It also allows you to limit the amount of RAM you want to use and the number of workers and threads (useful to test whether the code can also work on PCs with less RAM). In short, you'd have to run this before importing pywapor (e.g.):

    from dask.distributed import Client
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
    

    Then you can inspect client to get a link to a page that opens in your browser with a lot of live stats about your processes.
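
    For example, the dashboard URL can be printed directly (dashboard_link is an attribute of dask.distributed.Client):

    print(client.dashboard_link)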

    If you run the lines of se_root one by one (and call .compute() each time), you could probably figure out where things get ugly. One possibility to then make the computation faster is to add one (or a few) intermediate points at which we save the results to a (temporary) netCDF file.
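
    A minimal, self-contained sketch of that checkpoint idea (the dataset below is a dummy stand-in for an intermediate se_root result; the variable name and file name are hypothetical):

    import numpy as np
    import xarray as xr

    # Build a small dummy dataset standing in for an intermediate result.
    ds_step = xr.Dataset(
        {"ndvi": (("time", "y", "x"), np.random.rand(3, 4, 5))}
    ).chunk({"time": 1})

    # Writing to netCDF forces evaluation; reopening with chunks gives the
    # later steps a fresh, small Dask task graph.
    ds_step.to_netcdf("se_root_intermediate.nc")
    ds_step = xr.open_dataset("se_root_intermediate.nc", chunks={"time": 1})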
