Speed it up!

Issue #1 new
Tim Cera repo owner created an issue

Don't know exactly why it is so sloooooow. When moving through the file I do use seek() often, but always moving forward in the file.

Comments (4)

  1. David Lampert

    Tim, thanks for this tool. I parsed through it and sort of rewrote the "_get_data" method for my own purposes. Anyway, after I did that it took about 2/3 as long to read the file as before. However, I don't think the bottleneck has anything to do with that method in your code.

    I am not really interested in creating pandas dataframes or printing to screen, so I can get the info from the file using my own get_data method now. But I went ahead and looked into the slowness of the code, and this is what I found. First, I didn't get any matches using the "extract" method. I am not sure whether that is because I am not passing the arguments right or there is a problem. I tried to read the "base.hbn" file from BASINS using the following commands:

    import hspfbintoolbox

    data = hspfbintoolbox.extract('base.hbn', 'BIVL')

    and got the message that "The label specifications matched no records in the file."

    So I installed pandas and tried looking at the "dump" method.

    You are creating new Pandas dataframes in every iteration of the loop on the "skeys" list. Allocating memory for those dataframes takes time. Then you are "joining" them, which takes even more time: each time you do that, Python has to go get more memory for a new dataframe and then copy the info into it. This gets progressively slower as you are copying more and more info.

    I would guess that if you created a big dataframe before looping and just filled in the values, things would speed up. Pre-allocate the memory. I don't have time to look into how to do this with pandas since at this point I don't use it, but I suspect this (memory pre-allocation) will speed things up a bunch.
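
    Just to illustrate what I mean (a rough sketch only, with a made-up record structure, not the actual hspfbintoolbox code): build plain Python containers first and construct the dataframe once, instead of joining a new dataframe on every pass.

    import pandas as pd

    # Slow pattern: a new DataFrame is created and joined on every
    # iteration, so pandas re-allocates and copies everything accumulated
    # so far each time through the loop.
    def collect_slow(records):
        result = pd.DataFrame()
        for key, dates, values in records:
            frame = pd.DataFrame({key: values}, index=dates)
            result = result.join(frame, how='outer')
        return result

    # Faster pattern: gather everything into ordinary containers and
    # construct the DataFrame a single time at the end.
    def collect_fast(records):
        columns = {key: pd.Series(values, index=dates)
                   for key, dates, values in records}
        return pd.DataFrame(columns)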

  2. Tim Cera reporter

    Using 'hspfbintoolbox catalog base.hbn' shows only yearly time-series, no BIVL.

    Now, I thought that the following would work...

    from hspfbintoolbox import hspfbintoolbox
    data = hspfbintoolbox.extract('base.hbn', 'yearly')
    

    but it doesn't, which is something that I should fix. You actually have to explicitly include the label, so the following does work (the label ',,,' is equivalent to the dump command):

    data = hspfbintoolbox.extract('base.hbn', 'yearly', ',,,')
    

    So that should get extract working for you.

    You can have multiple labels instead of using dump. For example, to extract AGWI and IGWI time-series from all PERLNDS:

    data = hspfbintoolbox.extract('base.hbn', 'yearly', ',,,AGWI', ',,,IGWI')
    

    For the speed issues, thank you for taking a look. The only way that I can think of to pre-allocate the Pandas dataframe would be to read through the datafile once to count the number of matches to the label(s). I don't know whether that extra pass is more expensive than what it saves; it might depend on the number of matches. It would be interesting to plot the number of matches against the time required. Definitely worth a look through the Pandas feature set to see if it is doable. How did you get the improvement that you did when you rewrote _get_data? Even that would be nice.
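
    A possible shape for that, as a sketch only (the record structure and names here are invented, not the current code): count the matching records first, pre-allocate a NumPy array, then fill it in place and only wrap it in a dataframe at the very end.

    import numpy as np
    import pandas as pd

    def build_preallocated(matches, labels):
        # 'matches' is assumed to already be the filtered list of
        # (label, date, value) records for the requested labels.
        dates = sorted({date for _, date, _ in matches})
        data = np.full((len(dates), len(labels)), np.nan)

        row = {date: i for i, date in enumerate(dates)}
        col = {label: j for j, label in enumerate(labels)}

        # Fill the pre-allocated array in place; nothing is copied or
        # re-joined as the loop runs.
        for label, date, value in matches:
            data[row[date], col[label]] = value

        return pd.DataFrame(data, index=dates, columns=labels)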

    Kindest regards, Tim

  3. David Lampert

    I believe you can just copy this to a file and try to run it:

    https://github.com/djlampert/PyHSPF/blob/master/src/pyhspf/core/hbnreader.py

    But the result is a series of dictionaries. I think it could be simplified even more, but I am not sure it's worth the effort. If I were trying to do what you've done, I would start with this built-in Python structure and then figure out how to use it to make a dataframe, CSV, whatever, based on the user query. The hbn file it reads is here:

    https://github.com/djlampert/PyHSPF/blob/master/examples/data/base.hbn

    This file came with the last version of BASINS that I used. It has a little over 2000 data records, I think.

    In any case, using your _get_data method and this reader, it took 0.03 s and 0.02 s, respectively, to read the records (the whole file), so that is not the bottleneck. The difference is just the algorithm I used to generate the indices of the PERLND, IMPLND, RCHRES byte strings with the yield statement, I think.
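
    The gist of that part is roughly the following (a simplified sketch, not the actual hbnreader code): scan the raw bytes once and yield the offset of every operation-type marker, then parse each record starting from its offset.

    def record_offsets(data, markers=(b'PERLND', b'IMPLND', b'RCHRES')):
        """Yield (offset, marker) for every occurrence of an operation-type
        byte string in the raw bytes 'data'."""
        for marker in markers:
            start = data.find(marker)
            while start != -1:
                yield start, marker
                start = data.find(marker, start + 1)

    # Usage: read the whole .hbn file into memory once, then walk the
    # offsets in order to locate each record header.
    with open('base.hbn', 'rb') as f:
        raw = f.read()
    for offset, marker in sorted(record_offsets(raw)):
        print(offset, marker.decode())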

    It took about 3 minutes to read the file using the "dump" method. I did some time analysis, and it was taking 0.01 s to make each new dataframe; joining the frame with the others took 0.01 s initially but rose to over 0.1 s later on as the memory requirements grew (as I described above). HTH.
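
    The timing itself was nothing fancy; something along these lines (a quick synthetic sketch, not the exact script I ran) shows the per-join cost climbing as the accumulated frame grows.

    import time
    import numpy as np
    import pandas as pd

    result = pd.DataFrame()
    for i in range(500):
        frame = pd.DataFrame({'col{0}'.format(i): np.random.rand(2000)})
        start = time.time()
        # Each join copies everything accumulated so far into a brand new
        # DataFrame, so this line gets slower as 'result' grows.
        result = result.join(frame, how='outer')
        if i % 100 == 0:
            print(i, time.time() - start)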

    Dave
