Harvesting scripts speed-up

Issue #40 resolved
dsmic created an issue

I am running harvesting with 10000 games for sizes 15 down to 3 right now. After 24 hours I am down to size 6; it will take another 12 hours to finish.

More than 90% of the processor time is spent by oakfoam loading the gammas.

It would be great if the gammas were loaded only once per size, and all games for that size were then run through one oakfoam instance (similar to the solution for training gammas).

I think this would make my parallel changes unnecessary.
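
Roughly, the flow I have in mind looks like the sketch below. The run_oakfoam_harvest call is a hypothetical placeholder, since the actual harvesting invocation is different; the point is the structure: one gamma load per size.

    #!/bin/bash
    # Sketch only: load the gammas once per size and stream all games
    # through a single oakfoam instance for that size.
    # run_oakfoam_harvest is a hypothetical placeholder, not the real
    # oakfoam interface.

    GAMES_DIR="games"       # directory with the SGF files (assumption)
    OUT_DIR="harvest-out"   # per-size pattern output (assumption)
    mkdir -p "$OUT_DIR"

    for size in $(seq 15 -1 3); do
        # One instance per size: the gammas are loaded a single time
        # here, instead of once per game as before.
        find "$GAMES_DIR" -name '*.sgf' \
            | run_oakfoam_harvest --size "$size" \
            > "$OUT_DIR/patterns-$size.txt"
    done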

Comments (13)

  1. dsmic reporter

    There is another problem with the scripts:

    The harvest-combine.sh script uses a lot of RAM; it cannot handle 16000 games with 8GB of RAM. As I would like to be able to harvest >100000 games (this is what is available on KGS from players 6d and stronger) and 16GB of RAM costs $150 :), a solution would be nice.

    I could take care of this part; do you have time to do the scripting for the gammas? I am very bad with these scripts :(
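
    If the awk step keeps every pattern count in memory (my guess at why it needs 8GB; an assumption, I have not verified the script), a disk-backed alternative is to let sort do the aggregation. A minimal sketch, assuming one pattern per line in the per-game output files:

        # Sketch: combine per-game pattern files without holding all
        # counts in RAM. sort spills to temp files on disk, so memory
        # stays bounded even for >100000 games.
        cat harvest-out/patterns-*.txt \
            | sort \
            | uniq -c \
            | sort -rn \
            > combined-counts.txt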

  2. dsmic reporter

    OK, some experience: harvesting 10000 games took 2 hours with this new harvest-collection-circular2.sh.

    This is about 10 times faster than before. I did not let the last run finish; I stopped it after 30 hours with more than 10 hours still left.

    How should I handle this? The safest way: I add harvest-collection-circular2.sh and harvest-collection-circular-range2.sh to my repository.

  3. dsmic reporter

    I double-checked: it seems to give exactly the same result as the previous script if one sets the featurelist probability to 1.0.

  4. Francois van Niekerk repo owner

    The reason the circular patterns are loaded separately for each game is that I found that when the patterns were harvested with one process, the output grew very large and the post-processing (sorting and 'uniq -c') would crash.

    I had a look at the attached file:

    • Lines 20-23 don't make sense there.
    • Lines 60-64 will probably take a very long time when a large number of patterns is found.

    How much RAM was used for the post-processing parts?

  5. dsmic reporter

    harvest-collection-circular2.sh never used a significant amount of RAM. I did some googling, and it seems the Unix tools sort and uniq use temp files in an intelligent manner. But of course you are right, it takes some time.
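
    For reference, both the in-memory buffer and the temp-file location of sort can be set explicitly (the path below is a placeholder):

        # Sketch: cap sort's in-memory buffer with -S and put its temp
        # files on a big disk with -T (placeholder path). Beyond the
        # buffer size, sort merges from the temp files instead of
        # growing in RAM.
        sort -S 1G -T /mnt/bigdisk/tmp patterns.txt \
            | uniq -c \
            | sort -rn -S 1G -T /mnt/bigdisk/tmp \
            > counts.txt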

    I had no crash at all; I will keep you posted. In the original scripts, harvest-combine.sh used a lot of RAM and crashed (8GB in the awk process when harvesting 16000 games).

  6. Francois van Niekerk repo owner

    You say it takes "some time". How long are we talking? I recall now that the time it took was very long and provided zero feedback until it either completed or crashed.

  7. dsmic reporter

    Harvesting 10000 games for sizes 15-3 took about two hours in total. Those lines ran for less than 10 minutes per size; if I remember correctly, size 15 took 19 minutes in total.

    The speed-up comes from the smaller sizes, where loading the gammas took >90% of the time before.

    Detlef

  8. Francois van Niekerk repo owner

    My only issue is that 10 minutes without any feedback is not ideal; then I am unsure whether the script is working or has failed. I will think a bit about whether it is possible to somehow show progress while doing it in one step.

  9. dsmic reporter

    A little help would be to display a dot between the steps (grep, sed, grep, sort, uniq, sort). The dead time would then be perhaps 3 minutes, though it will increase with higher game counts, of course.
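
    For example, as a sketch (the grep/sed expressions are placeholders), running the stages one by one with a marker in between:

        # Sketch: each stage writes to a file and prints a marker first,
        # so the otherwise silent post-processing gives some feedback.
        step() { echo -n "$1 " >&2; }

        step grep; grep 'pattern' raw.txt    > s1.txt
        step sed;  sed 's/foo/bar/' s1.txt   > s2.txt
        step grep; grep -v '^#' s2.txt       > s3.txt
        step sort; sort s3.txt               > s4.txt
        step uniq; uniq -c s4.txt            > s5.txt
        step sort; sort -rn s5.txt           > final.txt
        echo done >&2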

    Detlef

  10. dsmic reporter

    A short note: if sort is crashing for you, it is probably because /tmp is mounted as a RAM disk?! Disk space itself is usually not the issue...
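
    To check for that and work around it (placeholder path):

        # Is /tmp a RAM-backed tmpfs?
        findmnt -T /tmp     # or: df -h /tmp

        # If so, point the temp files at a real disk instead:
        export TMPDIR=/mnt/bigdisk/tmp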

    Detlef
