Training steps take variable amounts of time
Issue #193
new
I have noticed on numerous occasions that, if you examine the timestamps in the training log files, training steps often take (fairly drastically) different amounts of time. Since there is no internal line search, the mathematical operations within any step should be the same as in any other -- so if we can figure out what delays the slow steps, I suspect we can speed things up.
(I'm getting this on our list before the code sprint.)
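For reference, the step-to-step gap can be measured directly from the log timestamps. A minimal sketch, assuming an ISO-style timestamp at the start of each step line -- the regex would need adjusting to the real log layout:

```python
# Sketch: measure per-step wall time from training-log timestamps.
# The log format here (ISO timestamp followed by a "step" marker) is
# an assumption -- adapt STEP_RE to the actual log layout.
import re
from datetime import datetime

STEP_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*step',
                     re.IGNORECASE)

def step_durations(lines):
    """Return the wall-time gaps (in seconds) between successive steps."""
    times = []
    for line in lines:
        match = STEP_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1),
                                           '%Y-%m-%d %H:%M:%S'))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding this an open log file makes the slow steps stand out immediately as outliers in the returned list.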
Comments (2)
reporter: I just installed a new scipy (1.1.0 instead of 0.19.1) on my local machine and the problem essentially went away, so it seems scipy has fixed something upstream. Should we require scipy>=1.1.0?
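If we do add a floor, a minimal import-time guard could look like the following. The 1.1.0 cutoff is an assumption taken from the comment above (the exact floor is still an open question), and `scipy_is_new_enough` is a hypothetical helper name:

```python
# Sketch: compare the installed scipy version against an assumed floor.
# The 1.1.0 cutoff is taken from the comment above and is not final.
MINIMUM_SCIPY = (1, 1, 0)

def _vtuple(version):
    """Parse a dotted 'X.Y.Z' version string into a comparable int tuple."""
    return tuple(int(part) for part in version.split('.')[:3])

def scipy_is_new_enough(version):
    """Return True if the given scipy version string meets the floor."""
    return _vtuple(version) >= MINIMUM_SCIPY
```

At import time one would call this with `scipy.__version__` and raise an informative error pointing at this issue if it returns False.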
I think I somewhat understand this now -- at least in the tests I ran, this is surprisingly a scipy issue, inside fmin_bfgs.
I found a system where it would push out 2-3 steps at <1 second per step, then run at 20 seconds per step after that (with an occasional <1-second step). By starting from a checkpoint, I made the behavior repeatable.
I put little log statements with timers in my calls to get_loss and get_lossprime, and found that those calls themselves were not variable in duration; watching the log file, though, it would sometimes pause before starting one of those calls. If I hit ctrl-c during one of those 20-second pauses, I would inevitably get the same traceback from inside scipy.
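The instrumentation was along these lines -- a sketch, with `timed` as a hypothetical decorator and `get_loss` as a stand-in for the real loss routine:

```python
# Sketch: log entry time and duration of each loss call. The gap between
# one call's "took" line and the next call's "entering" line is time the
# optimizer spends in its own (scipy-side) bookkeeping.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('timing')

def timed(fn):
    """Wrap fn to log when it starts and how long it ran."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info('entering %s', fn.__name__)
        start = time.time()
        result = fn(*args, **kwargs)
        log.info('%s took %.3f s', fn.__name__, time.time() - start)
        return result
    return wrapper

@timed
def get_loss(params):
    """Stand-in for the real loss routine; not the actual function."""
    return sum(p ** 2 for p in params)
```

With both loss calls wrapped this way, steady per-call durations plus long gaps between calls point the finger at the optimizer rather than the loss evaluation.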
So it seems to be something in the linear algebra; my guess is the Hessian calculation. This might mean the limited-memory version of the optimizer would work better. I tried the other optimizers, and they are broken, as noted in Issue #213.
After we fix #213, we should probably figure out a better default optimizer to take care of this problem.
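For what it's worth, trying the limited-memory variant is a small change through scipy's `minimize` interface. A sketch, with a placeholder quadratic objective standing in for the real get_loss/get_lossprime:

```python
# Sketch: full-memory vs limited-memory BFGS in scipy. The objective and
# gradient below are placeholders, not the project's real loss routines.
import numpy as np
from scipy.optimize import fmin_bfgs, minimize

def get_loss(x):
    """Placeholder quadratic objective with minimum at x = 1."""
    return float(np.sum((x - 1.0) ** 2))

def get_lossprime(x):
    """Analytic gradient of the placeholder objective."""
    return 2.0 * (x - 1.0)

x0 = np.zeros(5)

# Full BFGS: maintains a dense (n x n) inverse-Hessian approximation,
# the suspected source of the slow steps discussed above.
x_bfgs = fmin_bfgs(get_loss, x0, fprime=get_lossprime, disp=False)

# Limited-memory BFGS: keeps only a few correction pairs instead of the
# dense matrix, so the per-step linear algebra stays cheap.
result = minimize(get_loss, x0, jac=get_lossprime, method='L-BFGS-B')
```

For the small placeholder problem both converge to the same minimum; the difference would show up in per-step cost as the parameter count grows.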