Training steps take variable amounts of time
Issue #193
new
I have noticed on numerous occasions that, if you examine the timestamps in the training log files, training steps often take (fairly drastically) different amounts of time. Since there is no internal line search, the mathematical operations within any step should be the same as in any other -- so if we can figure out what delays the slow steps, I suspect we can speed things up.
(I'm getting this on our list before the code sprint.)
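For reference, the step-to-step gap can be measured directly from the log timestamps. A minimal sketch, assuming an ISO-style timestamp at the start of each step line -- the regex would need adjusting to the real log layout:

```python
# Sketch: measure per-step wall time from training-log timestamps.
# The log format here (ISO timestamp followed by a "step" marker) is
# an assumption -- adapt STEP_RE to the actual log layout.
import re
from datetime import datetime

STEP_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*step',
                     re.IGNORECASE)

def step_durations(lines):
    """Return the wall-time gaps (in seconds) between successive steps."""
    times = []
    for line in lines:
        match = STEP_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1),
                                           '%Y-%m-%d %H:%M:%S'))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding this an open log file makes the slow steps stand out immediately as outliers in the returned list.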
Comments (2)
reporter: I just installed a new scipy (1.1.0 instead of 0.19.1) on my local machine and the problem essentially went away, so it seems scipy has fixed something upstream. Should we require scipy>=1.1.0?
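If we do add a floor, a minimal import-time guard could look like the following. The 1.1.0 cutoff is an assumption taken from the comment above (the exact floor is still an open question), and `scipy_is_new_enough` is a hypothetical helper name:

```python
# Sketch: compare the installed scipy version against an assumed floor.
# The 1.1.0 cutoff is taken from the comment above and is not final.
MINIMUM_SCIPY = (1, 1, 0)

def _vtuple(version):
    """Parse a dotted 'X.Y.Z' version string into a comparable int tuple."""
    return tuple(int(part) for part in version.split('.')[:3])

def scipy_is_new_enough(version):
    """Return True if the given scipy version string meets the floor."""
    return _vtuple(version) >= MINIMUM_SCIPY
```

At import time one would call this with `scipy.__version__` and raise an informative error pointing at this issue if it returns False.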
I think I somewhat understand this now -- at least in the tests I ran, this is surprisingly a scipy issue, inside fmin_bfgs.
I found a system where it would push out 2-3 steps at <1 second per step, then run at 20 seconds per step after that (with an occasional <1-second step). By starting from a checkpoint, I made the behavior repeatable.
I put little log statements with timers in my calls to get_loss and get_lossprime, and found that those calls themselves were not variable in duration; watching the log file, though, it would sometimes pause before starting one of those calls. If I hit ctrl-c during one of those 20-second pauses, I would inevitably get the same traceback from inside scipy.
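The instrumentation was along these lines -- a sketch, with `timed` as a hypothetical decorator and `get_loss` as a stand-in for the real loss routine:

```python
# Sketch: log entry time and duration of each loss call. The gap between
# one call's "took" line and the next call's "entering" line is time the
# optimizer spends in its own (scipy-side) bookkeeping.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('timing')

def timed(fn):
    """Wrap fn to log when it starts and how long it ran."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info('entering %s', fn.__name__)
        start = time.time()
        result = fn(*args, **kwargs)
        log.info('%s took %.3f s', fn.__name__, time.time() - start)
        return result
    return wrapper

@timed
def get_loss(params):
    """Stand-in for the real loss routine; not the actual function."""
    return sum(p ** 2 for p in params)
```

With both loss calls wrapped this way, steady per-call durations plus long gaps between calls point the finger at the optimizer rather than the loss evaluation.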
So it seems to be something in the linear algebra; my guess is the Hessian calculation. This might mean the limited-memory version of the optimizer would work better. I tried the other optimizers, and they are broken, as noted in Issue #213.
After we fix #213, we should probably figure out a better default optimizer to take care of this problem.
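For what it's worth, trying the limited-memory variant is a small change through scipy's `minimize` interface. A sketch, with a placeholder quadratic objective standing in for the real get_loss/get_lossprime:

```python
# Sketch: full-memory vs limited-memory BFGS in scipy. The objective and
# gradient below are placeholders, not the project's real loss routines.
import numpy as np
from scipy.optimize import fmin_bfgs, minimize

def get_loss(x):
    """Placeholder quadratic objective with minimum at x = 1."""
    return float(np.sum((x - 1.0) ** 2))

def get_lossprime(x):
    """Analytic gradient of the placeholder objective."""
    return 2.0 * (x - 1.0)

x0 = np.zeros(5)

# Full BFGS: maintains a dense (n x n) inverse-Hessian approximation,
# the suspected source of the slow steps discussed above.
x_bfgs = fmin_bfgs(get_loss, x0, fprime=get_lossprime, disp=False)

# Limited-memory BFGS: keeps only a few correction pairs instead of the
# dense matrix, so the per-step linear algebra stays cheap.
result = minimize(get_loss, x0, jac=get_lossprime, method='L-BFGS-B')
```

For the small placeholder problem both converge to the same minimum; the difference would show up in per-step cost as the parameter count grows.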