Training steps take variable amounts of time

Issue #193 new
andrew_peterson repo owner created an issue

I have noticed on numerous occasions that, if you examine the time stamps in the training log files, training steps often take (fairly drastically) different amounts of time. Since there is no internal line search, the mathematical operations within any step should be the same as in any other -- so if we can figure out what delays the slow steps, I suspect we can speed things up.

(I'm getting this on our list before the code sprint.)

Comments (2)

  1. andrew_peterson reporter

    I think I somewhat understand this now -- at least in the tests I ran, this is, surprisingly, a scipy issue, inside fmin_bfgs.

    I found a system where it would push out 2-3 steps at <1 second per step, then settle into about 20 seconds per step after that (with an occasional <1 second step). By starting from a checkpoint, I made the behavior repeatable.

    I put little log statements with timers around my calls to get_loss and get_lossprime and found that those calls were not variable in duration; but, watching the log file, the optimizer would sometimes wait before starting one of those calls. If I hit Ctrl-C during one of those 20-second pauses, I would inevitably get:

    $ python better.py 
    ^CTraceback (most recent call last):
      File "better.py", line 30, in <module>
        calc.train(images='../../training.traj')
      File "/home/aap/Dropbox/repositories/Amp/amp/amp/__init__.py", line 371, in train
        parallel=self._parallel)
      File "/home/aap/Dropbox/repositories/Amp/amp/amp/model/neuralnetwork.py", line 225, in fit
        result = self.regressor.regress(model=self, log=log)
      File "/home/aap/Dropbox/repositories/Amp/amp/amp/regression/__init__.py", line 89, in regress
        **self.optimizer_kwargs)
      File "/usr/lib/python2.7/dist-packages/scipy/optimize/optimize.py", line 859, in fmin_bfgs
        res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
      File "/usr/lib/python2.7/dist-packages/scipy/optimize/optimize.py", line 975, in _minimize_bfgs
        Hk = numpy.dot(A1, numpy.dot(Hk, A2)) + (rhok * sk[:, numpy.newaxis] *
    KeyboardInterrupt
    

    So it seems to be something in the linear algebra; my guess is the Hessian update shown in the traceback. This might mean the limited-memory version of the optimizer (L-BFGS) would work better. I tried the other optimizers, and they are broken, as noted in Issue #213.
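
    A minimal sketch of how to pin down where the time within a step goes, using scipy directly rather than Amp (the Rosenbrock function below is just a stand-in for Amp's get_loss/get_lossprime): wrap the objective and gradient with timers and use fmin_bfgs's per-iteration callback. Whatever part of the step wall time is not accounted for by the wrapped calls is spent inside the optimizer itself, e.g. in the dense Hessian update seen in the traceback.

    import time
    import numpy as np
    from scipy.optimize import fmin_bfgs, rosen, rosen_der

    def timed(fn, label, records):
        """Wrap fn so the duration of each call is appended to records."""
        def wrapper(x):
            t0 = time.time()
            value = fn(x)
            records.append((label, time.time() - t0))
            return value
        return wrapper

    records = []
    last = [time.time()]

    def per_step(xk):
        # Called once per BFGS iteration; the gap between successive calls,
        # minus the recorded loss/gradient times, is time spent inside
        # fmin_bfgs itself (the inverse-Hessian update scales as O(n**2)).
        now = time.time()
        print('step wall time: %.3f s' % (now - last[0]))
        last[0] = now

    x0 = np.zeros(2000)
    fmin_bfgs(timed(rosen, 'loss', records), x0,
              fprime=timed(rosen_der, 'lossprime', records),
              callback=per_step, maxiter=20)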

    After we fix #213, we should probably figure out a better default optimizer to take care of this problem.
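
    One candidate is scipy's limited-memory variant; below is a minimal sketch with scipy's own interface (the Amp wiring through the regressor's optimizer/optimizer_kwargs hooks visible in the traceback would need to match the actual amp.regression.Regressor signature, so it is left out here):

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b, rosen, rosen_der

    x0 = np.zeros(2000)
    # m is the number of stored correction pairs; each update costs roughly
    # O(m * n) instead of the O(n**2) dense inverse-Hessian update in BFGS.
    xopt, fval, info = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der, m=10)
    print(fval, info['funcalls'], info['warnflag'])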

  2. andrew_peterson reporter

    I just installed a newer scipy (1.1.0 instead of 0.19.1) on my local machine and the problem essentially went away, so it seems scipy has fixed something upstream. Should we require scipy >= 1.1.0?
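
    If we do require it, a small runtime guard could complement the pin in setup.py's install_requires (a sketch, assuming we would rather warn than hard-fail on older installations):

    import warnings
    from distutils.version import LooseVersion
    import scipy

    if LooseVersion(scipy.__version__) < LooseVersion('1.1.0'):
        warnings.warn('scipy %s detected; fmin_bfgs-based training steps may '
                      'be intermittently slow. scipy >= 1.1.0 is recommended.'
                      % scipy.__version__)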
