Optimization failure for anything other than BFGS algorithm

Issue #213 new
mwitman1 created an issue

Hi All,

I've recently been trying to interface Amp with scipy's implementation of Levenberg-Marquardt. While this required refactoring calculate_loss() (see Issue #134), I still ran into problems where the optimization would terminate after one or two iterations without any improvement to the loss function when running in a parallel environment. I then noticed that if I change the optimization method to anything other than BFGS in regression/__init__.py, the optimization also terminates after one iteration. Inspecting the log file, I can see:

...maximum absolute value of loss prime: 0.000e+00

which means Amp is not computing dloss_dparameters at all.

After a bit of debugging, I believe the issue comes down to how the regressor master is set up to interact with the get_loss functions. In scipy's BFGS, the derivative of the loss function, i.e. model.get_lossprime(), is evaluated first; lossprime is therefore True and the parallel environment is set up accordingly. In non-BFGS methods, however, the loss function itself, i.e. model.get_loss(), is evaluated first, so the parallel environment is set up with lossprime=False, and that flag remains unchanged in any subsequent calls to model.get_lossprime(). As a result, dloss_dparameters is never actually computed. I've also asked Efrem, and he observes the same line in the log file when using anything other than BFGS.

That's a fairly succinct description, which I'd be happy to discuss further, but if I'm understanding the issue correctly the fix shouldn't be too difficult. I think it makes sense to refactor the way Amp interfaces with scipy into a more unified approach built around scipy.optimize.minimize (the same conclusion as Issue #134). Passing scipy.optimize.minimize(..., jac=self.lossprime, ...) would then control whether dloss_dparameters is calculated and eliminate the duplicate calls to get_loss in each iteration: get_loss() in neuralnetworks.py would return (loss, dloss_dparameters) if jac=True, and just loss if jac=False.
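Roughly, the calling pattern would look something like this (a minimal sketch with a toy quadratic loss standing in for Amp's actual loss; the names are only illustrative, not Amp code):

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for Amp's loss function: a quadratic in the parameter vector.
# The point is the calling convention, not the actual Amp loss.
def get_loss(parameters, lossprime=True):
    loss = np.sum(parameters ** 2)
    if lossprime:
        dloss_dparameters = 2.0 * parameters
        return loss, dloss_dparameters  # returned together when jac=True
    return loss

x0 = np.array([1.0, -2.0, 0.5])

# With jac=True, scipy expects the objective to return (loss, gradient),
# so the gradient is set up and computed on every evaluation -- no stale
# lossprime flag and no duplicate loss calls per iteration.
result = minimize(get_loss, x0, method='CG', jac=True, args=(True,))
print(result.fun, result.x)
```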

If you'd like, I can put in a pull request showing how I would go about fixing this, and you can see whether you want to incorporate it.

Thanks, Matt

Comments (4)

  1. andrew_peterson repo owner

    Also, I've noticed that sometimes (perhaps often) the internal updating of the Hessian can be the rate-limiting step in fitting NN parameters. Monitoring some jobs, I see Amp calculating the loss function components very quickly (e.g., < 1 sec), then waiting 20 sec for scipy to update the Hessian. Updating scipy from version 0.19.1 to version 1.1.0 speeds this up dramatically; but even in 1.1.0, where the update time might drop from 20 sec to 1 sec, this can still limit the process. I haven't tested this exhaustively, but I expect it is more of a problem for large neural nets (lots of free parameters, hence a large Hessian).

    On one system, I saw the fitting time go from > 1 hour with scipy 0.19.1 to ~3 min with scipy 1.0.1.

    Solving the above issue should allow us to more easily switch optimizers and avoid such problems. (Perhaps this is part of the benefit of LM.)

    This also hints that we might want to add more internal timers to the code to diagnose where it spends its time.
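
    A minimal sketch of the kind of internal timer that could be dropped in (standard library only; not Amp's existing logging machinery):

    ```python
    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(label, log=print):
        """Log the wall time spent inside the `with` block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            log('%s: %.3f s' % (label, time.perf_counter() - start))

    # Hypothetical usage around the two steps discussed above:
    with timed('loss-function components'):
        time.sleep(0.1)   # stand-in for the parallel loss evaluation
    with timed('scipy Hessian update'):
        time.sleep(0.2)   # stand-in for the optimizer's internal update
    ```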

  2. mwitman1 reporter

    Interesting and good to know. Based on what you've seen, maybe it makes sense to just require scipy 1.1.0, as you pointed out in #193. Spending time diagnosing how the scipy algorithms perform on these NN problems definitely sounds like a good idea, since scipy is fairly black-box and the custom implementations in the literature may have features that are not immediately evident from the scipy docs.

    LM will still require some expensive matrix operations if the least-squares Jacobian (# of parameters x # of residuals) is large, but it will be good to test out.

    Going tangentially off-topic in case anyone is interested:

    (1) I find it cool that even very recently people are proposing improved LM variants: https://arxiv.org/pdf/1201.5885.pdf

    (2) There has also been a lot of recent theoretical work on the potential benefit of good 2nd-order optimizers, owing to the proliferation of saddle points in the loss surface of high-dimensional NNs. While LM doesn't use the Hessian directly, it still operates with a trust-region approach by approximating it, so this is worth thinking about for these high-dimensional NNs for chemical systems: https://arxiv.org/abs/1406.2572 and http://arxiv.org/abs/1712.07296

    This brings me to my last point: would it be interesting/feasible to take the second derivative of E_loss and F_loss with respect to the parameters? That could open the door to some of the more advanced 2nd-order optimizers, like the "trust-*" methods in scipy (see the sketch below). I'm not sure about the added computational cost, but it could be interesting, since these methods seem to be rarely used even in the ML community and may help overcome pathological features in our high-dimensional loss surfaces and reach much better force RMSEs.
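
    A minimal sketch of how such an optimizer would be called once second derivatives are available (a toy quadratic loss again, not Amp code):

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Toy loss with analytic gradient and Hessian, standing in for a loss
    # whose second derivatives w.r.t. the parameters are available.
    def loss(p):
        return np.sum(p ** 2)

    def grad(p):
        return 2.0 * p

    def hess(p):
        return 2.0 * np.eye(p.size)

    p0 = np.array([1.0, -2.0, 0.5])

    # 'trust-ncg' (and relatives such as 'trust-krylov' or 'trust-exact')
    # consume the Hessian inside a trust-region framework.
    result = minimize(loss, p0, method='trust-ncg', jac=grad, hess=hess)
    print(result.fun, result.x)
    ```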

  3. andrew_peterson repo owner

    Ah, yes, I knew I had written that optimizer stuff once before! Thanks for linking them.

    As for another level of derivatives, perhaps @akhorshi could comment?

  4. Alireza Khorshidi

    Sorry, just saw this interesting discussion going on.

    I confess that when we wrote the optimizer part of the code, we had a fairly limited set of optimizers in mind, and among those BFGS was outperforming the others, so the code was written mostly to fit that optimizer. With the new suggestions for optimizers (and I believe better-performing optimization approaches will certainly appear in the future), it is a good idea to restructure the code to accommodate new methods and approaches. So, Matt (@mwitman1), you are very welcome to put the code in a more general framework as you see fit. Please go ahead!

    As for the feasibility of calculating 2nd derivatives: with n parameters there are n 1st derivatives but n(n+1)/2 unique 2nd derivatives. For 100 parameters that is about 5,000 2nd derivatives, roughly 50 times more; for 1,000 parameters, roughly 500 times more. Moreover, if we assume a single 2nd derivative costs about twice as much to calculate as a single 1st derivative, then for n parameters the full set of 2nd derivatives is about n times slower to compute than the gradient.

    But if we see that a 2nd-order optimizer fits better than a 1st-order one, we can consider 2nd derivatives. In that case I suggest trying numerical 2nd derivatives first, for example along the lines of the sketch below.
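
    A minimal sketch of a numerical Hessian built by finite differences of an analytic gradient (a toy quadratic gradient here, not Amp's):

    ```python
    import numpy as np

    # Forward-difference Hessian: column i is the finite difference of the
    # gradient along parameter i, then symmetrized.
    def numerical_hessian(grad, x, eps=1e-6):
        n = x.size
        hess = np.empty((n, n))
        g0 = grad(x)
        for i in range(n):
            step = np.zeros(n)
            step[i] = eps
            hess[:, i] = (grad(x + step) - g0) / eps
        return 0.5 * (hess + hess.T)

    grad = lambda x: 2.0 * x          # gradient of sum(x**2)
    print(numerical_hessian(grad, np.array([1.0, -2.0, 0.5])))
    ```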
