Reproducibility in training

Issue #103 new
andrew_peterson repo owner created an issue

It seems we have some issue with reproducibility during training. Below is a minimal script.

from ase.build import fcc111
from ase.visualize import view
from ase.calculators.emt import EMT
from ase.io import Trajectory

from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork

# Create test data.
traj = Trajectory('train.traj', 'w')
atoms = fcc111('Cu', size=(2,2,1))
atoms.set_calculator(EMT())
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].z += 0.5
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].y += 0.3
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].x += -0.33
atoms.get_potential_energy()
traj.write(atoms)

# Train a calculator.
calc = Amp(descriptor=Gaussian(), model=NeuralNetwork(),
           label='01', dblabel='amp')
calc.train('train.traj')

# Now re-train from this one.
calc = Amp.load('01-initial-parameters.amp', label='02', dblabel='amp')
calc.train('train.traj')

# One more time (to eliminate saving/loading data as cause).
calc = Amp.load('01-initial-parameters.amp', label='03', dblabel='amp')
calc.train('train.traj')

The results are of course different every time this is run (due to a different initial-parameter guess in the first calculator). But I am not even seeing a consistent number of steps to convergence, even though calcs 02 and 03 start from the same saved parameters: calc 01 took 1128 steps, 02 took 778, and 03 took 793. So it is not just a simple matter of losing some precision on writing to / reading from disk.

What could be causing this? Is there some randomness in the scipy optimizer?
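
As a sanity check on that last question, here is a standalone sketch (not Amp code) suggesting that the scipy optimizer itself should be deterministic for identical inputs:

# Run the same scipy minimization twice from identical inputs and compare
# the results bit-for-bit (illustrative; Amp's regression uses its own settings).
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
r1 = minimize(rosen, x0, method='BFGS')
r2 = minimize(rosen, x0, method='BFGS')
print(np.array_equal(r1.x, r2.x), r1.nit == r2.nit)  # expected: True True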

Comments (16)

  1. andrew_peterson reporter

    I tried the above script with cores=1. Now calc 02 and calc 03 appear identical. So perhaps there is some precision loss when passing numbers around with ZMQ? (Note calc 01 is still different from calc 02; this could be caused by precision loss when writing to disk.)

  2. andrew_peterson reporter

    We think we have worked it out. ZMQ's send_pyobj command first pickles an object before sending it. If the object is a numpy array, there is no precision loss (we think), as it is kept as a binary representation. However, if we pickle a float, it is first converted to a base-10 string:

    >>> import pickle
    >>> import numpy
    >>> a = 1./3.
    >>> pickle.dumps(a)
    'F0.3333333333333333\n.'
    >>> pickle.dumps(numpy.array(a))
    "cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(tcnumpy\ndtype\np6\n(S'f8'\np7\nI0\nI1\ntp8\nRp9\n(I3\nS'<'\np10\nNNNI-1\nI-1\nI0\ntp11\nbI00\nS'UUUUUU\\xd5?'\np12\ntp13\nb."
    

    Further (worse?), we are using ase.calculators.calculator.Parameters.tostring at various points in the code, which allows quickly serializing a calculator. Inside ASE, it uses '%r' to represent items in the dictionary, which can lead to even fewer digits:

    >>> '%r' % a
    '0.3333333333333333'
    >>> '%r' % numpy.array(a)
    'array(0.3333333333333333)'
    >>> '%r' % numpy.array([a])
    'array([ 0.33333333])'
    

    If reproducibility is important, we need to go through the code and find all these instances. Presumably this would solve the problem?
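
    For reference, a quick standalone check (not Amp code) of which text representations round-trip a double exactly:

    a = 1. / 3.
    print(float(repr(a)) == a)           # True: repr of a Python float round-trips
    print(float.fromhex(a.hex()) == a)   # True: hex is a lossless text form
    print(float('%.8f' % a) == a)        # False: 8 digits, like numpy's default repr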

  3. andrew_peterson reporter

    Note this is closely related to Issue #73. To address it, we need to:

    • Look at every send_pyobj in the code and make sure all floats are numpy arrays (see the sketch below this list).
    • Figure out a replacement for the ase Parameters dictionary.
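
    For the first item, a minimal sketch of the idea (the wrap_floats helper is hypothetical, not existing Amp code): recursively wrap bare floats in numpy scalars before handing an object to send_pyobj, so that pickling keeps a binary representation.

    import numpy as np

    def wrap_floats(obj):
        """Recursively replace bare python floats with numpy float64 scalars,
        which pickle to their exact binary representation."""
        if isinstance(obj, float):
            return np.float64(obj)
        if isinstance(obj, dict):
            return dict((k, wrap_floats(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return type(obj)(wrap_floats(v) for v in obj)
        return obj

    # e.g., socket.send_pyobj(wrap_floats(payload)) instead of socket.send_pyobj(payload)
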
  4. andrew_peterson reporter

    As for the Parameters replacement, I wonder if we need to do something like this answer on stackoverflow?

    I'm not sure why the string has to be turned into a json object in the above example though -- couldn't the string itself be the output?
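
    A rough sketch of what a json-based replacement could look like (the to_jsonable helper and the example parameters below are illustrative, not existing code); json writes floats with repr(), which round-trips exactly:

    import json
    import numpy as np

    def to_jsonable(obj):
        """Convert numpy arrays to nested lists so the dictionary can be
        dumped with json; plain floats keep full precision."""
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, dict):
            return dict((k, to_jsonable(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return [to_jsonable(v) for v in obj]
        return obj

    params = {'weights': np.array([1. / 3., 2. / 3.]), 'activation': 'tanh'}
    text = json.dumps(to_jsonable(params))              # string written to disk
    assert json.loads(text)['weights'][0] == 1. / 3.    # no precision lost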

  5. Muammar El Khatib

    I was working today on #103. I was fortunate to do a calculation, using the script posted in the report, that gave me correct results (reproducibility was achieved, yay! although I don’t know why). Then I performed another calculation with the same script for which things did not work as expected (no reproducibility).

    I proceeded to compare the log files between the second and third trainings (I hope the wording I am using is correct). I noted that when EnergyRMSE is “perturbed” too much, the number of iterations is very different and the parameters contained in the *.amp files after convergence are completely different from one another.

    Grepping here and there, I was able to print the loss function in amp/model/__init__.py, defined as loss = results['loss'], and the reference one, energy_loss = results['energy_loss']. Is that correct?

    Attached you will find two files corresponding to the values of these keys for the second and third training calculations. When inspecting them (with a diff tool like meld), you will see that there are significant fluctuations until after iteration 60!

    My question is: in your experience, what could cause the predicted and reference loss energies in the LossFunction class to differ that much between two successive training processes? Maybe something is being read incorrectly? In the calculation that worked as expected there are no such differences in the EnergyRMSE values; both log files are identical, except for the time stamps of course.

    Do we stick with the precision hypothesis? I ask because, after reading stackoverflow, it seems that precision is not lost when using the pickle module.

    I will try to check that all floats passed to each send_pyobj call are numpy arrays.

    Thanks,

  6. andrew_peterson reporter

    We can get more precision in the tostring functions by first running numpy.set_printoptions(precision=16), to get 16 digits of precision instead of 8, for example.

    I can see more precision in the resulting text files when I do this, but so far I haven't solved the problem with this approach.
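
    A quick standalone illustration of the effect (not the tostring code itself):

    import numpy
    a = numpy.array([1. / 3.])
    print('%r' % a)                       # array([ 0.33333333]) with default options
    numpy.set_printoptions(precision=16)
    print('%r' % a)                       # array([ 0.3333333333333333])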

  7. andrew_peterson reporter

    One more data point: I tried the example script from my first post on this thread with fortran=False and cores=1. I get the same qualitative result as with fortran=True and cores=1: that is, runs 2 and 3 are the same, but they differ from run 1. So it looks like passing data between fortran and python is not an issue (or at least not the only issue). This still points to the saving of model parameters to disk, and, by extension, the related tostring method that is used to pass models between processes.

  8. andrew_peterson reporter

    @akhorshi :

    I noticed that the saved calculators produced with converged parameters (e.g., "01.amp") versus initial parameters (e.g., "01-initial-parameters.amp") differ slightly in their formatting. Specifically, the model weights exist as a standard dictionary in the initial parameters but as an OrderedDict in the converged parameters. I also see in the checkpoint files that checkpoint 0 has a standard dictionary and checkpoint 100 (and presumably thereafter) has an OrderedDict.

    I would have expected them to be the same. Do you know why this is the case? It doesn't seem likely to be the cause of the bug at hand, but because the bug at hand involves these files we should make sure they are consistent.
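
    For what it's worth, a dict and an OrderedDict holding the same items compare equal even though their text representations differ, which is consistent with this being a formatting inconsistency rather than the bug itself (toy weights below):

    from collections import OrderedDict
    weights = {1: [[0.1, -0.2], [0.3, 0.4]]}             # toy layer weights
    print(weights == OrderedDict(weights))               # True: same contents
    print(repr(weights) == repr(OrderedDict(weights)))   # False: different formatting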

  9. andrew_peterson reporter

    More slow progress: I'm now pretty certain it has to do with writing out the calculator parameters to the text format. I was able to make my example above reproducible (on 1 core) by first modifying where the Amp code writes out the initial parameters so that it also provides them in pickle format. In amp.model.neuralnetwork.NeuralNetwork:

    ...
        def get_loss(self, vector):
            """
            Method to be called by the regression master.
            Takes one and only one input, a vector of parameters.
            Returns one output, the value of the loss (cost) function.
    
            :param vector: Parameters of the regression model in the form of a
                           list.
            :type vector: list
            """
            if self.step == 0:
                filename = make_filename(self.parent.label,
                                         '-initial-parameters.amp')
                filename = self.parent.save(filename, overwrite=True)
                #FIXME/ap: kludge
                import pickle
                with open('%s-modelparameters.pickle' % self.parent.label, 'w') as f:
                    pickle.dump(self.parameters, f)
    ...
    

    Then I modified my above script to have a calculator (labeled 05) like below:

    import pickle

    calc = Amp.load('01-initial-parameters.amp', label='05', dblabel='amp', cores=1)
    with open('01-modelparameters.pickle', 'r') as f:
        p = pickle.load(f)
    calc.model.parameters = p
    calc.train('train.traj')
    

    Note that the Amp.load loads both the descriptor and model parameters (along with their classes), while the calc.model.parameters = p overrides the loaded model parameters with those from the pickle file. So it can only be that there is some difference in these two parameters dictionaries that leads to this issue.

    So I tried beefing up the digits output in the text file with numpy.set_printoptions(precision=30). This solved the problem for the cores=1 case. :)

    However, the problem still persists if I switch to parallel mode, with wide differences between all otherwise-identical runs.

  10. Muammar El Khatib

    Today I have been normalizing everything to precision=30. I think the multi-process case is starting to "reproduce" the results between the second and third calculations :). The differences now start at the second decimal place (before, the numbers were completely different). The problem is still not solved, but we will get there!

    Note: when using more than 2 processors, the errors are larger. Note 2: the tostring method was also set to 30 digits.

  11. Alireza Khorshidi

    A bit more of an update on this: if we turn off fortran in the above script, it is still not reproducible, so it seems to have nothing to do with the fortran part. I am trying to reproduce the issue in a less elaborate system (fewer images, fewer atoms, fewer NN parameters) so that we can track it down more easily.

  12. Alireza Khorshidi

    Here is a less elaborate example which is not reproducible:

    from ase.build import fcc111
    from ase.visualize import view
    from ase.calculators.emt import EMT
    from ase.io import Trajectory
    
    from amp import Amp
    from amp.descriptor.gaussian import Gaussian
    from amp.model.neuralnetwork import NeuralNetwork
    from amp.model import LossFunction
    
    convergence = {'energy_rmse': 0.022,
                   'energy_maxresid': 10.**10.,
                   'force_rmse': 10.**(10.),
                   'force_maxresid': 10.**10., }
    
    # Create test data.
    traj = Trajectory('train.traj', 'w')
    atoms = fcc111('Cu', size=(2, 2, 1))
    atoms.set_calculator(EMT())
    atoms.get_potential_energy()
    traj.write(atoms)
    atoms[0].z += 0.5
    atoms.get_potential_energy()
    traj.write(atoms)
    atoms[0].y += 0.3
    atoms.get_potential_energy()
    traj.write(atoms)
    atoms[0].x += -0.33
    atoms.get_potential_energy()
    traj.write(atoms)
    
    # Train a calculator.
    # calc = Amp(descriptor=Gaussian(),
    #           model=NeuralNetwork(hiddenlayers = {'Cu': (2,),},),
    #           label='01', dblabel='amp',
    #           cores={'localhost': 1},
    #           )
    # calc.train('train.traj')
    
    # Now re-train from this one.
    calc = Amp.load('01-initial-parameters.amp', label='02', dblabel='amp',
                    cores={'localhost': 1},
                    )
    calc.descriptor.fortran = False
    calc.model.fortran = False
    lossfunction = LossFunction(convergence=convergence,
                                force_coefficient=0,
                                )
    calc.model.lossfunction = lossfunction
    calc.train('train.traj',
               train_forces=False,
               )
    
    # One more time (to eliminate saving/loading data as cause).
    calc = Amp.load('01-initial-parameters.amp', label='03', dblabel='amp',
                    cores={'localhost': 1},
                    )
    calc.descriptor.fortran = False
    calc.model.fortran = False
    lossfunction = LossFunction(convergence=convergence,
                                force_coefficient=0,
                                )
    calc.model.lossfunction = lossfunction
    calc.train('train.traj',
               train_forces=False,
               )
    

    The initial parameters are attached below.

  13. Alireza Khorshidi

    Like Andy has already observed, the inconsistency happens when the order of images changes. Studying the case of pure python with cores={'localhost': 1} and train_forces=False more closely, I am seeing that the inconsistency between calc label '02' and calc label '03' starts from this line, with a 10^(-16) deviation in "dloss_dparameters" (which then propagates through the optimizer, the next round of parameters, the next round of losses, and so on). It seems that the sum is not invariant under permutation of its terms: floating-point addition is not associative, so accumulating the same terms in a different order can give a slightly different result.

    I then googled "high precision summation"; there are a few discussions of how to do it, and a few routines for high-precision summation have been suggested here. I tried the lsum routine and it works better (though not perfectly) than simply using "+=".
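
    A minimal illustration of the order dependence, with math.fsum standing in for the lsum routine as an exactly rounded sum that is immune to it:

    import math
    import random

    random.seed(0)
    terms = [random.uniform(-1., 1.) * 10. ** random.randint(-8, 8)
             for _ in range(1000)]
    shuffled = list(terms)
    random.shuffle(shuffled)

    print(sum(terms) == sum(shuffled))              # typically False: '+=' order matters
    print(math.fsum(terms) == math.fsum(shuffled))  # True: exactly rounded sum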

    That being said, we now need to decide how to proceed: 1- One way is to increase the precision of our sums (only sums over images should matter, since I cannot think of any randomness other than the order of images; is there anything else with a random order?). We would need to do this in both python and fortran, and I would guess it will not make the code totally reproducible, but it will improve it (i.e. we will see deviations only after a larger number of steps). 2- The other way, which should make the code totally reproducible, is to change the order of images at the beginning of training to a fixed order, e.g. the alphabetical order of their hashes (see the sketch below).
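
    A sketch of option 2 (toy hashes and losses, illustrative only): iterate over the images in sorted-hash order so that every run accumulates the per-image losses in the same sequence:

    images = {'b41f20': 0.13, '09ac77': -0.72, '7e2c05': 0.05}  # toy hash -> loss
    loss = 0.
    for hash_ in sorted(images.keys()):   # fixed, alphabetical order of hashes
        loss += images[hash_]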

    What do you think?
