Reproducibility in training
It seems we have an issue with reproducibility during training. Below is a minimal script.
from ase.build import fcc111
from ase.visualize import view
from ase.calculators.emt import EMT
from ase.io import Trajectory
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork
# Create test data.
traj = Trajectory('train.traj', 'w')
atoms = fcc111('Cu', size=(2,2,1))
atoms.set_calculator(EMT())
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].z += 0.5
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].y += 0.3
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].x += -0.33
atoms.get_potential_energy()
traj.write(atoms)
# Train a calculator.
calc = Amp(descriptor=Gaussian(), model=NeuralNetwork(),
label='01', dblabel='amp')
calc.train('train.traj')
# Now re-train from this one.
calc = Amp.load('01-initial-parameters.amp', label='02', dblabel='amp')
calc.train('train.traj')
# One more time (to eliminate saving/loading data as cause).
calc = Amp.load('01-initial-parameters.amp', label='03', dblabel='amp')
calc.train('train.traj')
The results are of course different every time this is run (due to a different initial-parameters guess in the first calculator). But I am not even seeing a consistent number of steps to converge in each of these: calc 01 took 1128 steps, 02 took 778, and 03 took 793. So it is not just a simple matter of losing some precision on writing to / reading from disk.
What could be causing this? Is there some randomness in the scipy optimizer?
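(An aside on the initial guess: one way to make run 01's starting point repeatable, assuming NeuralNetwork draws its initial weights from numpy.random, would be to seed the RNG before constructing the first calculator. A minimal sketch, with an arbitrarily chosen seed:

import numpy
numpy.random.seed(42)  # fix the RNG before building the first Amp calculator
calc = Amp(descriptor=Gaussian(), model=NeuralNetwork(),
           label='01', dblabel='amp')
calc.train('train.traj')

That would not explain the 02-vs-03 discrepancy, though, since those start from identical saved parameters.)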
Comments (16)
-
reporter I tried the above script with cores=1. Now calc 02 and calc 03 appear identical. So perhaps there is some precision loss when passing numbers around with ZMQ? (Note calc 01 is still different from calc 02; this could be caused by precision loss when writing to disk.)
reporter We think we have worked it out. ZMQ's send_pyobj command first pickles an object before sending it. If the object is a numpy array, there is no precision loss (we think), as it is kept in a binary representation. However, if we pickle a bare float it is first converted to a base-10 string:

>>> import pickle
>>> import numpy
>>> a = 1. / 3.
>>> pickle.dumps(a)
'F0.3333333333333333\n.'
>>> pickle.dumps(numpy.array(a))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(tcnumpy\ndtype\np6\n(S'f8'\np7\nI0\nI1\ntp8\nRp9\n(I3\nS'<'\np10\nNNNI-1\nI-1\nI0\ntp11\nbI00\nS'UUUUUU\\xd5?'\np12\ntp13\nb."
Further (worse?), we are using ase.calculators.calculator.Parameters.tostring at various points in the code, which allows quickly serializing a calculator. Inside ASE, it uses '%r' to represent items in the dictionary, which can lead to even fewer digits:

>>> '%r' % a
'0.3333333333333333'
>>> '%r' % numpy.array(a)
'array(0.3333333333333333)'
>>> '%r' % numpy.array([a])
'array([ 0.33333333])'
If reproducibility is important, we need to go through the code and find all these instances. Presumably this would solve the problem?
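For what it's worth, a quick check of which text representations round-trip exactly (plain Python/numpy, nothing Amp-specific; the array-repr parsing below is only for illustration):

import numpy

a = 1. / 3.

# repr() of a Python float round-trips exactly:
assert float(repr(a)) == a

# The default numpy repr keeps only 8 significant digits, so text
# written this way cannot be read back exactly:
s = '%r' % numpy.array([a])                    # "array([ 0.33333333])"
truncated = float(s[s.index('[') + 1:s.index(']')])
assert truncated != a

# float.hex() / float.fromhex() are a lossless text representation:
assert float.fromhex(a.hex()) == a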
-
reporter Note this is very related to Issue #73. To address this, we need to:
- Look at every send_pyobj in the code and make sure all floats are numpy arrays (see the sketch after this list).
- Figure out a replacement for the ase Parameters dictionary.
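As a sketch of the first item (an assumed convention, not current Amp code), the wrapping could look like this before any send_pyobj call:

import numpy

def as_exact(value):
    """Wrap bare floats in 0-d numpy arrays so that pickling (and hence
    ZMQ's send_pyobj) keeps the exact binary representation."""
    if isinstance(value, float):
        return numpy.array(value)
    return value

# usage:  socket.send_pyobj(as_exact(loss))  instead of  socket.send_pyobj(loss)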
-
reporter As for the Parameters replacement, I wonder if we need to do something like this answer on stackoverflow?
I'm not sure why the string has to be turned into a json object in the above example though -- couldn't the string itself be the output?
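One point in favor of a json-based format: Python's json module writes floats at repr precision, so a round trip through text is exact. A minimal check:

import json

a = 1. / 3.
assert json.loads(json.dumps({'weight': a}))['weight'] == a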
-
I was working today in #103. I was fortunate to run a calculation, using the script posted in the report, that gave me correct results (reproducibility was achieved, yay! although I don't know why). Then I ran another calculation with the same script for which things did not work as expected (no reproducibility).

I proceeded to compare the log files between the second and third training (I hope the wording I am using is correct). I noted that when EnergyRMSE is "perturbed" too much, the number of iterations is very different, and the parameters contained in *.amp after convergence are completely different from the others.

Grepping here and there, I was able to print the loss function in amp/model/__init__.py, defined as loss = results['loss'], and the reference one, energy_loss = results['energy_loss']. Is that correct?

Attached you will find two files with the values of those keys for the second and third training calculations. When inspecting them (with a diff tool like meld), you will see that up until after iteration 60 there are important fluctuations!

My question is, in your experience: what could cause the predicted and reference loss energies to be that different in the class LossFunction for two successive training processes? Maybe something is being read incorrectly? In the calculation that worked as expected, there are no such differences in the EnergyRMSE values; both log files are identical, except for the time stamps of course.

Do we stick to the precision problem? I ask because, after reading stackoverflow, it seems that precision is not lost when using the pickle module.

I will try to test whether in each send_pyobj all floats are numpy arrays.

Thanks,
-
- attached 2loss.NOK
Second training calculation.
-
- attached 3loss.NOK
Third training calculation.
-
reporter We can get more precision in the tostring functions by first running numpy.set_printoptions(precision=16), to get 16 digits of precision instead of 8, for example.

I can see more precision in the resulting text files when I do this, but so far I haven't solved the problem with this approach.
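For illustration, the effect of that print option on the repr that tostring relies on:

import numpy

a = numpy.array([1. / 3.])
print('%r' % a)                # array([ 0.33333333])        (8 digits)
numpy.set_printoptions(precision=16)
print('%r' % a)                # array([ 0.3333333333333333])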
-
reporter One more point of data: I tried the example script from my first post on this thread with fortran=False and cores=1. I get the same qualitative result as with fortran=True and cores=1: that is, runs 2 and 3 are the same, but they differ from run 1. So it looks like passing data between fortran and python is not an issue (or at least not the only issue). This still points to the saving of model parameters to disk, and by extension the related tostring method that is used to pass models between processes.
-
reporter @akhorshi:
I noticed that the saved calculators produced with converged parameters (e.g., "01.amp") versus initial parameters (e.g., "01-initial-parameters.amp") differ slightly in their formatting. Specifically, the model weights exist as a standard dictionary in the initial parameters but as an OrderedDict in the converged parameters. I also see in the checkpoint files that checkpoint 0 has a standard dictionary and checkpoint 100 (and presumably thereafter) has OrderedDict.
I would have expected them to be the same. Do you know why this is the case? It doesn't seem likely to be the cause of the bug at hand, but because the bug at hand involves these files we should make sure they are consistent.
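A quick sanity check of that suspicion (plain Python, nothing Amp-specific): a dict and an OrderedDict holding the same items compare equal as mappings, even though their text representations differ, which supports the idea that this formatting difference alone is not the bug:

from collections import OrderedDict

d = {'weights': [0.1, 0.2], 'biases': [0.0]}
od = OrderedDict(d)
assert d == od    # equal as mappings
print(repr(d))    # {'weights': [0.1, 0.2], 'biases': [0.0]}
print(repr(od))   # OrderedDict([('weights', [0.1, 0.2]), ('biases', [0.0])])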
-
reporter More slow progress: I'm now pretty certain it has to do with writing out the calculator parameters to the text format. I was able to make my example above reproducible (on 1 core) by first modifying where the Amp code writes out the initial parameters so that it also provides them in pickle format. In amp.model.neuralnetwork.NeuralNetwork:

...
def get_loss(self, vector):
    """Method to be called by the regression master.

    Takes one and only one input, a vector of parameters.
    Returns one output, the value of the loss (cost) function.

    :param vector: Parameters of the regression model in the form of a list.
    :type vector: list
    """
    if self.step == 0:
        filename = make_filename(self.parent.label,
                                 '-initial-parameters.amp')
        filename = self.parent.save(filename, overwrite=True)
        # FIXME/ap: kludge
        import pickle
        with open('%s-modelparameters.pickle' % self.parent.label, 'w') as f:
            pickle.dump(self.parameters, f)
...
Then I modified my above script to have a calculator (labeled 05) like below:
calc = Amp.load('01-initial-parameters.amp', label='05', dblabel='amp',
                cores=1)
with open('01-modelparameters.pickle', 'r') as f:
    p = pickle.load(f)
calc.model.parameters = p
calc.train('train.traj')
Note that Amp.load loads both the descriptor and model parameters (along with their classes), while the calc.model.parameters = p line overrides the loaded model parameters with those from the pickle file. So it can only be that some difference between these two parameter dictionaries leads to this issue.

So I tried beefing up the digits output in the text file with numpy.set_printoptions(precision=30). This solved the problem for the cores=1 case. :)

However, the problem still persists if I switch to parallel mode, with wide differences between all otherwise identical runs.
-
Today I have been normalizing to precision=30. I think multi-process is starting to "reproduce" the results in the second and third calculations :). The differences now start from the second decimal (before, the numbers were completely different). The problem is still not solved, but we will get there!

Note: when using more than 2 processors, the errors are larger.
Note 2: the tostring method was also set to 30 digits.
-
A bit more of an update on this: if we turn off fortran in the above script, it is still not reproducible. So it seems that it has nothing to do with the fortran part. I am trying to reproduce this issue in a less elaborate system (fewer images, fewer atoms, fewer NN parameters) so that we can track the issue a bit more.
-
Here is a less elaborate example which is not reproducible:
from ase.build import fcc111
from ase.visualize import view
from ase.calculators.emt import EMT
from ase.io import Trajectory
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork
from amp.model import LossFunction

convergence = {'energy_rmse': 0.022,
               'energy_maxresid': 10.**10.,
               'force_rmse': 10.**10.,
               'force_maxresid': 10.**10.,
               }

# Create test data.
traj = Trajectory('train.traj', 'w')
atoms = fcc111('Cu', size=(2, 2, 1))
atoms.set_calculator(EMT())
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].z += 0.5
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].y += 0.3
atoms.get_potential_energy()
traj.write(atoms)
atoms[0].x += -0.33
atoms.get_potential_energy()
traj.write(atoms)

# Train a calculator.
# calc = Amp(descriptor=Gaussian(),
#            model=NeuralNetwork(hiddenlayers={'Cu': (2,)}),
#            label='01', dblabel='amp',
#            cores={'localhost': 1},
#            )
# calc.train('train.traj')

# Now re-train from this one.
calc = Amp.load('01-initial-parameters.amp', label='02', dblabel='amp',
                cores={'localhost': 1},
                )
calc.descriptor.fortran = False
calc.model.fortran = False
lossfunction = LossFunction(convergence=convergence,
                            force_coefficient=0,
                            )
calc.model.lossfunction = lossfunction
calc.train('train.traj', train_forces=False)

# One more time (to eliminate saving/loading data as cause).
calc = Amp.load('01-initial-parameters.amp', label='03', dblabel='amp',
                cores={'localhost': 1},
                )
calc.descriptor.fortran = False
calc.model.fortran = False
lossfunction = LossFunction(convergence=convergence,
                            force_coefficient=0,
                            )
calc.model.lossfunction = lossfunction
calc.train('train.traj', train_forces=False)
The initial parameters are attached below.
-
- attached 01-initial-parameters.amp
-
Like Andy has already observed, the inconsistency happens when the order of images changes. Studying the case of pure python with cores={'localhost': 1} and train_forces=False further, I am seeing that the inconsistency between calc label '02' and calc label '03' starts from this line, with a 10^(-16) deviation in "dloss_dparameters" (which then propagates through the optimizer, the next round of parameters, the next round of losses, and so on). The culprit is that floating-point summation is not invariant under permutation: adding two floats is commutative, but summing three or more terms is not associative, so accumulating the same per-image terms in a different order can give a slightly different total.

I then googled "high precision summation"; there are a few discussions of how to do it, and a few routines for high-precision summation have been suggested here. I tried the lsum routine, and it works better (though not perfectly) than a simple "+=".

That said, we now need to decide how to proceed:

1. One way is to increase the precision of our sums (only sums over images should matter, since I cannot think of any randomness other than the order of images -- is there anything else with random order?). We would need to do it both in python and in fortran, and I would guess it will not make the code totally reproducible, but it will improve it (i.e., we would see deviations only after a larger number of steps). See the sketch after this comment.

2. The other way, which should make the code totally reproducible, is to fix the order of images at the beginning of training, e.g. to the alphabetical order of their hashes.

What do you think?
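A minimal demonstration of the order dependence described in option 1, using made-up stand-ins for the per-image loss contributions; math.fsum is shown as one exactly-rounded alternative (the lsum recipe mentioned above behaves similarly):

import math
import random

# Hypothetical stand-ins for per-image loss contributions.
terms = [random.uniform(-1.0, 1.0) for _ in range(10000)]
shuffled = list(terms)
random.shuffle(shuffled)

# Plain accumulation depends on the summation order (non-associativity):
print(sum(terms) == sum(shuffled))               # frequently False
# math.fsum tracks the rounding error and returns the exactly rounded
# sum, so it is independent of the order of the terms:
print(math.fsum(terms) == math.fsum(shuffled))   # always True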