IOError: [Errno 2] No such file or directory: 'tfAmpNN-checkpoint' with tensorflow 0.12.1

Issue #124 resolved
Muammar El Khatib created an issue

I installed the latest TensorFlow version (0.12.1) with pip and tried to work on #111 on my laptop. There were some problems caused by the latest cores/parallel implementation. I fixed them in a local branch; the attached diff (tflowfix.diff) shows the changes.

However, there is the following problem now:

/home/muammar/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/muammar/.local/lib/python2.7/site-packages/numpy/core/numeric.py:190: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  a = empty(shape, dtype, order)
Traceback (most recent call last):
  File "test_gaussian_tflow.py", line 63, in <module>
    train_test()
  File "test_gaussian_tflow.py", line 56, in train_test
    calc.train(images=train_images,)
  File "/home/muammar/quimica_pura/posgrado/Postdoc/brown/git/amp/amp/__init__.py", line 308, in train
    parallel=self._parallel)
  File "/home/muammar/quimica_pura/posgrado/Postdoc/brown/git/amp/amp/model/tflow.py", line 654, in fit
    with open('tfAmpNN-checkpoint') as fhandle:
IOError: [Errno 2] No such file or directory: 'tfAmpNN-checkpoint'

I am trying to understand why tfAmpNN-checkpoint is not being created with TensorFlow 0.12.1. Going back to TensorFlow 0.10.0 makes the problem go away:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0-cp27-none-linux_x86_64.whl
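A likely explanation, offered as an assumption rather than a confirmed diagnosis: starting with TensorFlow 0.12, `tf.train.Saver` writes checkpoints in the V2 format by default. V2 splits a checkpoint with prefix `P` into `P.index` and `P.data-00000-of-00001` (plus, by default, `P.meta`), so no file named exactly `P` is created, whereas the older V1 format wrote a single file at `P`. The stdlib-only sketch below checks which layout a prefix produced; `checkpoint_layout` is a hypothetical helper, not part of Amp or TensorFlow.

```python
import os


def checkpoint_layout(prefix):
    """Report which on-disk layout exists for a checkpoint `prefix`.

    Returns 'v1' if a single file named exactly `prefix` exists,
    'v2' if an index file plus at least one data shard exist,
    and None if neither is present. Hypothetical helper.
    """
    if os.path.isfile(prefix):
        return 'v1'  # old layout: one file at the exact path
    directory = os.path.dirname(prefix) or '.'
    base = os.path.basename(prefix)
    names = os.listdir(directory)
    has_index = (base + '.index') in names
    has_data = any(name.startswith(base + '.data-') for name in names)
    if has_index and has_data:
        return 'v2'  # new layout: <prefix>.index + <prefix>.data-*
    return None
```

Under this hypothesis, calling `checkpoint_layout('tfAmpNN-checkpoint')` after training with TensorFlow 0.12.1 would return `'v2'`, which is consistent with `open('tfAmpNN-checkpoint')` raising the `IOError` above.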

Comments

  1. Muammar El Khatib reporter

    The problem persists with TensorFlow 1.0. The attached patch is needed to port tflow.py.

  2. Muammar El Khatib reporter

    I have made the following changes to the code, which include importing the saver_pb2 module:

    diff --git a/amp/model/__init__.py b/amp/model/__init__.py
    index 69ad0f9..733e2c2 100644
    --- a/amp/model/__init__.py
    +++ b/amp/model/__init__.py
    @@ -344,7 +344,7 @@ class LossFunction:
             if fingerprintprimes is not None:
                 descriptor.fingerprintprimes = fingerprintprimes
    
    -    def _initialize(self, args):
    +    def _initialize(self, args=None):
             """Procedures to be run on the first call only, such as establishing
             SSH sessions, etc."""
             if self._initialized is True:
    diff --git a/amp/model/tflow.py b/amp/model/tflow.py
    index 48e2d53..99218f1 100644
    --- a/amp/model/tflow.py
    +++ b/amp/model/tflow.py
    @@ -15,6 +15,7 @@ import uuid
    
     from . import LossFunction
     from ..utilities import ConvergenceOccurred
    +from tensorflow.core.protobuf import saver_pb2
    
     try:
         import tensorflow as tf
    @@ -672,7 +673,7 @@ class NeuralNetwork:
    
                 # loss function, as included in model/__init__.py
                 self.energy_loss = tf.reduce_sum(
    -                tf.square(tf.div(tf.sub(self.energy, self.y_),
    +                tf.square(tf.div(tf.subtract(self.energy, self.y_),
                                      self.nAtoms_in)))
                 # Define the training step for energy training.
    
    @@ -682,7 +683,7 @@ class NeuralNetwork:
                 # force loss function, as included in model/__init__.py
                 if self.parameters['relativeForceCutoff'] is None:
                     self.force_loss = tf.reduce_sum(
    -                    tf.div(tf.square(tf.sub(self.forces_in, self.forces)),
    +                    tf.div(tf.square(tf.subtract(self.forces_in, self.forces)),
                                self.nAtoms_forces)) / 3.
                     # tf.reduce_sum(tf.div(
                     # tf.reduce_mean(tf.square(tf.sub(self.forces_in,
    @@ -692,7 +693,7 @@ class NeuralNetwork:
                     self.force_loss = \
                         tf.reduce_sum(tf.div(tf.div(
                                              tf.square(
    -                                             tf.sub(
    +                                             tf.subtract(
                                                      self.forces_in, self.forces)),
                                              tf.square(
                                                  self.forces_in) +
    @@ -707,9 +708,9 @@ class NeuralNetwork:
    
                 # Define max residuals
                 self.energy_maxresid = tf.reduce_max(
    -                tf.abs(tf.div(tf.sub(self.energy, self.y_), self.nAtoms_in)))
    +                tf.abs(tf.div(tf.subtract(self.energy, self.y_), self.nAtoms_in)))
                 self.force_maxresid = tf.reduce_max(
    -                tf.abs(tf.sub(self.forces_in, self.forces)))
    +                tf.abs(tf.subtract(self.forces_in, self.forces)))
    
                 # Define the training step for force training.
                 if self.parameters['regularization_strength'] is not None:
    @@ -1512,14 +1513,14 @@ def model(x, segmentinds, keep_prob, input_keep_prob, batchsize,
    
         # multiply through with the dg/dx tensor, and sum along the components of g
         # to get a tensor of dE/dx (one row per atom considered, second dim =3)
    -    dEdx = tf.reduce_sum(tf.mul(dEdg_arranged_tile, dgdx), 1)
    +    dEdx = tf.reduce_sum(tf.multiply(dEdg_arranged_tile, dgdx), 1)
    
         # this should be a tensor of size (total atoms in training set)x3,
         # representing the contribution of each atom to the total energy via
         # interactions with elements of the current atom type
         dEdx_arranged = tf.unsorted_segment_sum(dEdx, dgdx_Xindices, totalNumAtoms)
    
    -    return tf.mul(reducedSum, mask), dEdx_arranged, l2_regularization
    +    return tf.multiply(reducedSum, mask), dEdx_arranged, l2_regularization
     #    dEg
     #    dEjdgj1 = tf.expand_dims(dEjdgj, 1)
     #    dEjdgj2 = tf.expand_dims(dEjdgj1, 1)
    

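The renames in the diff above (`tf.sub` to `tf.subtract`, `tf.mul` to `tf.multiply`) tie tflow.py to one TensorFlow API generation. If supporting both older and newer releases is desirable, a small lookup helper can pick whichever name exists at import time. This is a minimal sketch; `first_attr` is a hypothetical name, not something in Amp or TensorFlow.

```python
def first_attr(module, *names):
    """Return the first attribute of `module` that exists among `names`.

    Hypothetical helper: first_attr(tf, 'subtract', 'sub') returns
    tf.subtract where it exists and falls back to tf.sub otherwise.
    """
    for name in names:
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError('%s has none of: %s'
                         % (getattr(module, '__name__', module), ', '.join(names)))


# Example usage inside tflow.py (assumes `import tensorflow as tf`):
# subtract = first_attr(tf, 'subtract', 'sub')
# multiply = first_attr(tf, 'multiply', 'mul')
# init_op = first_attr(tf, 'global_variables_initializer',
#                      'initialize_all_variables')()
```

Listing the newer name first means releases that provide `tf.global_variables_initializer` would also avoid the `initialize_all_variables` deprecation warning shown in the log.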
    It still fails because the checkpoint file is not being created:

    $ python test_gaussian_tflow.py
    /home/muammar/brown/git/ase/ase/lattice/surface.py:17: UserWarning: Moved to ase.build
      warnings.warn('Moved to ase.build')
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
    /home/muammar/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
    WARNING:tensorflow:From /home/muammar/brown/git/amp/amp/model/tflow.py:317: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
    Instructions for updating:
    Use `tf.global_variables_initializer` instead.
    Traceback (most recent call last):
      File "test_gaussian_tflow.py", line 84, in <module>
        train_test()
      File "test_gaussian_tflow.py", line 77, in train_test
        calc.train(images=train_images,)
      File "/home/muammar/brown/git/amp/amp/__init__.py", line 311, in train
        parallel=self._parallel)
      File "/home/muammar/brown/git/amp/amp/model/tflow.py", line 925, in fit
        with open('tfAmpNN-checkpoint') as fhandle:
    IOError: [Errno 2] No such file or directory: 'tfAmpNN-checkpoint'
    

    This happens with both TensorFlow 1.0 and the latest 1.4.0.
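One possible fix, offered as an assumption rather than the project's actual resolution: the patch imports `saver_pb2`, but none of the hunks shown use it. Constructing the saver as `tf.train.Saver(write_version=saver_pb2.SaverDef.V1)` should bring back the single-file checkpoint that `open('tfAmpNN-checkpoint')` expects. A format-independent alternative is to read every file that shares the checkpoint prefix instead of one file. The sketch below does that with the standard library only; `read_checkpoint_files` and `write_checkpoint_files` are hypothetical names, and it assumes Amp only needs the raw bytes of the checkpoint (how Amp actually stores them is not shown in this thread).

```python
import os


def read_checkpoint_files(prefix):
    """Read every file belonging to checkpoint `prefix` into a dict.

    Keys are suffixes relative to the prefix: '' for a V1 single file,
    '.index', '.data-00000-of-00001', '.meta', ... for V2 files.
    """
    directory = os.path.dirname(prefix) or '.'
    base = os.path.basename(prefix)
    blobs = {}
    for name in sorted(os.listdir(directory)):
        if name == base or name.startswith(base + '.'):
            # Checkpoints are binary, so open in 'rb' mode.
            with open(os.path.join(directory, name), 'rb') as fhandle:
                blobs[name[len(base):]] = fhandle.read()
    if not blobs:
        raise IOError('no checkpoint files found for prefix %r' % prefix)
    return blobs


def write_checkpoint_files(prefix, blobs):
    """Recreate files captured by read_checkpoint_files under `prefix`."""
    for suffix, data in blobs.items():
        with open(prefix + suffix, 'wb') as fhandle:
            fhandle.write(data)
```

Keeping the suffixes as dictionary keys lets the same code round-trip both the V1 single file and the V2 index/data pair without knowing which one the installed TensorFlow produced.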
