db and simultaneous runs

Issue #9 resolved
andrew_peterson repo owner created an issue

I think there is an issue when simultaneous runs write to the same database (of fingerprints, etc.) via the dblabel keyword. I anticipate this will be fairly common, since a normal mode of operation is to submit several jobs at once to see which one trains best. For example, I saw the following error when attempting this.

Process Process-1:
Traceback (most recent call last):
  File "/gpfs/runtime/opt/python/2.7.3/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/gpfs/runtime/opt/python/2.7.3/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/users/ap31/usr/svn/amp/amp/__init__.py", line 2324, in _calculate_der_fingerprints
    n_symbols = [atoms[n_index].symbol for n_index in n_self_indices]
  File "/users/ap31/usr/svn/ase/ase/ase/atoms.py", line 863, in __getitem__
    raise IndexError('Index out of range.')
IndexError: Index out of range.
Traceback (most recent call last):
  File "/var/spool/slurmd/job9788798/slurm_script", line 27, in <module>
    global_search=None)
  File "/users/ap31/usr/svn/amp/amp/__init__.py", line 866, in train
    _mp, io, data_format, save_memory)
  File "/users/ap31/usr/svn/amp/amp/__init__.py", line 1914, in __init__
    data_format)
  File "/users/ap31/usr/svn/amp/amp/utilities.py", line 681, in read
    c.execute("SELECT * FROM fingerprint_derivatives")
sqlite3.OperationalError: no such table: fingerprint_derivatives

Comments (5)

  1. Alireza Khorshidi

    We are not yet writing to a single database; we still take the same approach as before: each process writes its portion to a temporary child database, and all the child databases are then read and unified into a single database in the job directory.
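
The child-database approach described above can be sketched roughly as follows. This is only an illustration in Python 3, not the code used in amp: the function name merge_child_databases, the table name fingerprints, and its three columns are all hypothetical stand-ins for whatever schema the real databases use.

```python
import sqlite3

# Hypothetical schema, assumed only for this sketch.
SCHEMA = "CREATE TABLE IF NOT EXISTS fingerprints (image TEXT, atom_index INTEGER, value REAL)"


def merge_child_databases(child_paths, merged_path):
    """Copy all rows from each child database into one merged database.

    Returns the total number of rows copied.
    """
    merged = sqlite3.connect(merged_path)
    total = 0
    try:
        merged.execute(SCHEMA)
        for path in child_paths:
            # Read one child at a time so only one extra file is open.
            child = sqlite3.connect(path)
            try:
                rows = child.execute(
                    "SELECT image, atom_index, value FROM fingerprints"
                ).fetchall()
            finally:
                child.close()
            merged.executemany("INSERT INTO fingerprints VALUES (?, ?, ?)", rows)
            total += len(rows)
        # A single commit at the end makes the merge all-or-nothing.
        merged.commit()
    finally:
        merged.close()
    return total
```

Because only the merging step touches the final database, the worker processes never compete for a write lock on it.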

    I checked whether any problem arises when several jobs read from a common database simultaneously, and saw none.

    Do you remember where the issue you brought up above happened? Maybe my recent commits have fixed it. Could we give it another shot?

  2. andrew_peterson reporter

    Yes, I think I recall the use case. Since it is often difficult to find parameters to make the model fit, it can be good to train several unique calculators simultaneously. This is what I was doing on CCV. That is, submit the same script several times, but have each use the same dblabel to avoid re-calculating the fingerprints over and over. It should work fine if I submit one script, wait for it to finish fingerprinting, then submit the rest. However, that is not very convenient.
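
One way to avoid the manual "submit one, wait, then submit the rest" workflow is to let the jobs coordinate through the filesystem. The following Python 3 sketch is an assumption about how that could look, not something amp provides: compute_once, the .lock and .done file suffixes, and the compute callback are hypothetical names. It relies on os.open with O_CREAT | O_EXCL, which atomically fails if the file already exists.

```python
import os
import time


def compute_once(label, compute, poll_interval=1.0):
    """Run compute() in exactly one of several cooperating jobs.

    Returns True if this job did the work, False if another job did.
    """
    lock_path = label + ".lock"   # hypothetical marker: work in progress
    done_path = label + ".done"   # hypothetical marker: work finished
    if os.path.exists(done_path):
        return False
    try:
        # Atomic create: only one job can succeed in creating the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another job holds the lock; wait until it reports completion.
        # A production version would also need a timeout here.
        while not os.path.exists(done_path):
            time.sleep(poll_interval)
        return False
    os.close(fd)
    try:
        compute()
    except Exception:
        # Release the lock so a resubmitted job can try again.
        os.remove(lock_path)
        raise
    with open(done_path, "w"):
        pass
    return True
```

With something like this, every job could be submitted at once with the same dblabel: the first to start would calculate the fingerprints and the others would wait and then only read.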

  3. Alireza Khorshidi

    I tried five jobs simultaneously reading from a single database at:

    /gpfs/data/ap31/akhorshi/Prof. Peterson's Research/scratch/Pd/5000-images/03

    and did not see any error. However, I suspect that multiple processes are not allowed to write simultaneously to a single sqlite3 database, as will be needed in save_memory mode; I should find a solution to that.
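
On the write side, SQLite allows only one writer at a time, but other writers can be told to wait for the lock instead of failing immediately. The Python 3 sketch below shows that pattern with the standard sqlite3 module; it is not the solution adopted in amp, and the function name write_rows and the fingerprints table layout are hypothetical. Note also that SQLite's locking depends on the filesystem's file locks, which may be unreliable on some network filesystems, so this would need testing on the cluster in question.

```python
import sqlite3


def write_rows(db_path, rows, timeout=30.0):
    """Append rows to a shared database, waiting up to `timeout` seconds
    for other writers to release the lock."""
    # isolation_level=None puts the connection in autocommit mode so the
    # transaction below is controlled explicitly.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the write lock up front, so waiting happens
        # here rather than in the middle of the inserts.
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS fingerprints "
                "(image TEXT, atom_index INTEGER, value REAL)"  # hypothetical schema
            )
            conn.executemany("INSERT INTO fingerprints VALUES (?, ?, ?)", rows)
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```

If the lock cannot be obtained within the timeout, sqlite3 raises OperationalError ("database is locked"), which the caller can catch and retry.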
