How to handle memory?

Issue #192 new
andrew_peterson repo owner created an issue

There has been a lot of discussion lately on amp-users and in person on how to handle memory for different use cases. Concerns:

  • What to hold in memory versus on disk? In some cases, we know we will only use the fingerprints once (like in making a parity plot), so keeping them in memory can cause out-of-memory errors. In other cases (training) we want to use them over and over. In still others (MD), we may want to store them neither in memory nor on disk, since we'll never use them again.

  • File database: update to SQLite? The file database is a bit of a kludge. Could we use SQLite? In that case, each row of the database would hold a pickled (or otherwise encoded) numpy array, optionally compressed right in the row itself (as opposed to compressing the whole database). This seems to work in some crude tests; a rough sketch follows the list below.

  • Tarfile is slow. Python's tarfile module makes reading and writing the file database very slow. Switching to SQLite with individually compressed entries seems like it would fix this.

  • Pickle security. Pickle is not a great format to store things in, but it's the default used by numpy to store arrays. Can we come up with anything better? We want to avoid round-off errors at all costs and keep things relatively simple; one pickle-free possibility is sketched below the list.

  • Do we need individual control of neighborlists, fingerprints, and fingerprintprimes, or would a single keyword that maps behavior for everything suffice? For example, Zack said he'd rather calculate neighborlists on the fly, as it's faster than reading them from disk, so there could be some advantages to individual control, but it's more complicated with the current code structure. Both options are sketched after the table below.

  • Does the master process hold a duplicate list? The master now passes the values to the workers -- does this double the memory requirements if the master hangs onto them? Similarly, is there any issue with f2py?

  • Does deleting objects actually free up memory in Python? Our current argument is yes: even if the interpreter does not hand freed blocks back to the operating system, the numpy arrays should be roughly the same size for basically all fingerprints, etc., so freed blocks can be reused for new ones rather than growing the footprint.

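As a rough illustration of the SQLite idea, here is a minimal sketch of a key -> blob table in which each row holds one pickled, individually zlib-compressed numpy array. The table and column names (`fingerprints`, `hash`, `data`) are placeholders, not anything in the current code:

```python
# Minimal sketch: SQLite file with one compressed, pickled array per row.
import pickle
import sqlite3
import zlib

import numpy as np


def open_db(path):
    """Create (or open) a SQLite file with a simple key -> blob table."""
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS fingerprints '
                '(hash TEXT PRIMARY KEY, data BLOB)')
    return con


def put(con, key, array):
    """Pickle and compress one numpy array, then store it as a single row."""
    blob = zlib.compress(pickle.dumps(array, protocol=pickle.HIGHEST_PROTOCOL))
    con.execute('INSERT OR REPLACE INTO fingerprints VALUES (?, ?)',
                (key, sqlite3.Binary(blob)))


def get(con, key):
    """Read one row back and reverse the compression and the pickling."""
    row = con.execute('SELECT data FROM fingerprints WHERE hash = ?',
                      (key,)).fetchone()
    return None if row is None else pickle.loads(zlib.decompress(row[0]))


if __name__ == '__main__':
    con = open_db(':memory:')   # use a filename for an on-disk database
    put(con, 'image0001', np.random.rand(8, 4))
    con.commit()
    print(get(con, 'image0001').shape)   # -> (8, 4)
```

Because each row is compressed independently, a single fingerprint can be read back without touching the rest of the database, which is exactly what the tarfile approach makes slow.
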
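On the pickle question, one possibility (again only a sketch) is to write each array in numpy's own .npy binary format into an in-memory buffer and store those bytes instead of a pickle. The format stores the raw binary values, so there is no round-off, and loading with allow_pickle=False avoids unpickling arbitrary objects:

```python
# Minimal sketch: exact, pickle-free serialization of plain numeric arrays.
import io

import numpy as np


def array_to_bytes(array):
    """Write the array in numpy's .npy format to an in-memory buffer."""
    buf = io.BytesIO()
    np.save(buf, array, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(data):
    """Recover the array; the binary format is exact, so no round-off."""
    return np.load(io.BytesIO(data), allow_pickle=False)


if __name__ == '__main__':
    original = np.random.rand(10, 3)
    restored = bytes_to_array(array_to_bytes(original))
    assert (original == restored).all()   # bit-for-bit identical
```

The bytes returned by array_to_bytes could be dropped into the data column of the SQLite sketch above in place of the pickle.
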
Use case                     Memory dictionary    Disk database
Training                     X                    X
Re-training in same run      X                    -
Validation / parity plots    -                    X
Molecular dynamics, etc.     -                    -

We are still working out a plan to implement this, but these are the issues to address.
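
To make the single-keyword versus individual-control question concrete, here is one possible shape for the interface; the keyword names and presets are hypothetical and only mirror the use cases in the table above, they are not part of the current Amp code:

```python
# Minimal sketch of a hypothetical storage-control interface.

QUANTITIES = ('neighborlists', 'fingerprints', 'fingerprintprimes')

# Option 1: a single keyword maps a use case onto behavior for everything.
PRESETS = {
    'training':    {'memory': True,  'disk': True},
    'retraining':  {'memory': True,  'disk': False},
    'parity-plot': {'memory': False, 'disk': True},
    'dynamics':    {'memory': False, 'disk': False},
}


def storage_settings(mode='training', **overrides):
    """Expand a single-keyword preset into per-quantity settings (option 1),
    while still allowing individual quantities to be overridden (option 2)."""
    settings = {quantity: dict(PRESETS[mode]) for quantity in QUANTITIES}
    for quantity, setting in overrides.items():
        settings[quantity] = setting
    return settings


if __name__ == '__main__':
    # Zack's case: train normally, but recompute neighborlists on the fly
    # instead of keeping them in memory or reading them from disk.
    settings = storage_settings(
        'training', neighborlists={'memory': False, 'disk': False})
    print(settings)
```

A preset keyword would keep the common cases simple, while per-quantity overrides would still let neighborlists be recomputed on the fly during training.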
