In principle, could the fingerprint derivatives be parallelized within each image? Since the calculation loops over atoms, it seems like it should be embarrassingly parallel: each worker would just send the fingerprint-derivative components it has calculated back to the master.
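To make the idea concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. Everything in it is hypothetical: `atom_derivative` is a toy stand-in (the gradient of a sum of Gaussians of the pair distances), not the package's actual fingerprint code, and `parallel_derivatives`, `eta`, and `n_workers` are names I made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import math


def atom_derivative(task):
    """Toy per-atom derivative (hypothetical stand-in for the real fingerprint code).

    Returns the gradient, with respect to atom `index`, of
    sum_j exp(-eta * |r_index - r_j|^2).
    """
    index, positions, eta = task
    xi, yi, zi = positions[index]
    grad = [0.0, 0.0, 0.0]
    for j, (xj, yj, zj) in enumerate(positions):
        if j == index:
            continue
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        r2 = dx * dx + dy * dy + dz * dz
        # d/dr_i exp(-eta * r2) = -2 * eta * (r_i - r_j) * exp(-eta * r2)
        weight = -2.0 * eta * math.exp(-eta * r2)
        grad[0] += weight * dx
        grad[1] += weight * dy
        grad[2] += weight * dz
    return index, grad


def serial_derivatives(positions, eta=0.5):
    """Reference implementation: plain loop over atoms."""
    return [atom_derivative((i, positions, eta))[1]
            for i in range(len(positions))]


def parallel_derivatives(positions, eta=0.5, n_workers=4,
                         executor_cls=ProcessPoolExecutor):
    """Same result as serial_derivatives, with one task per atom.

    Each worker returns (index, gradient); the master slots the gradient
    into place. Pass executor_cls=ThreadPoolExecutor for a quick check
    that does not spawn processes.
    """
    tasks = [(i, positions, eta) for i in range(len(positions))]
    results = [None] * len(positions)
    with executor_cls(max_workers=n_workers) as pool:
        for index, grad in pool.map(atom_derivative, tasks):
            results[index] = grad
    return results
```

With the default process pool, the call has to sit under an `if __name__ == "__main__":` guard on platforms that spawn workers. Each task also ships the full position list to its worker, so for small images the communication cost could easily outweigh the gain.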
Then, when we are running on 8 or 16 cores, the time for a force call could be cut by up to a factor of 8 or 16, for example from 30 seconds to about 2 seconds on 16 cores. Does that sound right?
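As a rough check on those numbers, here is a small sketch based on Amdahl's law. The function name and the `parallel_fraction` parameter are my own, and the fraction of the force call that is actually spent in the atom loop is an assumption I have not measured.

```python
def estimated_time(serial_time, n_workers, parallel_fraction=1.0):
    """Amdahl's-law estimate of wall time on n_workers cores.

    parallel_fraction is the share of the serial run time that can be
    split across workers (1.0 is the ideal, fully parallel case).
    """
    serial_part = 1.0 - parallel_fraction
    return serial_time * (serial_part + parallel_fraction / n_workers)
```

In the ideal case, `estimated_time(30, 16)` is 30/16, just under 2 seconds, and `estimated_time(30, 8)` is 30/8, just under 4 seconds. Any part of the call that stays serial eats into that quickly, so the factor of 8 or 16 is an upper bound.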
Of course, this is just for force calls; in training mode it makes sense to do the easier task of parallelizing over images.