Amp's parallel computing is not compatible with GPAW's

Issue #110 resolved
Yin-Jia Zhang created an issue

When a computation contains both Amp and GPAW calculators, the parallel computation fails on Brown CCV, but it works when using only one core. When a computation combines Amp with another calculator, such as EMT, the parallel computation works well. On CCV, the command 'mpiexec gpaw-python (your file)' is used to run GPAW in parallel. The error points to Amp's descriptor.neighborlist.calculate_items. The error message file is attached. Thanks!

Comments (7)

  1. andrew_peterson repo owner

    I'm not too surprised -- GPAW uses a difficult parallelization scheme (in my opinion), where the entire python process is run independently 16 times, for example, if you are running on 16 cores. This makes it hard to use with other bits of python. I personally wish GPAW ran on a single thread like dacapo or Amp or FileIOCalculators do, then just splits out to parallel mode when the math gets hard.

    We should think about how to systematically address this. We may need some if rank==0 type statements in the code somewhere.

    Also, I noticed in your output that your version of Amp is probably a few months old. It crashes when using shelve, which we no longer use because it has issues with multiple processes. Can you update your Amp then give this another try? (I don't expect it will solve the problem, but will make sure we are addressing the current problem(s).)

  2. Muammar El Khatib

    I think I found a solution for my class, which I will leave here for reference:

    MPI spawns a new Python process (like a clone) for each worker, and with mpi4py one can get the rank of the "clone" that is currently running:

    # Determine this process's MPI rank; fall back to 0 (serial) without mpi4py.
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
    except ImportError:
        rank = 0
    

    Then, as suggested by @andrewpeterson, I added an if rank == 0 check around the parts causing problems, and that solved it. Should we write something about this in the documentation?
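For reference, the rank check described above could be wrapped in a small helper. This is only a sketch under assumptions: `run_on_master` and `train_model` are hypothetical names, not part of Amp or GPAW, and it assumes mpi4py's `COMM_WORLD` matches the processes started by `mpiexec gpaw-python`. It runs the wrapped function on rank 0 only and broadcasts the return value so that every rank continues with the same data.

```python
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
except ImportError:
    # No mpi4py available: behave as a single serial process.
    comm = None
    rank = 0


def run_on_master(func, *args, **kwargs):
    """Call func on rank 0 only, then share its return value with all ranks.

    Hypothetical helper for illustration; not an Amp or GPAW function.
    """
    result = func(*args, **kwargs) if rank == 0 else None
    if comm is not None:
        # bcast pickles the object on rank 0 and delivers it to every rank.
        result = comm.bcast(result, root=0)
    return result


def train_model(n_images):
    # Hypothetical stand-in for the Amp-related step that must run only once.
    return {"trained_on": n_images}


summary = run_on_master(train_model, 5)
```

Broadcasting the result is a design choice: it keeps the other ranks from continuing with a missing or stale value, at the cost of requiring the result to be picklable.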

  3. andrew_peterson repo owner

    It's really more of a GPAW thing than an Amp thing. GPAW doesn't work with any python scripts unless they are specially written to be running on n cores at the same time.

    Can you give an idea of what your script was trying to do?

  4. Muammar El Khatib

    I have implemented the algorithm for accelerating NEB calculations proposed in your paper. The training phase was failing because the whole script was launched with gpaw-python. So, inside the class I created rank as a global variable, and if rank != 0, nothing related to Amp is executed.
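A class-level variant of this guard might look like the sketch below. It is not the code from the implementation mentioned above: `RankGuardedTrainer` and its `train` callable are hypothetical, the rank is stored as an attribute rather than a global, and a barrier is added on the assumption that the other ranks should wait until rank 0 has finished training before continuing.

```python
try:
    from mpi4py import MPI
    COMM = MPI.COMM_WORLD
except ImportError:
    # No mpi4py available: treat the run as serial.
    COMM = None


class RankGuardedTrainer:
    """Hypothetical wrapper: only rank 0 runs the training callable."""

    def __init__(self, train, comm=COMM):
        self.train = train  # stand-in for the Amp training call
        self.comm = comm
        self.rank = comm.Get_rank() if comm is not None else 0
        self.calls = 0

    def step(self, images):
        if self.rank == 0:
            # Amp-related work happens on the master process only.
            self.train(images)
            self.calls += 1
        if self.comm is not None:
            # All ranks wait here until rank 0 has finished training.
            self.comm.Barrier()


seen = []
trainer = RankGuardedTrainer(train=seen.append)
trainer.step(["image0", "image1"])
```

Without the barrier, ranks other than 0 would move on immediately, which matters if they later read anything that the training step writes to disk.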
