Job suspended with a large number of hidden-layer nodes

Issue #125 resolved
Geng Sun created an issue

Hello, everyone,

I have encountered a problem while using AMP. So far I cannot tell whether this is a bug or just a problem with my computer. I would greatly appreciate your suggestions and comments!

The problem appears when I gradually increase the number of nodes in the hidden layers, e.g. from hiddenlayers=(20,20) to hiddenlayers=(21,21) and so on up to hiddenlayers=(36,36).

Training works with up to 32 nodes per layer, i.e. hiddenlayers=(32,32). But when I use more nodes than that, the program hangs after printing the header of the training log:

                                                          Energy                     Force
 Step                Time   Loss (SSD)   EnergyRMSE     MaxResid    ForceRMSE     MaxResid
===== =================== ============ ============ ============ ============ ============

So, how can I fix this? Attached are the training set and the script I used for training.

Thank you very much!

Geng

Comments (8)

  1. andrew_peterson repo owner

    My initial guess is you are running out of memory, but I will take a look at it.

    I notice that in your script you have some code to determine the parallel configuration:

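    import os

    try:
        # PE_HOSTFILE lists one host per line: "<hostname> <slots> ..."
        hostfile = os.getenv('PE_HOSTFILE')
        cores = {}
        with open(hostfile) as ifile:
            for istr in ifile:
                hostname, nc = istr.split()[0:2]
                cores[hostname] = int(nc)
    except (TypeError, IOError, ValueError):
        # TypeError: variable unset (open(None)); IOError: unreadable file;
        # ValueError: malformed line.
        raise RuntimeError("Don't know PE_HOSTFILE")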
    

    What queuing system are you on? We can add this blurb to amp.utilities.assign_cores to make this automated for you and other users on that system. Do you mind if I do that?
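
As an illustration of how the snippet above could be packaged for reuse, here is a minimal sketch of a standalone parser. It is not the code that went into amp.utilities.assign_cores; the function name `cores_from_pe_hostfile` and the sample host names are hypothetical, and it assumes the usual Grid Engine host file layout of one line per host with the hostname first and the slot count second.

```python
import os
import tempfile


def cores_from_pe_hostfile(path=None):
    """Return {hostname: slots} parsed from a Grid Engine host file.

    Hypothetical helper for illustration only. If no path is given,
    the PE_HOSTFILE environment variable is consulted.
    """
    if path is None:
        path = os.environ.get('PE_HOSTFILE')
    if path is None:
        raise RuntimeError('PE_HOSTFILE is not set')
    cores = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            cores[fields[0]] = int(fields[1])
    return cores


# Demo with made-up host names written to a temporary file.
sample = ("node01 16 all.q@node01 UNDEFINED\n"
          "node02 8 all.q@node02 UNDEFINED\n")
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write(sample)
demo_cores = cores_from_pe_hostfile(tmp.name)
os.remove(tmp.name)
```

Passing the path explicitly keeps the parser testable outside a queue job, while the environment lookup covers the normal case inside one.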

  2. andrew_peterson repo owner

    Fix issue #125 on truncated dictionary strings.

    Had to set np.set_printoptions. We probably should come up with a better solution when we address the reproducibility issue.

    Also improved the logging.

    → <<cset 0805092024dd>>

  3. andrew_peterson repo owner
    • changed status to open
    • edited description

    I am leaving this open until Geng can confirm this fixed the issue.

  4. Geng Sun reporter

    Dear Andrew,

    Thanks, I will test the new branch and report the results soon.

    Of course, you can add this script to the code.

    I am using UGE (Univa Grid Engine) 8.0.1.

    Geng

  5. andrew_peterson repo owner

    I just put your core-detection patch in commit b8d95fb. Would you mind also checking if this now works as expected on your queuing system?

  6. Geng Sun reporter

    Dear Andrew,

    AMP now runs smoothly, and the automatic core detection also works well!

    Geng

  7. Muammar El Khatib

    Merge branch 'master' into symmetryfunctions

    • master:
        Fix issue #125 on truncated dictionary strings.
        In slurm, use 'localhost' for a single node.
        Should now detect multiple nodes on slurm.
        More sensible error message in parallel config.
        Convergence plot updates.
        Make atomic energies accessible.

    → <<cset 448d4c58024c>>
