Job suspended with a large number of hidden-layer nodes
Hello, everyone,
I have encountered a problem when using Amp. So far I cannot tell whether this is a bug or just a machine error. I would greatly appreciate your suggestions and comments!
The problem appears when I gradually increase the number of nodes in the hidden layers, e.g. from hiddenlayers=(20,20) and hiddenlayers=(21,21) up to hiddenlayers=(36,36).
Training works with up to 32 nodes per layer, i.e. hiddenlayers=(32,32), but with 33 or more nodes the program hangs after printing the header:
                                                Energy                    Force
Step  Time                Loss (SSD)   EnergyRMSE   MaxResid     ForceRMSE    MaxResid
===== =================== ============ ============ ============ ============ ============
How can I fix this? Attached are the training set and the script I used for training.
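For reference, the training call is of roughly this form. This is a minimal sketch assuming the current Amp interface (older versions used different module paths); 'trainingset.traj' is a stand-in for the attached training set.

```python
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork

# Build a calculator with a two-hidden-layer neural network.
calc = Amp(descriptor=Gaussian(),
           model=NeuralNetwork(hiddenlayers=(33, 33)))  # hangs at >= 33 nodes
calc.train(images='trainingset.traj')  # stand-in for the attached file
```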
Thank you very much!
Geng
Comments (8)
-
repo owner
My initial guess is that you are running out of memory, but I will take a look at it.
I notice that in your script you have some code to determine the parallel configuration. What queuing system are you on? We can add this blurb to amp.utilities.assign_cores to make this automated for you and other users on that system. Do you mind if I do that?
repo owner - changed status to resolved
Fix issue #125 on truncated dictionary strings. Had to set np.set_printoptions. We probably should come up with a better solution when we address the reproducibility issue.
Also improved the logging.
→ <<cset 0805092024dd>>
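For context on the fix: numpy abbreviates long arrays when converting them to strings, which can silently corrupt a parameter dictionary serialized via str(). A minimal sketch of the behavior (illustrative, not the committed code):

```python
import numpy as np

# numpy abbreviates arrays longer than its print threshold (default 1000
# elements) with '...' when converting them to strings, so str() of a
# large parameter array silently drops entries.
weights = np.random.rand(1100)      # large enough to trigger truncation
print('...' in str(weights))        # True: the string is abbreviated

# Raising the threshold makes str() write out every element.
np.set_printoptions(threshold=np.inf)  # older numpy versions want a large int
print('...' in str(weights))        # False: the full array is emitted
```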
-
repo owner - changed status to open
- edited description
I am leaving this open until Geng can confirm this fixed the issue.
-
reporter Dear Andrew,
Thanks, I will test the new branch and report the results soon.
Of course, you can add this script to the code.
I am using UGE (Univa Grid Engine) 8.0.1.
Geng
-
repo owner I just put your core-detection patch in commit b8d95fb. Would you mind also checking if this now works as expected on your queuing system?
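For readers on similar systems, core detection under SGE/UGE-style schedulers can be sketched as below. This is an illustration built on the standard UGE environment variables PE_HOSTFILE and NSLOTS, not the code from commit b8d95fb; amp.utilities.assign_cores is the real entry point.

```python
import os

def detect_cores_uge():
    """Illustrative core detection for SGE/UGE-style schedulers."""
    hostfile = os.environ.get('PE_HOSTFILE')
    if hostfile:
        # Each line of $PE_HOSTFILE reads: hostname slots queue processor-range
        cores = {}
        with open(hostfile) as f:
            for line in f:
                if not line.strip():
                    continue
                host, slots = line.split()[:2]
                cores[host] = cores.get(host, 0) + int(slots)
        return cores
    # Serial/SMP jobs set only $NSLOTS; fall back to a single node.
    return {'localhost': int(os.environ.get('NSLOTS', 1))}
```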
-
reporter Dear Andrew,
Now Amp runs smoothly, and the automatic core detection also works well!
Geng
-
repo owner - changed status to resolved
Great!
-
Merge branch 'master' into symmetryfunctions
- master:
Fix issue #125 on truncated dictionary strings. In slurm, use 'localhost' for a single node. Should now detect multiple nodes on slurm. More sensible error message in parallel config. Convergence plot updates. Make atomic energies accessible.
→ <<cset 448d4c58024c>>
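A rough sketch of the slurm behavior described above ('localhost' on a single node, hostname-to-cores mapping otherwise). This is illustrative only, using standard slurm environment variables and scontrol rather than the merged code:

```python
import os
import subprocess

def detect_cores_slurm():
    """Illustrative slurm core detection: 'localhost' for a single
    node, per-host core counts for multiple nodes."""
    nnodes = int(os.environ.get('SLURM_NNODES', 1))
    ntasks = int(os.environ.get('SLURM_NTASKS', 1))
    if nnodes == 1:
        return {'localhost': ntasks}
    # SLURM_JOB_NODELIST is compressed (e.g. 'node[01-03,07]');
    # 'scontrol show hostnames' expands it, one host per line.
    hosts = subprocess.check_output(
        ['scontrol', 'show', 'hostnames',
         os.environ['SLURM_JOB_NODELIST']]).decode().split()
    tasks_per_node = ntasks // nnodes  # assumes an even distribution
    return {host: tasks_per_node for host in hosts}
```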