Running out of memory on HPC resources with a large SGP #346
Comments
Hi @potus28, for now this out-of-memory error can only be solved by switching to a node with more memory. We have implemented an MPI version of SGP that scales to multiple nodes, but it is only supported for offline training and has not yet been integrated into on-the-fly training. We are still exploring other options, but none of them is ready for users yet. I suggest trying a few more fresh starts with different conditions/initializations to collect more data. Afterwards, you can run offline training on all the data collected from the multiple on-the-fly runs, either as shown in our colab tutorial or, if needed, with our beta version of the MPI SGP.
Hi @YuuuXie, thank you so much for the reply and your suggestions. Just to make sure I understand: the offline training that you are referring to is the one from section 4 of this tutorial. To do this, I would create an extxyz file that contains the ground-truth coordinates and forces for each frame that I want to include in the SGP. After creating the extxyz file, I would do the training as instructed in the tutorial. Is this correct? Thanks!
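For readers following along, here is a minimal, dependency-free sketch of what one frame of such an extxyz file could look like. The helper name `format_extxyz_frame`, the toy H2 frame, and the choice of `forces` and `energy` as property names are assumptions made for illustration; the tutorial may expect different keys, and in practice ASE's `ase.io.write(..., format="extxyz")` is probably the more convenient route.

```python
def format_extxyz_frame(symbols, positions, forces, cell, energy=None):
    """Return one extended-XYZ frame (hypothetical helper, not part of FLARE)."""
    if not (len(symbols) == len(positions) == len(forces)):
        raise ValueError("symbols, positions and forces must have equal length")

    # The nine cell components go on the comment line, row by row.
    lattice = " ".join(f"{x:.8f}" for row in cell for x in row)
    header = (
        f'Lattice="{lattice}" '
        "Properties=species:S:1:pos:R:3:forces:R:3 "
        'pbc="T T T"'
    )
    if energy is not None:
        header += f" energy={energy:.8f}"

    # Line 1: atom count, line 2: metadata, then one line per atom.
    lines = [str(len(symbols)), header]
    for symbol, pos, force in zip(symbols, positions, forces):
        values = " ".join(f"{v:.8f}" for v in (*pos, *force))
        lines.append(f"{symbol} {values}")
    return "\n".join(lines) + "\n"


# Toy frame with made-up numbers, only to show the layout.
example_frame = dict(
    symbols=["H", "H"],
    positions=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)],
    forces=[(0.0, 0.0, 0.1), (0.0, 0.0, -0.1)],
    cell=[(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)],
    energy=-6.5,
)
print(format_extxyz_frame(**example_frame), end="")
```

Concatenating the output for every DFT-labelled frame, from one or several on-the-fly runs, gives a single multi-frame file.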
@potus28 yes, exactly.
Awesome! Thank you so much and have a great day!
Describe the bug
The system I am trying to simulate is quite large, with 516 atoms in total: furfural, furfural derivatives, surface-adsorbed hydrogen, and molecular hydrogen over a molybdenum carbide surface. While I can run the on-the-fly simulation for 2432 steps with 180 calls to DFT, after the 180th call I get this error message when the program tries to update the SGP:
I'm a bit stuck on what I can do to simulate this system for longer. Any help or suggestions for what I can do to simulate this system for at least 10000 frames (5 ps) without running out of memory would be much appreciated. Thanks!
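To give a rough sense of scale, the sketch below does a back-of-the-envelope estimate. It assumes that the model stores a dense double-precision matrix with one row per sparse environment and one column per force label, and that every DFT call contributes force labels for all 516 atoms. Both are assumptions about the internals, and the sparse-set size `n_sparse` is a purely hypothetical placeholder, not a value taken from this run.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_entry=8):
    """Memory footprint of a dense matrix, 8 bytes per float64 entry."""
    return n_rows * n_cols * bytes_per_entry


n_atoms = 516        # from the system described above
n_dft_calls = 180    # DFT calls before the failure
n_force_labels = n_dft_calls * n_atoms * 3   # x, y, z component per atom

n_sparse = 20_000    # hypothetical sparse-set size, NOT taken from the run

size = dense_matrix_bytes(n_sparse, n_force_labels)
print(f"{n_force_labels} force labels")
print(f"{size / 2**30:.1f} GiB for a {n_sparse} x {n_force_labels} matrix")
```

Under these assumptions the footprint grows with the product of the two counts, so it rises quickly as more DFT frames and sparse environments are added.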
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the simulation to run for at least 10000 frames with no memory issues.
Desktop (please complete the following information):