Running out of memory on HPC resources with a large SGP #346

potus28 · 2023-02-12T16:33:13Z

Describe the bug
The system I am trying to simulate is quite large with 516 atoms total of furfural, furfural derivatives, surface adsorbed hydrogen, and molecular hydrogen over a molybdenum carbide surface. While I can run the on-the-fly simulation for 2432 steps with 180 calls to DFT, after the 180th call I get this error message when the program tries to update the SGP:

 >> tail flare.out
type 8 forces mae: 0.2971 eV/A
type 8 forces mav: 1.6452 eV/A
type 1 forces mae: 0.1482 eV/A
type 1 forces mav: 1.0923 eV/A
type 42 forces mae: 0.1378 eV/A
type 42 forces mav: 0.9085 eV/A
type 6 forces mae: 0.1719 eV/A
type 6 forces mav: 1.2120 eV/A
Adding atom [395, 172, 284, 460, 472, 20, 422, 505, 486, 388, 206, 448, 511, 358, 401, 490, 154, 419, 430, 399, 415, 473, 433, 461, 389, 441, 446, 470, 436, 340, 390, 437, 440] to the training set.
Uncertainty: [0.02000318 0.         0.        ]

 >> tail flare.e8963376
/var/spool/slurmd/job8963376/slurm_script: line 36: 270007 Killed                  python run.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=8963376.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I'm a bit stuck on what I can do to simulate this system for longer. Any help or suggestions for what I can do to simulate this system for at least 10000 frames (5ps) without running out of memory would be much appreciated. Thanks!

To Reproduce
Steps to reproduce the behavior:

Contents of run.py

import numpy as np

from ase import units, Atoms
from ase.io import read, write
from ase.io.trajectory import Trajectory
from ase.spacegroup import crystal
from ase.build import surface, add_adsorbate
from ase.visualize import view
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution, Stationary, ZeroRotation

from flare.bffs.gp import GaussianProcess
from flare.utils.parameter_helper import ParameterHelper
from flare.bffs.mgp import MappedGaussianProcess
from flare.bffs.gp.calculator import FLARE_Calculator
from flare.learners.otf import OTF

import flare.bffs.sgp._C_flare as flare_pp
from flare.bffs.sgp.sparse_gp import SGP_Wrapper
from flare.bffs.sgp.calculator import SGP_Calculator

from ase.calculators.espresso import Espresso, EspressoProfile
from ase.calculators.mycalculators import *

np.random.seed(12345)

dft_calc = CP2KCalculator(ecut=400, functional="rVV10", scf=30, orbital_transform=True)

# Create the system
def add_molecules(atoms, adsorbate, n=1, tolerance=2.0):
    """Try to insert n number of probe molecules"""

    cell = np.array(atoms.get_cell())
    anorm = np.sum(cell[:,0])
    bnorm = np.sum(cell[:,1])
    cnorm = np.sum(cell[:,2])

    adsorbate.set_cell(None)
    adsorbate.center()
    n_adsorbate_atoms = len(adsorbate)
    for i in range(n):
        phi, theta, psi = 360.0*np.random.rand(3)
        adsorbate.euler_rotate(phi, theta, psi, center="COM")
        overlap = True
        while overlap:
            n_atoms = len(atoms)
            r = np.matmul(cell, np.random.rand(3))
            adsorbate.center()
            adsorbate.translate(r)
            print(r)

            # No other atoms are in the box
            if len(adsorbate) - len(atoms)  == len(adsorbate):
                print("No overlap")
                overlap = False

            # Loop over all positions in the box and check for overlap
            else:
                noverlaps = 0
                for r in atoms.positions:
                    for rtest in adsorbate.positions:
                        dx = r[0] -rtest[0]
                        dy = r[1] -rtest[1]
                        dz = r[2] -rtest[2]

                        dx -= anorm*np.rint(dx / anorm)
                        dy -= bnorm*np.rint(dy / bnorm)
                        dz -= cnorm*np.rint(dz / cnorm)

                        dr = np.sqrt(dx**2 + dy**2 + dz**2)
                        if dr < tolerance:
                            noverlaps +=1
                if noverlaps ==  0:
                    overlap = False

        atoms += adsorbate



# Create initial structure.
vac_size = 10.0
nx = 4
ny = 5
atoms = surface(read("../../../resources/mp-1221498-Mo2C.cif"), indices=(1, 0, 1), layers=6, vacuum=vac_size)
atoms.pbc=True

add_adsorbate(atoms, "H", 1.5, offset=(0.5,0.5))

atoms_1 = atoms.repeat((nx,ny,1))

add_adsorbate(atoms_1, read("../../../resources/FAL.xyz"), 4.0, offset=(0.5,0.5))
add_molecules(atoms_1, read("../../../resources/FAL.xyz"), n=1, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/MF.xyz"), n=2, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/FOL.xyz"), n=2, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/H2.xyz"), n=32, tolerance=3.0)

# Make the cell upper triangular.
pos_1 = atoms_1.positions
cell_1 = np.array(atoms_1.cell)
symbols_1 = atoms_1.symbols

positions = np.zeros(pos_1.shape)
cell = np.zeros((3, 3))
n_atoms = len(atoms_1)

cell[0, 0] = cell_1[2, 2]
cell[1, 1] = cell_1[1, 1]
cell[1, 2] = cell_1[1, 0]
cell[2, 2] = cell_1[0, 0]

for n in range(n_atoms):
    pos = pos_1[n]
    new_pos = np.zeros(3)
    new_pos[0] = pos[2]
    new_pos[1] = pos[1]
    new_pos[2] = pos[0]
    positions[n] = new_pos

atoms = Atoms(symbols=symbols_1, positions=positions, cell=cell, pbc=True)
n_atoms = len(atoms)
species_map = {6: 0, 42: 1, 8: 2, 1: 3}  # Oxygen (atomic number 8) is species 2
single_atom_energies ={0:-146.53823039765, 1:-1850.7999738766, 2:-431.82189234201, 3:-12.731577079986}

# Randomly jitter the atoms to give nonzero forces in the first frame.
jitter_factor = 0.1
for atom_pos in atoms.positions:
    for coord in range(3):
        atom_pos[coord] += (2 * np.random.random() - 1) * jitter_factor

# MD Settings
temperature = 1500
timestep = 0.0005  # ps
number_of_steps = 100000
externalstress = 0
ttime = 25 * units.fs
ptime = 75*units.fs
bm = 368.75 * units.GPa  # O-Mo2C:314.31 :: H-Mo2C:368.75

if ptime is not None:
    pfactor = bm * ptime**2
else:
    pfactor = None

md_engine = "NPT"
md_dict = {
    "temperature_K": temperature,
    "externalstress": externalstress,
    "ttime": ttime,
    "pfactor": pfactor,
    "mask": (0, 1, 1),
}

MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)
Stationary(atoms, preserve_temperature=False)
ZeroRotation(atoms, preserve_temperature=False)

# Create sparse GP model.
# For hyperparameters, we need to optimize cutoffs, power, N, and lmax
cutoff = 4.25  # in A
sigma = 2.0  # in eV
power = 2  # power of the dot product kernel
kernel = flare_pp.NormalizedDotProduct(sigma, power)
cutoff_function = "quadratic"
many_body_cutoffs = [cutoff]
radial_basis = "chebyshev"
radial_hyps = [0., cutoff]
cutoff_hyps = []
n_species = len(species_map)
N = 8
lmax = 3
descriptor_settings = [n_species, N, lmax]
descriptor_calculator = flare_pp.B2(
  radial_basis,
  cutoff_function,
  radial_hyps,
  cutoff_hyps,
  descriptor_settings
)

# Set the noise values.
sigma_e = 0.001 * n_atoms  # eV (1 meV/atom)
sigma_f = 0.05  # eV/A
sigma_s = 0.0006  # eV/A^3 (about 0.1 GPa)

# Choose uncertainty type.
# Other options are "DTC" (Deterministic Training Conditional) or
# "SOR" (Subset of Regressors).
variance_type = "local"  # Compute uncertainties on local energies (normalized)

# Choose settings for hyperparameter optimization.
max_iterations = 10  # Max number of BFGS iterations during optimization
opt_method = "L-BFGS-B"  # Method used for hyperparameter optimization

# Bounds for hyperparameter optimization.
# Keeps the energy noise from going to zero.
bounds = [(sigma, None), (sigma_e, None), (sigma_f, None), (sigma_s, None)]

# Create a model wrapper that is compatible with the flare code.
gp_model = SGP_Wrapper(
        [kernel],
        [descriptor_calculator],
        cutoff,
        sigma_e,
        sigma_f,
        sigma_s,
        species_map,
        variance_type=variance_type,
    energy_training=True,
    force_training=True,
    stress_training=True,
        single_atom_energies=single_atom_energies,
        opt_method=opt_method,
        bounds=bounds,
        max_iterations=max_iterations)

# Create an ASE calculator based on the GP model.
flare_calculator = SGP_Calculator(gp_model)

init_atoms = list(range(n_atoms))  # Initial environments to include in the sparse set
output_name = 'flare'              # Name of the output file
std_tolerance_factor = -0.2       # Uncertainty tolerance for calling DFT ( 0.01 for 1 species, 0.05 for 2, 0.1 for 3, etc
train_hyps = (20,30)               # Train hyperparameters optimization within this many DFT calls
min_steps_with_model = 2          # Min number of steps between DFT calls
update_style = "threshold"        # Strategy for adding sparse environments
update_threshold = 0.02           # Threshold for determining which sparse environments to add
force_only = False                 # Use only forces for training, when False also include energies and stress (Default is True)


otf_params = {
        'init_atoms': init_atoms,
        'output_name': output_name,
        'train_hyps': train_hyps,
    'std_tolerance_factor': std_tolerance_factor,
    'min_steps_with_model': min_steps_with_model,
    'update_style': update_style,
        'force_only': force_only,
    'update_threshold': update_threshold,
        'write_model': 3
}

otf = OTF(
        atoms,
        dt = timestep,
        number_of_steps = number_of_steps,
        flare_calc = flare_calculator,
        dft_calc = dft_calc,
        md_engine = md_engine,
        md_kwargs = md_dict,
        **otf_params
        )

otf.run()

Contents of run.slurm

#!/bin/bash
#SBATCH --job-name=flare
#SBATCH --output=flare.o%j
#SBATCH --error=flare.e%j
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=40
#SBATCH --time=48:00:00
#SBATCH --partition=400p48h
#SBATCH --qos=funded
#SBATCH --exclusive
. "$CONDA_PREFIX/etc/profile.d/conda.sh"
ulimit -s unlimited
export OMP_NUM_THREADS=1
echo "SLURM TASKS: $SLURM_NTASKS"

module purge
module load gcc/11.3.0
module load openmpi/4.0.4
module load contrib/0.1
module load cp2k/2023.1
module load mkl/2020.2

export ASE_CP2K_COMMAND="srun cp2k_shell.psmp"

export LD_PRELOAD=$MKLROOT/lib/intel64/libmkl_core.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_sequential.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_def.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_avx2.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_intel_lp64.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_intel_thread.so


python run.py

If you hit the scheduler's wall time, the simulation can be restarted with:

from flare.learners.otf import OTF
otf = OTF.from_checkpoint("flare_checkpt.json")
otf.run()

The simulation runs successfully until the 2432 step or the 180th DFT call.

Expected behavior
I expect the simulation to run for at least 10000 frames with no memory issues.

Desktop (please complete the following information):

OS: CentOS
Version 7.6
Memory per node: 192 GB

The text was updated successfully, but these errors were encountered:

YuuuXie · 2023-02-12T16:44:05Z

Hi @potus28 , for now this out-of-memory error can only be solved by switching to a node with bigger memory. We have an MPI version of SGP implemented, which allows to scale to multiple nodes, however, that is only supported for offline training, and have not been integrated to on-the-fly training yet. We are still exploring other options but currently not developed yet for users usage.

I think you can try a few more fresh start with different conditions/initializations, and collect some data. Afterwards, you can use the offline training on all the data you have collected from multiple on-the-fly trainings. Then you can either try the offline training as shown in our colab tutorial or our beta version of the MPI SGP if needed.

potus28 · 2023-02-13T15:22:59Z

Hi @YuuuXie , thank you so much for the reply and your suggestions. Just to make sure I understand, the offline training that you are referring to is from section 4 of this tutorial. For doing this, I would create an extxyz file that contains ground truth coordinates and forces for each frame that I want to include in the SGP. After creating the extxyz file, I would do the training as instructed in the tutorial. Is this correct? Thanks!

YuuuXie · 2023-02-13T15:55:51Z

@potus28 yes exactly.

potus28 · 2023-02-13T15:59:37Z

Awesome! Thank you so much and have a great day!

taidbui · 2024-05-02T10:14:12Z

Hi @YuuuXie,

In the Section 4 of the tutorial (mentioned above by @potus28), how do we do the optimisation for hyper-parameters after the offline training step with large values of train_hyps? Much appreciated!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running out of memory on HPC resources with a large SGP #346

Running out of memory on HPC resources with a large SGP #346

potus28 commented Feb 12, 2023

YuuuXie commented Feb 12, 2023

potus28 commented Feb 13, 2023

YuuuXie commented Feb 13, 2023

potus28 commented Feb 13, 2023

taidbui commented May 2, 2024

Running out of memory on HPC resources with a large SGP #346

Running out of memory on HPC resources with a large SGP #346

Comments

potus28 commented Feb 12, 2023

YuuuXie commented Feb 12, 2023

potus28 commented Feb 13, 2023

YuuuXie commented Feb 13, 2023

potus28 commented Feb 13, 2023

taidbui commented May 2, 2024