MLIP Training Tutorial
This tutorial demonstrates how to use ChemRefine to train a Machine Learning Interatomic Potential (MLIP) using DFT data generated during the workflow.
Overview
Training an MLIP involves generating reference data, running the training process, and validating the trained model on new configurations.
ChemRefine automates this multi-step process:
- Global Optimization (GOAT): performs a global search of the potential energy surface (PES) to identify low-energy conformers.
- Normal Mode Sampling (NMS): generates additional diverse geometries by displacing atoms along vibrational modes.
- Reference DFT Optimizations (OPT+SP): provides high-quality energies and forces for MLIP training.
- MLIP Training (MLFF_TRAIN): trains a potential on the generated DFT dataset. As of writing (ChemRefine v1.2.1), ChemRefine can only train or fine-tune MACE models, so MACE is used in this tutorial. MACE itself also requires an input YAML; its options are explained in the MACE documentation, and a minimal sketch is shown after this list.
- MLIP Validation (OPT+SP with MLFF): applies the trained model to evaluate new structures, testing its accuracy and efficiency.
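For reference, the lines below sketch what such a MACE input YAML might contain. The key names mirror the options of MACE's run_train script; the exact keys supported (and the values that make sense for your dataset) depend on your MACE version, so treat this purely as an illustration and consult the MACE documentation.
name: "goat_model"
model: "MACE"
train_file: "train.xyz"      # DFT energies and forces collected in the earlier steps
valid_fraction: 0.1
energy_key: "energy"
forces_key: "forces"
r_max: 5.0
batch_size: 8
max_num_epochs: 200
swa: True                    # stage-two (SWA) training; the flag name varies between MACE versions
device: "cuda"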
Prerequisites
- Installed ChemRefine (see Installation Guide)
- Access to an ORCA executable (for DFT reference calculations)
- Example molecule and YAML input from the repository
Input Files
We start with an initial structure located in the templates folder:
ORCA Input Files
You can find the ORCA input files here
YAML Configuration
The full YAML input for this MLIP training workflow is included:
➡️ Examples/Tutorials/MLIP-Training/input.yaml
Download the template files here
Example content:
orca_executable: /mfs/io/groups/sterling/software-tools/orca/orca_6_1_0_avx2/orca
charge: 0
multiplicity: 1
initial_xyz: ./templates/step1.xyz
steps:
  - step: 1
    operation: "GOAT"
    engine: "DFT"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 15
  - step: 2
    operation: "OPT+SP"
    engine: "DFT"
    normal_mode_sampling: True
    normal_mode_sampling_parameters:
      calc_type: "random"
      displacement_vector: 1.0
      num_random_displacements: 1
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0
  - step: 3
    operation: "OPT+SP"
    engine: "DFT"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0
  - step: 4
    operation: "MLFF_TRAIN"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0
  - step: 5
    operation: "OPT+SP"
    engine: "MLFF"
    mlff:
      model_name: "../step3/checkpoints_dir/goat_model_run-123_stagetwo.model"
      task_name: "mace_off"
      device: "cuda"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0
How to Run
Before running ChemRefine, ensure that:
- The ChemRefine environment is activated
- The ORCA executable path is correct
- The template directory (./templates/) contains the initial structure
- The YAML config matches your dataset and workflow
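The first three points can be checked quickly from the shell. The paths below are the ones used in this tutorial's YAML; substitute your own:
# Verify that the initial structure and the ORCA binary referenced in input.yaml exist
test -f ./templates/step1.xyz && echo "initial structure found"
test -x /mfs/io/groups/sterling/software-tools/orca/orca_6_1_0_avx2/orca && echo "ORCA executable found"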
Option 1: Run from the Command Line
chemrefine input.yaml --maxcores <N>
Here <N> is the maximum number of cores ChemRefine is allowed to use simultaneously.
Option 2: Run with SLURM Script
On HPC systems with SLURM, submit the training workflow as a batch script:
➡️ Example ChemRefine SLURM script
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G
#SBATCH --time=72:00:00
#SBATCH --job-name=mlip_training
#SBATCH --output=%x.out
#SBATCH --error=%x.err
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
chemrefine input.yaml --maxcores 8
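Save the script under any name (for example mlip_training.sh) and submit it with:
sbatch mlip_training.sh
Depending on how your cluster allocates resources, you may want the #SBATCH requests to be consistent with the --maxcores value passed to ChemRefine.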