
MLIP Training Tutorial

This tutorial demonstrates how to use ChemRefine to train a Machine Learning Interatomic Potential (MLIP) using DFT data generated during the workflow.

Overview

Training an MLIP involves generating reference data, running the training process, and validating the trained model on new configurations.
ChemRefine automates this multi-step process:

  1. Global Optimization (GOAT)
    Performs a global search of the PES to identify low-energy conformers.

  2. Normal Mode Sampling (NMS)
    Generates additional diverse geometries by displacing atoms along vibrational modes (see the short note on normal mode sampling after this list).

  3. Reference DFT Optimizations (OPT+SP)
    Provides high-quality energies and forces for MLIP training.

  4. MLIP Training (MLFF_TRAIN)
    Trains a potential (e.g., MACE) on the generated DFT dataset. This tutorial uses MACE because, as of this writing (v1.2.1), ChemRefine can only train or fine-tune MACE models. The step requires a MACE input YAML; an explanation can be found here, and a minimal sketch is given after this list.

  5. MLIP Validation (OPT+SP with MLFF)
    Applies the trained model to evaluate new structures, testing its accuracy and efficiency.
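
In normal mode sampling, each new training geometry is obtained by perturbing an optimized structure along its vibrational normal modes, loosely x_new = x_eq + Σ_k c_k Q_k, where the Q_k are the normal-mode displacement vectors and the coefficients c_k set the displacement amplitudes. With the calc_type: "random" settings used later in this tutorial, the coefficients are presumably drawn at random, with the overall magnitude controlled by displacement_vector and the number of perturbed copies by num_random_displacements. The goal is to expose the MLIP to non-equilibrium geometries and forces, not just minima.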

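The MLFF_TRAIN step reads a MACE input YAML whose keys mirror the options of the MACE training script (mace_run_train). The snippet below is only a minimal sketch, assuming a recent MACE release: the key names follow the MACE documentation, but the values and file paths are illustrative, and ChemRefine may fill in some of them (such as the training-file location) itself. The name and seed chosen here are what give a checkpoint file such as goat_model_run-123_stagetwo.model, which step 5 of this tutorial loads.

name: "goat_model"
seed: 123
model: "MACE"
train_file: "train.xyz"    # extended-XYZ file with the DFT energies and forces from steps 1-3 (illustrative path)
valid_fraction: 0.1        # fraction of the data held out for validation
energy_key: "REF_energy"   # property names in the training file; defaults vary between MACE versions
forces_key: "REF_forces"
r_max: 5.0                 # radial cutoff in Angstrom
max_num_epochs: 500
batch_size: 5
device: "cuda"             # set to "cpu" if no GPU is available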

Prerequisites

  • Installed ChemRefine (see Installation Guide)
  • Access to an ORCA executable (for DFT reference calculations)
  • Example molecule and YAML input from the repository

Input Files

We start with an initial structure, ./templates/step1.xyz, located in the templates folder.

ORCA Input Files

You can find the ORCA input files here


An interactive 3D viewer of the initial structure is embedded in the online version of this page.

YAML Configuration

The full YAML input for this MLIP training workflow is included:

➡️ Examples/Tutorials/MLIP-Training/input.yaml

Download the template files here

Example content:

orca_executable: /mfs/io/groups/sterling/software-tools/orca/orca_6_1_0_avx2/orca
charge: 0
multiplicity: 1

initial_xyz: ./templates/step1.xyz

steps:
  - step: 1
    operation: "GOAT"
    engine: "DFT"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 15

  - step: 2
    operation: "OPT+SP"
    engine: "DFT"
    normal_mode_sampling: True
    normal_mode_sampling_parameters:
      calc_type: "random"
      displacement_vector: 1.0
      num_random_displacements: 1
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0

  - step: 3
    operation: "OPT+SP"
    engine: "DFT"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0

  - step: 4
    operation: "MLFF_TRAIN"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0

  - step: 5
    operation: "OPT+SP"
    engine: "MLFF"
    mlff:
      model_name: "../step3/checkpoints_dir/goat_model_run-123_stagetwo.model"
      task_name: "mace_off"
      device: "cuda"
    sample_type:
      method: "integer"
      parameters:
        num_structures: 0
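
Reading the workflow top to bottom: steps 1-3 build the DFT reference set (GOAT conformer search, normal-mode-sampled geometries, and the final OPT+SP energies and forces), step 4 trains the MACE potential on that data, and step 5 switches the engine to MLFF, reloading the trained checkpoint through model_name so the new optimizations run with the freshly trained potential instead of DFT. The device: "cuda" setting assumes a GPU is available for the MLFF step; it can typically be set to "cpu" instead.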

How to Run

Before running ChemRefine, ensure that:

  • The ChemRefine environment is activated
  • The ORCA executable path is correct
  • The template directory (./templates/) contains the initial structure
  • The YAML config matches your dataset and workflow

Option 1: Run from the Command Line

chemrefine input.yaml --maxcores <N>

Here <N> is the maximum number of cores you want ChemRefine to use simultaneously.

Option 2: Run with SLURM Script

On HPC systems with SLURM, submit the training workflow as a batch script:

➡️ Example ChemRefine SLURM script

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G
#SBATCH --time=72:00:00
#SBATCH --job-name=mlip_training
#SBATCH --output=%x.out
#SBATCH --error=%x.err

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

chemrefine input.yaml --maxcores 8
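
Save the script (for example as mlip_training.sh; the name is illustrative) and submit it with sbatch. Adjust the partition, memory, and wall time to your cluster, and keep the --maxcores value consistent with the resources you request.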