Automated Workflow for Conformer Sampling and Refinement.
This repository contains streamlined Python code for an automated ORCA workflow covering conformer sampling, transition-state (TS) finding, and refinement with DFT and machine-learning interatomic potentials (MLIPs). The code progressively refines the level of theory without manual intervention, and it integrates state-of-the-art MLIPs that can be accessed through ORCA inputs. It is intended for HPC systems that use the SLURM submission system. From a single YAML input file, ChemRefine submits the calculations for each step, applies a sampling method to select the favored conformations, and then refines them with more precise methods.
Features
- Automated workflow for conformer sampling and refinement
- Progressive refinement of computational level across multiple steps
- Intelligent sampling with multiple selection algorithms (energy window, Boltzmann, integer-based)
- HPC integration with automatic SLURM job management and resource optimization
- Built-in analysis with CSV output and structure filtering
- Flexible configuration via YAML input files
- Error reduction and efficient resource utilization
- Machine-learning interatomic potential (MLIP) integration using pretrained mace and FairChem models for fast geometry optimisation, molecular dynamics, and more.
Installation
Development Installation
# Pip install [Recommended]
pip install "chemrefine @ git+https://github.com/sterling-group/ChemRefine.git@main"
# Installing from Source
git clone https://github.com/sterling-group/ChemRefine.git
cd ChemRefine
# Install in development mode
pip install -e .
Requirements
Everything is managed through the pip installation.
- Python >= 3.6 and < 3.13 with the following dependencies:
  - numpy: Numerical computations
  - pyyaml: YAML configuration parsing
  - pandas: Data analysis and CSV handling
  - ase: Geometry handling and optimisation
  - mace-torch: Machine learning force fields
  - torch == 2.5.1: Machine learning (later PyTorch versions might not work with UMA models)
- ORCA 6.0+ - Quantum chemistry calculations
- SLURM - Job scheduling system
- MLIP Engines - MACE, FAIRChem, Sevenn, Orb
Tutorial
Examples of the calculations reported in our publication are available in our Tutorial.
Quick Start
1. Prepare Input Files
Create the required input files in your working directory:
- YAML Configuration (input.yaml): Defines the workflow steps
- Initial XYZ (step1.xyz): Starting molecular geometry
- ORCA Templates (step1.inp, step2.inp, step3.inp, ..., orca.slurm.header, mlff.slurm.header): Calculation templates for each step
You must provide one ORCA input file (e.g., step1.inp, step2.inp, etc.) for each step defined in your input.yaml configuration file. These files must be located in the template directory you defined. For example, if your input.yaml specifies three ORCA steps, then you need three corresponding ORCA input files in your templates directory.
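The one-template-per-step rule can be checked before submitting anything. The sketch below is illustrative only and is not ChemRefine's own validation: the function name `missing_templates` is hypothetical, and it assumes a step uses its `template` entry when present and otherwise a file named `step<N>.inp`.

```python
from pathlib import Path


def missing_templates(steps, template_dir):
    """Return expected ORCA template filenames that are absent.

    Assumption (not confirmed by ChemRefine): a step uses its "template"
    entry if present, otherwise a file named step<N>.inp.
    """
    template_dir = Path(template_dir)
    missing = []
    for step in steps:
        name = step.get("template", f"step{step['step']}.inp")
        if not (template_dir / name).is_file():
            missing.append(name)
    return missing
```

Here `steps` is the already-parsed `steps` list from input.yaml (for example via `yaml.safe_load`). An empty result means every step has a template file.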
ChemRefine integrates MLIPs through ORCA's ExtOpt tool, which pairs the ORCA optimization routines with ASE. As a result, you can use any ORCA optimization function with MLIPs.
In addition to these input files, you must include one of each:
- cpu.slurm.header: A SLURM submission script header with your cluster-specific job settings (e.g., partition, time limit, memory).
- cuda.slurm.header: Required for MLFF jobs. Include your GPU node configuration here so MLFF calculations run under SLURM.
2. Run the Workflow
# Basic usage
chemrefine input.yaml
# With custom core count
chemrefine input.yaml --maxcores 128
# Background execution (recommended for HPC)
nohup chemrefine input.yaml --maxcores 128 &
# Skip any step (if already completed)
chemrefine input.yaml --skip
Error Correction
DFT and MLIP calculations often fail, which interrupts the workflow. ChemRefine uses a caching system that saves a JSON file and a pickle file containing all of the variables for that step in a _cache directory inside the step folder. This allows ChemRefine to continue to the next step if the workflow gets interrupted. If calculations die, the following options help recover:
# 1st: Re-run failed calculations (may require adjusting their parameters)
chemrefine input.yaml --rerun_errors
# 2nd: Rebuild the cache
chemrefine input.yaml --rebuild_cache
# Optional: if re-running normal mode sampling and you don't want to run the current step
chemrefine input.yaml --rebuild_nms
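The caching idea described above (a JSON file plus a pickle inside a `_cache` directory in each step folder) can be sketched as follows. This is a minimal illustration, not ChemRefine's implementation: the file names `state.json` and `state.pkl` and both function names are hypothetical.

```python
import json
import pickle
from pathlib import Path


def save_step_cache(step_dir, state):
    """Write the step state to <step_dir>/_cache as JSON and pickle."""
    cache_dir = Path(step_dir) / "_cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # JSON copy for human inspection; default=str stringifies non-JSON types.
    (cache_dir / "state.json").write_text(json.dumps(state, indent=2, default=str))
    # Pickle copy preserves the Python objects exactly.
    with open(cache_dir / "state.pkl", "wb") as fh:
        pickle.dump(state, fh)


def load_step_cache(step_dir):
    """Return the cached state, or None if the step was never cached."""
    path = Path(step_dir) / "_cache" / "state.pkl"
    if not path.is_file():
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A workflow driver could call `load_step_cache` first and only run the step when it returns `None`.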
3. Monitor Progress
The tool provides detailed logging and creates organized output directories for each step:
step1/ # Conformer generation outputs
step2/ # First refinement level outputs
step3/ # Final high-level calculations
steps.csv # Summary of energies and structures
ChemRefine Operations and Engines
Operations
Operation | Description |
---|---|
OPT+SP | General optimization followed by a single-point calculation |
DOCKER | Host–guest docking workflow |
SOLVATOR | Explicit solvation for a molecule |
PES | Parse potential energy surface (PES) scan energies |
MLFF_TRAIN | Train or fine-tune a machine-learned force field (MLFF) |
Engines
1. DFT
- Description: Quantum mechanical electronic structure calculations (e.g., ORCA).
- Usable operations: OPT+SP, DOCKER, SOLVATOR, PES
2. MLFF
- Description: Machine-learned force fields (fast surrogates for DFT).
- Usable operations: OPT+SP, DOCKER, SOLVATOR, PES, MLFF_TRAIN
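The engine and operation pairings listed above can be expressed as a small lookup table. The sketch below only encodes what those lists state; the names `ALLOWED_OPERATIONS` and `check_step` are hypothetical, and operations that appear elsewhere in this README but not in the lists (such as GOAT) are deliberately left out.

```python
# Engine -> operations, copied from the "Usable operations" lists above.
ALLOWED_OPERATIONS = {
    "DFT": {"OPT+SP", "DOCKER", "SOLVATOR", "PES"},
    "MLFF": {"OPT+SP", "DOCKER", "SOLVATOR", "PES", "MLFF_TRAIN"},
}


def check_step(engine, operation):
    """Raise ValueError if the engine/operation pair is not listed above."""
    allowed = ALLOWED_OPERATIONS.get(engine)
    if allowed is None:
        raise ValueError(f"Unknown engine: {engine!r}")
    if operation not in allowed:
        raise ValueError(f"{operation!r} is not listed for engine {engine!r}")
    return True
```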
(a) UMA Models
Model Variant | Task Types (Domain) |
---|---|
uma-s-1 | omol, oc20, omat, odac, omc |
uma-s1.1 | omol, oc20, omat, odac, omc |
eSEN-sm-direct | omol, oc20, omat, odac, omc |
eSEN-sm-conserving | omol, oc20, omat, odac, omc |
Task type domains:
- omol → molecules
- oc20 → catalysis
- omat → inorganic materials
- odac → MOFs
- omc → molecular crystals
(b) MACE Models
Task Type | Domain / Intended Use |
---|---|
mace_off | MACE potential trained on the SPICE dataset (small, medium, large) |
mace_omol | MACE potential trained on OMol25 (extra-large model) |
mace_mp | MACE potential trained on inorganic materials (Materials Project) |
Input Files Description
YAML Configuration File
template_dir: <location of template_files>
scratch_dir: <location of your scratch directory>
output_dir: <location of your output directory>
orca_executable: <location of your ORCA executable>
charge: 0
multiplicity: 1
steps:
  - step: 1
    template: "step1.inp"
    operation: "GOAT"
    engine: "DFT"
    sampling:
      method: "integer"
      parameters:
        count: 10
  - step: 2
    operation: "OPT+SP"
    engine: "DFT"
    charge: -1        # <--- Step-specific override
    multiplicity: 2   # <--- Step-specific override
    sampling:
      method: "energy_window"
      parameters:
        window: 0.5
  - step: 3
    operation: "OPT+SP"
    engine: "MLFF"
    mlff:
      model_name: "medium"    # For MACE: "small", "medium", "large"; for FairChem: "uma-s-1"
      task_name: "mace_off"   # For MACE: "mace_off" or "mace_mp"; for FairChem: oc20, omat, omol, odac, omc
      bind: '127.0.0.1:8888'  # ChemRefine uses a local server to avoid initializing the model multiple times; only adjust this if you know what you're doing.
    sample_type:
      method: "integer"
      parameters:
        num_structures: 15
      # Alternative:
      # method: "energy_window"
      # parameters:
      #   energy: 1
      #   unit: kcal/mol
  - step: 4
    operation: "SOLVATOR"
    engine: "MLFF"
    model_name: "uma-s-1"
    task_name: "omol"
    sampling:
      method: "integer"
      parameters:
        num_structures: 1
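The configuration above sets `charge` and `multiplicity` globally and overrides them in step 2. One plausible way to resolve such overrides is sketched below, assuming a step-level value always wins over the global one. The function name and the fallback defaults of 0 and 1 are assumptions, not ChemRefine's documented behavior.

```python
def resolve_charge_multiplicity(config, step):
    """Return (charge, multiplicity) for one step.

    Assumption: a value set on the step overrides the top-level value,
    and a neutral singlet (0, 1) is used if neither is given.
    """
    charge = step.get("charge", config.get("charge", 0))
    multiplicity = step.get("multiplicity", config.get("multiplicity", 1))
    return charge, multiplicity
```

With the example configuration, step 1 would resolve to `(0, 1)` and step 2 to `(-1, 2)`.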
The optional MLFF step uses a pretrained model from mace or FairChem. By default the mace-off backend with the "medium" model is used, but you can select different backends and models via model_name and task_name. The task_name setting selects which training data the model was trained on.
If a CUDA-capable GPU is detected, the MLFF optimisation runs on the GPU; otherwise it falls back to the CPU automatically.
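The GPU-or-CPU fallback can be pictured with a short PyTorch check. This is a sketch of the general pattern, not ChemRefine's code: `pick_device` is a hypothetical name, and the import is wrapped so the example also runs where PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a usable GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the sketch runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```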
To avoid downloading the model each time, set the environment variable CHEMREFINE_MLFF_CHECKPOINT to the path of a locally downloaded checkpoint, or place the file as chemrefine/models/<model>.model within this repository.
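The two checkpoint locations described above could be searched as in the sketch below. The function name `resolve_checkpoint` is hypothetical, and the order (environment variable first, then the repository path) is an assumption, since the text does not say which takes precedence.

```python
import os
from pathlib import Path


def resolve_checkpoint(model, repo_root="."):
    """Return a local checkpoint path, or None if a download is needed."""
    # Assumption: the environment variable takes precedence when set.
    env_path = os.environ.get("CHEMREFINE_MLFF_CHECKPOINT")
    if env_path:
        return Path(env_path)
    local = Path(repo_root) / "chemrefine" / "models" / f"{model}.model"
    if local.is_file():
        return local
    return None
```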
ORCA Template Files
- First Input File (step1.inp):
  - Generally includes GOAT specifications for conformer optimization or another conformer sampler.
  - Uses a cheap level of theory (e.g., XTB) for initial sampling
  - Example:
    ! GOAT XTB
- Subsequent Input Files (step2.inp, step3.inp, etc.):
  - Progressive refinement with higher-level methods
  - Recommended: Include frequency calculations in the final step
  - Example:
    ! B3LYP def2-TZVP FREQ
- Initial XYZ File (step1.xyz):
  - Starting molecular geometry
  - Standard XYZ format with atom count, comment line, and coordinates
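For readers unfamiliar with the XYZ layout mentioned above (atom count, comment line, then one line per atom), here is a minimal parser sketch. It is for illustration only, handles a single structure, and `parse_xyz` is a hypothetical name; in practice ASE's readers cover this.

```python
def parse_xyz(text):
    """Parse a single-structure XYZ string into (comment, atoms).

    atoms is a list of (symbol, x, y, z) tuples with float coordinates.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: atom count
    comment = lines[1] if len(lines) > 1 else ""  # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:           # remaining lines: coordinates
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms
```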
Sampling Methods
Energy Window
method: "energy_window"
parameters:
  window: 0.5 # Hartree
Selects conformers within specified energy range of the global minimum.
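The energy-window rule amounts to keeping every conformer whose energy lies within `window` of the lowest one. A minimal sketch, assuming energies and window share the same unit (Hartree here); `select_energy_window` is a hypothetical name, not ChemRefine's function.

```python
def select_energy_window(energies, window):
    """Return indices of conformers within `window` of the minimum energy."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= window]
```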
Boltzmann Population
method: "boltzmann"
parameters:
  percentage: 95 # Cumulative population %
Selects conformers based on Boltzmann population at given temperature.
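Boltzmann selection weights each conformer by $\exp(-\Delta E / k_B T)$ and keeps the lowest-energy conformers until their cumulative population reaches the requested percentage. The sketch below assumes energies in Hartree and a default temperature of 298.15 K; both the function name and the default temperature are assumptions, since the text does not state which temperature ChemRefine uses.

```python
import math

K_B_HARTREE = 3.166811563e-6  # Boltzmann constant in Hartree per kelvin


def select_boltzmann(energies, percentage, temperature=298.15):
    """Return indices (lowest energy first) covering `percentage` % population."""
    e_min = min(energies)
    kt = K_B_HARTREE * temperature
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(weights)
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += 100.0 * weights[i] / total
        if cumulative >= percentage:
            break
    return selected
```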
Integer Count
method: "integer"
parameters:
  count: 10 # Number of conformers
Selects the N lowest-energy conformers.
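Integer selection is a sort followed by a slice. A minimal sketch with a hypothetical function name:

```python
def select_lowest(energies, count):
    """Return indices of the `count` lowest-energy conformers, lowest first."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    return order[:count]
```

If fewer than `count` conformers are available, all of them are returned.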
Example Multi-Step Workflows
The tool supports complex multi-step refinement protocols:
1. Step 1: GOAT or other conformer generation (XTB level)
2. Step 2: Machine-learning interatomic potential optimization (uma-s-1/omol)
3. Step 3: DFT geometry optimization (B3LYP/def2-SVP)
4. Step 4: High-level single points (B3LYP/def2-TZVP + frequencies)
Resource Management
- Automatic core allocation based on ORCA PAL settings
- Intelligent job queuing to maximize cluster utilization
- Real-time monitoring of SLURM job status
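Core allocation from ORCA PAL settings can be illustrated by reading the core count from an input file and dividing the `--maxcores` budget by it. This sketch relies on the two standard ORCA forms (`%pal nprocs N end` and the `! PALN` keyword); the function names are hypothetical, and how ChemRefine actually schedules jobs is not described here.

```python
import re


def cores_from_orca_input(text):
    """Return the core count requested by an ORCA input (1 if unspecified)."""
    block = re.search(r"%pal\b.*?nprocs\s+(\d+)", text,
                      flags=re.IGNORECASE | re.DOTALL)
    if block:
        return int(block.group(1))
    for line in text.splitlines():
        if line.lstrip().startswith("!"):  # simple-input keyword line
            keyword = re.search(r"\bPAL(\d+)\b", line, flags=re.IGNORECASE)
            if keyword:
                return int(keyword.group(1))
    return 1


def max_parallel_jobs(cores_per_job, maxcores):
    """Return how many jobs of this size fit in the core budget."""
    return maxcores // cores_per_job
```

For example, an input requesting 8 cores under `--maxcores 128` leaves room for 16 concurrent jobs.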
Project Structure
chemrefine/
├── src/chemrefine # Main package code
├── Examples/ # Example input files and SLURM scripts
├── README.md # This file
├── LICENSE # License
└── pyproject.toml # Package configuration
Contributing
We welcome contributions! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
Citation
If you use ChemRefine in your research, please cite:
@software{ChemRefine,
title={ChemRefine},
author={Ignacio Migliaro and Markus G. S. Weiss and Alistair J. Sterling},
url={https://doi.org/10.26434/chemrxiv-2025-cvg1x},
year={2025}
}
License
This project is licensed under the GNU Affero General Public License; see the LICENSE file for details.
Support
For questions, issues, or feature requests:
- 📧 Email: ignacio.migliaro@utdallas.edu
- 🐛 Issues: GitHub Issues
- 📖 Documentation: README.md