Goldilocks K-Points: ML Models for Predicting K-Point Density in DFT Calculations

How do you choose parameters for your DFT calculations?

This package provides machine learning models to predict optimal k-point density (k-dist) for SCF total energy calculations with plane-wave DFT codes for inorganic 3D materials. All models take as input the structure and/or composition of the compound and output k-dist, which is expected to guarantee convergence of total energy calculations while minimizing computational time.

Overview

The package implements multiple machine learning approaches for predicting k-point density:

Graph Neural Networks (GNNs): CGCNN and ALIGNN models that learn from crystal structures
Transformer Models: CrabNet for composition-based predictions
Ensemble Methods: Random Forest, Gradient Boosting, and Histogram Gradient Boosting

The models support both regression and classification tasks, with advanced features for uncertainty quantification including robust regression, quantile regression, and conformal prediction.

Features

Implemented Models

CGCNN - Crystal Graph Convolutional Neural Network (paper)
ALIGNN - Atomistic Line Graph Neural Network (paper)
CrabNet - Transformer-based model for composition-based predictions (paper)
Random Forest - Ensemble method with quantile regression support (scikit-learn, sklearn-quantile)
Gradient Boosting Trees - XGBoost-style gradient boosting with quantile regression support (scikit-learn implementation)
Histogram Gradient Boosting - Fast gradient boosting implementation

Atomic (Node) Features

CGCNN features - Standard atomic embeddings from CGCNN paper
CGCNN features modified with energy and density cutoff - In addition to the CGCNN features, the following features are added: one-hot encoding of the energy cutoff, one-hot encoding of the density cutoff, and the type of pseudopotential. PCA is performed on the features to remove dimensions with no information content.
mat2vec features - Mat2vec embeddings were developed by Tshitoyan et al. using a skip-gram variant of the Word2vec method trained on 3.3 million scientific abstracts, and were originally used in the CrabNet model
SOAP features - Calculated for structures with all atoms substituted by a single atom type; not used, because they were not effective as atomic features

Compound-Level Features

Matminer composition features - Element property, stoichiometry, and valence orbital descriptors
Matminer structure features - Global symmetry and density descriptors
JarvisCFID features - JARVIS Crystal Fingerprint features, matminer implementation
SOAP features - Averaged over all atoms in the structure, calculated with DScribe
CGCNN embeddings - Features extracted from pre-trained CGCNN models. The pre-trained CGCNN model was trained on the MP 'is_metal' dataset (Autumn 2025)
MatSciBert embeddings - Generated from:
- QE SCF input files with k-points section removed, or
- Robocrystallographer structure descriptions

Graph Construction Options

Radius graph - All atoms within a cutoff radius are considered neighbors
CrystalNN graph - Uses CrystalNN algorithm to identify nearest neighbors based on chemical environment
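
The radius-graph option can be illustrated with a minimal pure-Python sketch. This is not the package's implementation: the function name radius_graph, its arguments, and the max_image parameter are hypothetical, and the search over periodic images is only complete when the cutoff is smaller than the distance covered by max_image neighbouring cells.

```python
import itertools
import math


def radius_graph(lattice, frac_coords, cutoff, max_image=1):
    """Return (i, j, distance) for every periodic neighbour pair within cutoff.

    lattice: three lattice vectors as rows (in Angstrom).
    frac_coords: fractional coordinates of the atoms in the cell.
    max_image: hypothetical parameter; number of neighbouring cells searched
    in each direction. It must be large enough for the chosen cutoff.
    """

    def to_cart(frac):
        # Fractional -> Cartesian: sum_k frac[k] * lattice[k]
        return [sum(frac[k] * lattice[k][d] for k in range(3)) for d in range(3)]

    cart = [to_cart(f) for f in frac_coords]
    edges = []
    images = itertools.product(range(-max_image, max_image + 1), repeat=3)
    for image in images:
        shift = to_cart(image)
        for i, r_i in enumerate(cart):
            for j, r_j in enumerate(cart):
                if i == j and image == (0, 0, 0):
                    continue  # skip an atom paired with itself in the home cell
                shifted = [r_j[d] + shift[d] for d in range(3)]
                dist = math.dist(r_i, shifted)
                if dist <= cutoff:
                    edges.append((i, j, dist))
    return edges
```

For a single-atom cell every edge connects the atom to one of its own periodic images, so all edges share the same site index.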

Loss Functions (for Uncertainty Estimation)

RobustL2 Loss - Gaussian distribution-based robust loss
RobustL1 Loss - Laplace distribution-based robust loss
StudentT Loss - Student's t-distribution with configurable degrees of freedom
Quantile Loss - Single quantile prediction
IntervalScoreLoss - Interval prediction with coverage guarantees
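
As a rough illustration of what some of these losses compute, here is a pure-Python sketch of the quantile (pinball) loss and of the Gaussian and Laplace negative log-likelihoods that underlie robust L2 and L1 losses. The function names and the log-scale parametrisation are assumptions made for this sketch; the package's own implementations may differ in detail.

```python
import math


def pinball_loss(y_true, y_pred, quantile):
    """Quantile (pinball) loss averaged over samples, for 0 < quantile < 1."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by quantile, over-prediction by 1 - quantile.
        total += max(quantile * diff, (quantile - 1.0) * diff)
    return total / len(y_true)


def robust_l2_loss(y_true, mean, log_std):
    """Gaussian negative log-likelihood, dropping the constant 0.5*log(2*pi)."""
    total = 0.0
    for y, m, s in zip(y_true, mean, log_std):
        total += 0.5 * ((y - m) / math.exp(s)) ** 2 + s
    return total / len(y_true)


def robust_l1_loss(y_true, mean, log_scale):
    """Laplace negative log-likelihood, dropping the constant log(2)."""
    total = 0.0
    for y, m, s in zip(y_true, mean, log_scale):
        total += abs(y - m) / math.exp(s) + s
    return total / len(y_true)
```

In both robust losses the model predicts a location and a log-scale per sample, so a large predicted scale down-weights the residual but is penalised by the additive log-scale term.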

Architecture Highlights

For GNN models, atomic features are used as input to the graph neural network, and compound-level features are concatenated to the features produced by the GNN encoder. This hybrid approach enables:

Transfer learning: Leveraging pre-trained models for feature extraction
Better structure learning: Addressing limitations of GNNs in learning certain structural features
Domain knowledge integration: Incorporating metallicity and other important predictors
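
The concatenation step can be sketched as follows. This is a simplified illustration, not the package's code: the function names are hypothetical, plain lists stand in for tensors, and mean pooling over atoms is an assumption about how the GNN encoder output is reduced to one vector per structure.

```python
def mean_pool(node_embeddings):
    """Average per-atom embeddings into a single structure-level vector."""
    n_atoms = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(node[i] for node in node_embeddings) / n_atoms for i in range(dim)]


def hybrid_representation(node_embeddings, compound_features):
    """Concatenate pooled GNN features with compound-level features."""
    return mean_pool(node_embeddings) + list(compound_features)
```

The resulting vector would then be passed to the prediction head, so compound-level descriptors such as metallicity reach the output without having to be learned by the graph encoder.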

Installation

Prerequisite: conda, miniconda, or micromamba installed.

Create environment with Python 3.11 or 3.12

conda create -n goldilocks_kpoints python=3.11
conda activate goldilocks_kpoints

Install poetry

conda install poetry

Install PyTorch Geometric and its dependencies

(torch_scatter, torch_sparse, etc. must be installed from binary wheels using pip and cannot be installed with Poetry).

pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cpu.html 
pip install torch_geometric

Install everything else with poetry

poetry install --no-root

Quick Start

1. Prepare Your Data in the CGCNN Format (see the CGCNN paper for details)

Create a CSV file (id_prop.csv) with two columns:

Column 0: Sample IDs (corresponding to {id}.cif files)
Column 1: Target values (k-dist values)

Place your CIF files in the same directory as the CSV file. By default, the data is expected to be stored in the 'data/your_project_name/' folder (included in .gitignore).
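
A small helper along the following lines can write the CSV file and check that every listed ID has a matching CIF file. The function names are hypothetical, and the header-less layout is an assumption based on the original CGCNN convention.

```python
import csv
from pathlib import Path


def write_id_prop(data_dir, targets):
    """Write id_prop.csv (no header row) from a {sample_id: k_dist} mapping."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with open(data_dir / "id_prop.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        for sample_id, k_dist in targets.items():
            writer.writerow([sample_id, k_dist])


def missing_cifs(data_dir):
    """Return sample IDs from id_prop.csv that have no {id}.cif next to it."""
    data_dir = Path(data_dir)
    with open(data_dir / "id_prop.csv", newline="") as handle:
        sample_ids = [row[0] for row in csv.reader(handle) if row]
    return [sid for sid in sample_ids if not (data_dir / f"{sid}.cif").is_file()]
```

Running missing_cifs on the data folder before training gives an early warning about IDs whose structure files were not copied.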

2. Use the Goldilocks Data (https://data-collections.psdi.ac.uk/records/75959-bwa52)

If you download the Goldilocks dataset and unpack it into 'data/', you will see a 'data/upload_version' folder containing 'summary.csv' and a 'structure_calc_details' folder. With this setup you can use 'data_preprocessing/goldilocks-pre-processing.py' to convert the data into a 'data/goldilocks/' folder with CGCNN-format data ready for model training.

python data_preprocessing/goldilocks-pre-processing.py --data_folder 'data/upload_version/structure_calc_details/' --data_file 'data/upload_version/summary.csv' --target_folder 'data/goldilocks'

3. Configure Your Experiment

Create or modify a configuration file in configs/ directory. Example configurations:

configs/cgcnn.yaml - For CGCNN model
configs/alignn.yaml - For ALIGNN model
configs/crabnet.yaml - For CrabNet model
configs/ensembles.yaml - For ensemble models

4. Train a Model

python scripts/train.py --config_file configs/cgcnn.yaml

ALIGNN LMDB: When building the ALIGNN LMDB, single-atom primitive cells (e.g. Cu) are automatically expanded into a 2×2×2 diagonal supercell so that periodic bonds get distinct site indices and the line graph is non-empty. You will see a UserWarning and a printed [alignn_graph] line (formula and site count). Regenerate your ALIGNN LMDB after pulling this change (set lmdb_exist: false in the config, or delete the existing LMDB files) so that the graphs include the supercell.

5. Make Predictions

python scripts/predict.py \
    --config_file configs/cgcnn.yaml \
    --checkpoint_path trained_models/cgcnn/ \
    --output_name output/predictions.csv

6. Conformalised Quantile Regression

To perform conformalised quantile regression, first train quantile models using the quantile loss, or a quantile regression forest (QRF) in the case of Random Forest. Then use the notebooks (available for RF and ALIGNN, and easily adapted to other models) to calculate conformal corrections to the intervals.
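
The conformal correction itself is a short calculation on a held-out calibration set. The sketch below follows the standard split-conformal recipe for quantile intervals; the function names are hypothetical and the notebooks may organise the computation differently.

```python
import math


def cqr_correction(y_cal, lower_cal, upper_cal, alpha=0.1):
    """Return the conformal correction for a target miscoverage rate alpha.

    The conformity score of each calibration sample is how far the true value
    falls outside the predicted interval (negative if it lies inside).
    """
    scores = sorted(
        max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lower_cal, upper_cal)
    )
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this alpha
    return scores[rank - 1]


def conformalise(lower, upper, correction):
    """Widen (or shrink, if correction < 0) predicted intervals symmetrically."""
    return [lo - correction for lo in lower], [hi + correction for hi in upper]
```

A positive correction means the raw quantile intervals were too narrow on the calibration data; a negative one means they can be tightened while keeping the target coverage.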

Model Training

Models are typically trained for 300 epochs with:

Early stopping: Monitors validation loss/metrics
Stochastic Weight Averaging (SWA): Optional, can be enabled
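
Early stopping on the validation loss amounts to the bookkeeping sketched below. The class and its patience and min_delta defaults are hypothetical; they are not taken from the package's configuration, which presumably relies on its training framework's own callback.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for a while."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With patience=2, for example, training stops after the second consecutive epoch that fails to beat the best validation loss seen so far.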

Training Options

Classification: Multi-class classification with class weights
Regression: Standard mean squared error or mean absolute error
Robust Regression: Estimates aleatoric uncertainty (predicts mean and std)
Quantile Regression: Predicts specific quantiles or intervals
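
For the classification option, one common way to obtain class weights is the inverse-frequency ("balanced") heuristic sketched here. Whether the package uses this exact scheme is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

Rare classes receive weights above 1 and frequent classes weights below 1, so each class contributes equally to the total loss.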

Dataset and Convergence Definition

The dataset and convergence definition were provided by Junwen Yin.

Calculation Parameters

All data was generated with fixed parameters:

Code: Quantum Espresso
Pseudopotentials: SSSP1.3_PBESol_efficiency library
Energy cutoffs: Recommended values for SSSP1.3_PBESol_efficiency
Smearing: Cold smearing with degauss=0.01 Ry
Magnetism: All compounds treated as non-magnetic

Convergence Criterion

A calculation is considered converged if the total energy change for 3 consecutive k-meshes with increasing number of points is within 1 meV/atom.
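
One way to read this criterion is sketched below. The interpretation is an assumption: three consecutive meshes count as converged when the spread (maximum minus minimum) of their total energies per atom is at most 1 meV. The function name and the choice to return the index of the first mesh in the window are likewise hypothetical.

```python
def first_converged_index(energies_ev_per_atom, tol_ev=1e-3, window=3):
    """Return the index of the first mesh that starts a converged window.

    energies_ev_per_atom: total energies per atom (eV), ordered by increasing
    number of k-points. Returns None if no window satisfies the tolerance.
    """
    for start in range(len(energies_ev_per_atom) - window + 1):
        chunk = energies_ev_per_atom[start:start + window]
        if max(chunk) - min(chunk) <= tol_ev:
            return start
    return None
```

The k-dist of the mesh at the returned index would then serve as the regression target for that structure under this reading.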

Project Structure

goldilocks_kpoints/
├── configs/              # Configuration files for different models
│   ├── cgcnn.yaml
│   ├── alignn.yaml
│   ├── ensembles.yaml
│   └── crabnet.json
├── data/                 # Data directory (CIF files, CSV files)
├── trained_models/       # The place to store trained models
├── outputs/              # The place to write outputs to
├── embeddings/
│   ├── atom_init_original.json
│   ├── atom_init_with_sssp_cutoffs.json
│   └── mat2vec.json
├── datamodules/          # PyTorch Lightning data modules
│   ├── gnn_datamodule.py
│   ├── crabnet_datamodule.py
│   └── lmdb_dataset.py
├── models/               # Model implementations
│   ├── cgcnn.py
│   ├── alignn.py
│   ├── crabnet.py
│   ├── ensembles.py
│   └── modelmodule.py
├── utils/                # Utility functions
│   ├── atom_features_utils.py
│   ├── compound_features_utils.py
│   ├── cgcnn_graph.py
│   ├── crabnet_utils.py
│   ├── alignn_graph.py
│   ├── utils.py
│   └── trained_is_metal_cgcnn/
│       └── is_metal.ckpt
├── scripts/              # Training and prediction scripts
│   ├── train.py
│   └── predict.py
├── notebooks/
│   ├── Data-exploration.ipynb
│   ├── RF-feature-importance.ipynb
│   ├── Surrogate-models.ipynb
│   ├── ALIGNN-CQR.ipynb
│   ├── RF-CQR.ipynb
│   └── Wall-time.ipynb
└── README.md

Citation

If you use this code in your research, please cite:

@article{goldilocks_kpoints,
  title = {Automatic generation of input files with optimised k-point meshes for Quantum Espresso self-consistent field single point total energy calculations},
  author = {Elena Patyukova and Junwen Yin and Susmita Basak and Jaehoon Cha and Alin Elena and Gilberto Teobaldi},
  year = {2025},
  url = {to be published}
}

License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
