A comprehensive benchmarking framework for evaluating material generation models across multiple metrics including validity, distribution, diversity, novelty, uniqueness, and stability.
```bash
# Clone the repository
git clone https://github.com/LeMaterial/lemat-genbench.git
cd lemat-genbench

# Install dependencies
uv sync

# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Set up UMA access (required for stability and distribution benchmarks)
huggingface-cli login

# Run a quick benchmark
uv run scripts/run_benchmarks.py --cifs notebooks --config comprehensive --name quick_test
```

- Python 3.11+
- uv package manager (recommended)
- HuggingFace account (for UMA model access)
- Clone the repository:

  ```bash
  git clone https://github.com/LeMaterial/lemat-genbench.git
  cd lemat-genbench
  ```

- Install dependencies:

  ```bash
  uv sync
  ```

- Activate the virtual environment:

  ```bash
  # On macOS/Linux:
  source .venv/bin/activate
  # On Windows:
  .venv\Scripts\activate
  ```

- Set up UMA model access (required for stability and distribution benchmarks):

  ```bash
  # Request access to the UMA model on HuggingFace
  # Visit: https://huggingface.co/facebook/UMA
  # Click "Request access" and wait for approval
  # Login to the HuggingFace CLI
  huggingface-cli login
  # Enter your HuggingFace token when prompted
  ```
The UMA model is gated and requires special access. Follow these steps:
- Request Access:
  - Visit the UMA model page
  - Click the "Request access" button
  - Wait for approval (usually within 24 hours)

- Get a HuggingFace Token:
  - Go to HuggingFace Settings
  - Create a new token with "read" permissions
  - Copy the token

- Login via CLI:

  ```bash
  huggingface-cli login
  # Enter your token when prompted
  ```

- Verify Access:

  ```bash
  # Test UMA access
  uv run scripts/run_benchmarks.py --cifs notebooks --config comprehensive --name uma_test --families stability
  ```
- Charge Neutrality: Ensures structures are charge-balanced using oxidation state analysis and bond valence calculations
- Minimum Interatomic Distance: Validates that atomic distances exceed minimum thresholds based on atomic radii (see the sketch after this list)
- Coordination Environment: Checks if coordination numbers match expected values for each element
- Physical Plausibility: Validates density, lattice parameters, crystallographic format, and symmetry
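As a rough illustration of the minimum-interatomic-distance idea, a pymatgen-based check might look like the sketch below. This is not the framework's implementation: the single global `min_dist` cutoff is a simplifying assumption, whereas the benchmark derives per-pair thresholds from atomic radii.

```python
from pymatgen.core import Structure

def passes_min_distance(structure: Structure, min_dist: float = 0.5) -> bool:
    """Check that every pair of sites is farther apart than min_dist (angstroms).

    Illustrative only: the benchmark scales per-pair thresholds by atomic
    radii instead of using one global cutoff.
    """
    dmat = structure.distance_matrix  # pairwise, PBC-aware distances
    n = len(structure)
    return all(dmat[i][j] >= min_dist for i in range(n) for j in range(i + 1, n))

# Usage: passes_min_distance(Structure.from_file("my_structure.cif"))
```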
- Jensen-Shannon Distance (JSD): Measures similarity of categorical properties (space groups, crystal systems, elemental compositions) between generated and reference materials (see the sketch after this list)
- Maximum Mean Discrepancy (MMD): Measures similarity of continuous properties (volume, density) between generated and reference materials using kernel methods
- Fréchet Distance: Measures similarity of learned structural representations (embeddings) from MLIPs (ORB, MACE, UMA) between generated and reference materials
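For intuition about the JSD metric, the sketch below computes the Jensen-Shannon distance between two space-group frequency distributions with SciPy. It is a simplified stand-in, not the benchmark's code; using space-group numbers as the categorical support is taken from the list above.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def spacegroup_jsd(generated_sgs: list[int], reference_sgs: list[int]) -> float:
    """Jensen-Shannon distance between space-group frequency distributions.

    Inputs are lists of space-group numbers (1-230), one per structure.
    Returns a value in [0, 1]: 0 = identical distributions, 1 = disjoint.
    """
    gen, ref = Counter(generated_sgs), Counter(reference_sgs)
    support = range(1, 231)  # all 230 space groups
    p = np.array([gen.get(sg, 0) for sg in support], dtype=float)
    q = np.array([ref.get(sg, 0) for sg in support], dtype=float)
    return float(jensenshannon(p / p.sum(), q / q.sum(), base=2))
```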
- Element Diversity: Measures variety of chemical elements used across generated structures using Vendi scores and Shannon entropy (see the sketch after this list)
- Space Group Diversity: Measures variety of crystal symmetries (space groups) present in generated structures
- Site Number Diversity: Measures variety in the number of atomic sites per structure
- Physical Size Diversity: Measures variety in physical properties (density, lattice parameters, packing factor) compared to uniform distribution
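The Shannon-entropy component of element diversity is straightforward to sketch. This is a hedged illustration: the framework's exact weighting, and its Vendi-score counterpart, are not reproduced here, and counting each distinct element once per structure is a simplification.

```python
import math
from collections import Counter

def element_shannon_entropy(structures) -> float:
    """Shannon entropy (bits) of the element distribution across structures.

    `structures` is an iterable of pymatgen Structure objects; each distinct
    element is counted once per structure in this simplified version.
    """
    counts = Counter(
        el.symbol for s in structures for el in s.composition.elements
    )
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```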
- Novelty Ratio: Fraction of generated structures NOT present in LeMat-Bulk reference dataset
- BAWL Fingerprinting: Uses BAWL structure hashing to efficiently compare against ~5M known materials (see the sketch after this list)
- Structure Matcher: Alternative method using pymatgen StructureMatcher for structural comparison
- Reference Comparison: Measures how many structures are truly novel vs. known materials
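Conceptually, the novelty check reduces to set membership over fingerprints, as in the schematic below. The `fingerprint` argument stands in for the BAWL hasher, which this sketch does not implement.

```python
def novelty_ratio(structures, reference_fingerprints: set, fingerprint) -> float:
    """Fraction of structures whose fingerprint is absent from the reference set.

    `reference_fingerprints` would hold the ~5M LeMat-Bulk hashes;
    `fingerprint` is a placeholder for the BAWL hashing function.
    """
    novel = sum(1 for s in structures if fingerprint(s) not in reference_fingerprints)
    return novel / len(structures)
```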
- Uniqueness Ratio: Fraction of unique structures within the generated set (internal diversity)
- BAWL Fingerprinting: Uses BAWL structure hashing to identify duplicate structures efficiently
- Structure Matcher: Alternative method using pymatgen StructureMatcher for structural comparison (see the sketch after this list)
- Duplicate Detection: Counts and reports duplicate structures within the generated set
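Since pymatgen's StructureMatcher is the documented alternative to fingerprint hashing, a minimal deduplication sketch looks like this (default tolerances; the benchmark's actual settings may differ):

```python
from pymatgen.analysis.structure_matcher import StructureMatcher

def uniqueness_ratio(structures) -> float:
    """Fraction of unique structures under StructureMatcher equivalence."""
    matcher = StructureMatcher()  # default tolerances, may differ from the benchmark's
    groups = matcher.group_structures(structures)  # one group per equivalence class
    return len(groups) / len(structures)
```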
- Stability Ratio: Fraction of structures with energy above hull ≤ 0 eV/atom (thermodynamically stable)
- Metastability Ratio: Fraction of structures with energy above hull ≤ 0.1 eV/atom (metastable)
- Mean E_Above_Hull: Average energy above hull across multiple MLIPs (ORB, MACE, UMA)
- Formation Energy: Average formation energy across multiple MLIPs (ORB, MACE, UMA)
- Relaxation Stability: RMSE between original and relaxed atomic positions
- Ensemble Statistics: Mean and standard deviation across MLIP predictions for uncertainty quantification
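The ensemble statistics amount to averaging per-structure predictions across MLIPs. The sketch below shows the idea with a made-up input layout (a dict of per-model prediction lists), which is an assumption rather than the framework's data structure.

```python
import numpy as np

def ensemble_stability_summary(e_above_hull: dict[str, list[float]]) -> dict:
    """Summarize per-structure e_above_hull predictions from several MLIPs.

    `e_above_hull` maps model name (e.g. "orb", "mace", "uma") to one
    prediction per structure - a hypothetical layout for illustration.
    """
    preds = np.array(list(e_above_hull.values()))  # shape: (n_models, n_structures)
    mean, std = preds.mean(axis=0), preds.std(axis=0)
    return {
        "stability_ratio": float((mean <= 0.0).mean()),
        "metastability_ratio": float((mean <= 0.1).mean()),
        "mean_e_above_hull": float(mean.mean()),
        "mean_uncertainty": float(std.mean()),  # ensemble disagreement
    }
```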
- Production HHI: Measures supply risk based on concentration of element production sources (market concentration)
- Reserve HHI: Measures long-term supply risk based on concentration of element reserves (geographic distribution)
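The Herfindahl-Hirschman Index is the sum of squared market shares, aggregated here over a structure's composition. A composition-weighted sketch follows; the per-element HHI tables from the Mansouri Tehrani et al. reference are assumed to be available as a plain dict.

```python
def structure_hhi(composition: dict[str, float], element_hhi: dict[str, float]) -> float:
    """Composition-weighted average of per-element HHI values.

    `composition` maps element symbol to atomic fraction (summing to 1);
    `element_hhi` maps element symbol to its production or reserve HHI,
    i.e. the sum of squared country shares, conventionally scaled to 0-10000.
    """
    return sum(frac * element_hhi[el] for el, frac in composition.items())

# Example with made-up HHI values:
# structure_hhi({"Li": 0.5, "O": 0.5}, {"Li": 2900.0, "O": 500.0})
```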
- SUN Rate: Fraction of structures that are simultaneously stable (e_above_hull ≤ 0), unique, and novel
- MetaSUN Rate: Fraction of structures that are simultaneously metastable (e_above_hull ≤ 0.1), unique, and novel
- Combined Rate: Fraction of structures that are either stable or metastable, unique, and novel
- Efficient Computation: Uses hierarchical filtering (uniqueness → novelty → stability) for optimal performance
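The cheapest-first filtering order is what keeps SUN tractable: the expensive MLIP stability step only touches structures that already passed uniqueness and novelty. A schematic of the idea (the predicate functions are placeholders, not the framework's API):

```python
def sun_rate(structures, is_unique, is_novel, e_above_hull, threshold=0.0) -> float:
    """Fraction of structures that are unique, novel, and stable.

    Filters cheapest-first so the expensive stability step sees few candidates.
    """
    unique = [s for s in structures if is_unique(s)]    # fingerprint dedup (cheap)
    novel = [s for s in unique if is_novel(s)]          # reference lookup (medium)
    stable = [s for s in novel if e_above_hull(s) <= threshold]  # MLIP (expensive)
    return len(stable) / len(structures)
```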
```bash
# Run all benchmark families on CIF files in a directory
uv run scripts/run_benchmarks.py --cifs /path/to/cif/directory --config comprehensive --name my_benchmark

# Run specific benchmark families
uv run scripts/run_benchmarks.py --cifs structures.txt --config comprehensive --families validity novelty --name custom_run

# Use a file list instead of a directory
uv run scripts/run_benchmarks.py --cifs my_structures.txt --config comprehensive --name file_list_run

# Load structures from a CSV file
uv run scripts/run_benchmarks.py --csv my_structures.csv --config comprehensive --name csv_benchmark

# Run specific families on CSV input
uv run scripts/run_benchmarks.py --csv structures.csv --config comprehensive --families validity diversity --name csv_quick_test

# Use structure-matcher for fingerprinting (alternative to BAWL)
uv run scripts/run_benchmarks.py \
  --cifs submissions/test \
  --config comprehensive_structure_matcher \
  --name test_run_structure_matcher \
  --fingerprint-method structure-matcher
```

Point to a directory containing CIF files:

```bash
uv run scripts/run_benchmarks.py --cifs /path/to/cif/directory --config comprehensive --name my_run
```

Create a text file with CIF paths:

```text
# my_structures.txt
path/to/structure1.cif
path/to/structure2.cif
path/to/structure3.cif
```

Then run:

```bash
uv run scripts/run_benchmarks.py --cifs my_structures.txt --config comprehensive --name my_run
```

Load structures directly from a CSV file containing structure data:

```bash
# Load from CSV file
uv run scripts/run_benchmarks.py --csv my_structures.csv --config comprehensive --name my_csv_run
```

CSV Format Requirements:
- Must contain a column named `structure`, `LeMatStructs`, or `cif_string`
- The structure column should contain either:
  - JSON strings (pymatgen Structure dictionaries) - recommended
  - CIF strings (CIF format text)
Example CSV format:
```csv
material_id,structure,other_metadata
0,"{""@module"": ""pymatgen.core.structure"", ""@class"": ""Structure"", ""lattice"": {...}, ""sites"": [...]}",metadata1
1,"{""@module"": ""pymatgen.core.structure"", ""@class"": ""Structure"", ""lattice"": {...}, ""sites"": [...]}",metadata2
```
Note: You can only use one input method at a time (`--cifs` or `--csv`, not both).
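To produce a compliant CSV from pymatgen structures, something like the sketch below works. The toy NaCl structure and output filename are placeholders, and `csv.QUOTE_ALL` takes care of the embedded double quotes shown in the example above.

```python
import csv
import json

from pymatgen.core import Lattice, Structure

# Toy structure; in practice these come from your generative model.
structure = Structure(Lattice.cubic(4.2), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

with open("my_structures.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["material_id", "structure"])
    # Serialize each Structure as a JSON string (the recommended format).
    writer.writerow([0, json.dumps(structure.as_dict())])
```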
- `--cifs`: Path to a directory or file list (use either `--cifs` or `--csv`)
- `--csv`: Path to a CSV file containing structures (use either `--cifs` or `--csv`)
- `--config`: Configuration name (default: `comprehensive`)
- `--name`: Name for this benchmark run (required)
- `--families`: Specific benchmark families to run (optional, defaults to all)
- `--fingerprint-method`: Fingerprinting method to use (`bawl`, `short-bawl`, `structure-matcher`, `pdd`)
| Family | Description | Computational Cost |
|---|---|---|
| `validity` | Fundamental structure validation (charge, distance, plausibility) | Low |
| `distribution` | Distribution similarity (JSD, MMD, Fréchet distance) | Medium |
| `diversity` | Structural diversity (element, space group, site number, physical) | Low |
| `novelty` | Novelty vs. LeMat-Bulk reference dataset | Medium |
| `uniqueness` | Internal uniqueness within the generated set | Low |
| `stability` | Thermodynamic stability (formation energy, e_above_hull) | High |
| `hhi` | Supply risk assessment (production/reserve concentration) | Low |
| `sun` | Composite metric (Stability + Uniqueness + Novelty) | High |
- `comprehensive.yaml` - All benchmark families using BAWL fingerprinting (default)
- `comprehensive_structure_matcher.yaml` - All benchmark families using structure-matcher
- `comprehensive_new.yaml` - Enhanced benchmarks with augmented fingerprinting
- `validity.yaml` - Validity metrics only
- `distribution.yaml` - Distribution metrics only
- `diversity.yaml` - Diversity metrics only
- `novelty.yaml` - Novelty metrics only
- `uniqueness.yaml` - Uniqueness metrics only
- `stability.yaml` - Stability metrics only
- `hhi.yaml` - HHI metrics only
- `sun.yaml` - SUN metrics only
| Method | Description | Speed | Memory Usage |
|---|---|---|---|
| `bawl` | Full BAWL fingerprinting | Fast | Low |
| `short-bawl` | Shortened BAWL fingerprinting (default) | Fast | Low |
| `structure-matcher` | PyMatGen StructureMatcher comparison | Slow | High |
| `pdd` | Packing density descriptor | Medium | Medium |
```bash
# Run only validity checks
uv run scripts/run_benchmarks.py --cifs structures/ --config validity --families validity --name validity_only

# Run only stability analysis
uv run scripts/run_benchmarks.py --cifs structures/ --config stability --families stability --name stability_only
```

```bash
# Run validity and novelty (low + medium cost)
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --families validity novelty --name validity_novelty

# Run diversity, uniqueness, and HHI (all low cost)
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --families diversity uniqueness hhi --name diversity_analysis

# Run distribution and stability (medium + high cost)
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --families distribution stability --name distribution_stability

# Run novelty, uniqueness, and SUN (medium + low + high cost)
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --families novelty uniqueness sun --name novelty_sun
```

```bash
# Run all benchmark families
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --name full_analysis

# or explicitly specify all families
uv run scripts/run_benchmarks.py --cifs structures/ --config comprehensive --families validity distribution diversity novelty uniqueness stability hhi sun --name explicit_full
```

```bash
# Use structure-matcher instead of BAWL fingerprinting for more accurate structural comparison
uv run scripts/run_benchmarks.py \
  --cifs submissions/test \
  --config comprehensive_structure_matcher \
  --name test_run_structure_matcher \
  --fingerprint-method structure-matcher

# Run specific families with structure-matcher
uv run scripts/run_benchmarks.py \
  --cifs structures/ \
  --config comprehensive \
  --families novelty uniqueness \
  --name novelty_uniqueness_matcher \
  --fingerprint-method structure-matcher
```

Results are saved to the `results/` directory with the format:
```text
{run_name}_{config_name}_{timestamp}.json
```

Example: `my_benchmark_comprehensive_20241204_143022.json`
```json
{
  "run_info": {
    "run_name": "my_benchmark",
    "config_name": "comprehensive",
    "timestamp": "20241204_143022",
    "n_structures": 100,
    "benchmark_families": ["validity", "distribution", "diversity", ...]
  },
  "results": {
    "validity": { ... },
    "distribution": { ... },
    "diversity": { ... },
    ...
  }
}
```
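For post-processing, the results file is plain JSON and can be loaded with the standard library. A minimal sketch (the filename is hypothetical, and the exact keys inside each family's block vary by configuration):

```python
import json

# Hypothetical file following the {run_name}_{config_name}_{timestamp} pattern.
with open("results/my_benchmark_comprehensive_20241204_143022.json") as f:
    run = json.load(f)

print("structures evaluated:", run["run_info"]["n_structures"])
for family in run["run_info"]["benchmark_families"]:
    # Each family's metrics live under run["results"][family].
    print(family, "->", sorted(run["results"].get(family, {})))
```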
```bash
uv run scripts/run_benchmarks.py --cifs notebooks --config validity --name quick_validity

uv run scripts/run_benchmarks.py --cifs my_structures/ --config stability --name stability_analysis

uv run scripts/run_benchmarks.py --cifs structures.txt --config comprehensive --families validity novelty uniqueness --name custom_analysis
```

```bash
# Quick validation of CSV structures
uv run scripts/run_benchmarks.py --csv my_structures.csv --config validity --name csv_validity

# Full analysis of CSV structures
uv run scripts/run_benchmarks.py --csv generated_structures.csv --config comprehensive --name csv_full_analysis

# Distribution analysis only
uv run scripts/run_benchmarks.py --csv structures.csv --config distribution --families distribution --name csv_distribution
```

```bash
# Use SSH-optimized script for large datasets
uv run scripts/run_benchmarks_ssh.py --cifs large_dataset/ --config comprehensive --name large_run

# Structure-matcher with SSH optimization
uv run scripts/run_benchmarks_ssh.py \
  --cifs submissions/large_test \
  --config comprehensive_structure_matcher \
  --name large_test_structure_matcher \
  --fingerprint-method structure-matcher
```

- MMD Reference Sample: Uses 15K samples from LeMat-Bulk for computational efficiency
- MLIP Models: Require significant computational resources for stability benchmarks
- Memory Usage: Large structure sets may require substantial RAM
- Structure-Matcher: More accurate but more computationally expensive than BAWL fingerprinting
- UMA Model: Requires HuggingFace access approval
- ORB Models: Automatically downloaded on first use
- MACE Models: Cached locally after first download
- Formation Energy: Works with charged species (Cs+, Br-, etc.)
- E_above_hull: May fail for charged species (expected behavior)
- Warnings: Some warnings are informational and expected
- Small Sets: Use `--families` to run only the benchmarks you need
- Large Sets: Consider running benchmark families separately for memory efficiency
- Caching: Models are cached locally for faster subsequent runs
- SSH Optimization: Use `run_benchmarks_ssh.py` for high-core environments
- Fingerprinting: Use `structure-matcher` for accuracy, `short-bawl` for speed
- UMA Access Denied:

  ```bash
  # Ensure you're logged in
  huggingface-cli login
  # Check access status
  huggingface-cli whoami
  ```

- Memory Issues:

  ```bash
  # Run fewer families at once
  uv run scripts/run_benchmarks.py --cifs structures/ --config validity --families validity --name memory_test
  ```

- Timeout Errors:
  - Reduce the structure count
  - Use faster MLIP models (ORB instead of UMA)
  - Increase the timeout in the configuration

- Private Dataset Access Error:

  ```bash
  # Error: 'Entalpic/LeMaterial-Above-Hull-dataset' doesn't exist on the Hub
  # Solution: Download datasets locally (one-time setup)
  uv run scripts/download_above_hull_datasets.py
  ```

  This downloads the required datasets to the `data/` folder for local access.

- Structure-Matcher Performance:
  - Structure-matcher is more accurate but much slower than BAWL
  - Consider using it for smaller datasets or when accuracy is critical
  - Use the SSH-optimized script for large datasets
- Check the scripts documentation
- Review example configurations in `src/config/`
- Examine the test files for usage patterns
- LeMat-Bulk Dataset: HuggingFace - Siron, Martin, et al. "LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases." AI for Accelerated Materials Design-ICLR 2025.
- UMA Model: HuggingFace - Wood, Brandon M., et al. "UMA: A Family of Universal Models for Atoms." arXiv preprint arXiv:2506.23971 (2025).
- ORB Models: GitHub - Rhodes, Benjamin, et al. "Orb-v3: atomistic simulation at scale." arXiv preprint arXiv:2504.06231 (2025).
- MACE Models: GitHub - Batatia, Ilyes, et al. "MACE: Higher order equivariant message passing neural networks for fast and accurate force fields." Advances in neural information processing systems 35 (2022): 11423-11436.
- Fréchet Distance: FCD Implementation - Measures similarity between embedding distributions
- Maximum Mean Discrepancy (MMD): Gretton et al. (2012) - Gretton, Arthur, et al. "A kernel two-sample test." The Journal of Machine Learning Research 13.1 (2012): 723-773.
- Jensen-Shannon Distance: Lin (1991) - Lin, Jianhua. "Divergence measures based on the Shannon entropy." IEEE Transactions on Information Theory 37.1 (1991): 145-151.
- Vendi Score: Friedman & Dieng (2023) - Friedman, Dan, and Adji Bousso Dieng. "The vendi score: A diversity evaluation metric for machine learning." arXiv preprint arXiv:2210.02410 (2022).
- Herfindahl-Hirschman Index (HHI): Mansouri Tehrani et al. (2017) - Mansouri Tehrani, Aria, et al. "Balancing mechanical properties and sustainability in the search for superhard materials." Integrating Materials and Manufacturing Innovation 6.1 (2017): 1-8.
This project is licensed under the Apache License - see the LICENSE file for details.
