NucleusDiff: Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design
This repository is the official implementation of the paper "Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design".
[Caltech News] [Project Page] [Paper]
Authors: Shengchao Liu*, Liang Yan*, Weitao Du, Weiyang Liu, Zhuoxinran Li, Hongyu Guo, Christian Borgs, Jennifer Chayes, Anima Anandkumar
(*: Equal Contribution)
Proceedings of the National Academy of Sciences 2025 (PNAS 2025)
The code has been tested in the following environment:
| Package | Version |
|---|---|
| Python | 3.8.13 |
| PyTorch | 1.12.1 |
| CUDA | 11.0 |
| PyTorch Geometric | 2.5.2 |
| RDKit | 2021.03.1b1 |
Install via Conda and Pip:
conda create -n "nucleusdiff" python=3.8.13
source activate nucleusdiff
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install torch_geometric
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/pyg_lib-0.3.1%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_cluster-1.6.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_scatter-2.1.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_sparse-0.6.16%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_spline_conv-1.2.1%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
conda install rdkit/label/nightly::rdkit
conda install openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge
pip install wandb
pip install pytorch-lightning==2.1.3
pip install matplotlib
pip install numpy==1.23
pip install accelerate
pip install transformers
# For Vina Docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3

The code should work with PyTorch >= 1.9.0 and PyG >= 2.0; you can adjust the package versions to suit your needs.
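A quick way to confirm the environment is usable is to import the core packages. This is a minimal sketch, not part of the official pipeline:

```python
# Quick import check for the main environment (packages installed above).
import torch
import torch_geometric
import rdkit
from rdkit import Chem

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("PyG:", torch_geometric.__version__)
print("RDKit:", rdkit.__version__)

# Round-trip a simple molecule to confirm RDKit works.
mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol
print("Parsed test molecule with", mol.GetNumAtoms(), "atoms")
```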
1.2 Dependencies for preprocessing the CrossDocked manifold data (this separate environment is only needed if you want to process the manifold dataset from scratch)
# We recommend using conda for environment management
conda create -n Manifold python=3.7.3
conda activate Manifold
pip install -r ./crossdock_manifold_data_preparation/requirements.txt
# install PyMesh for surface mesh processing
PYMESH_PATH="~/PyMesh" # substitute with your own PyMesh path
git clone https://github.com/PyMesh/PyMesh.git $PYMESH_PATH
cd $PYMESH_PATH
git submodule update --init
apt-get update
# make sure you have these libraries installed before building PyMesh
apt-get install cmake libgmp-dev libmpfr-dev libgmpxx4ldbl libboost-dev libboost-thread-dev libopenmpi-dev
cd $PYMESH_PATH/third_party
python build.py all # build third party dependencies
cd $PYMESH_PATH
mkdir build
cd build
cmake ..
make -j # if make fails, check for missing third-party dependencies
cd $PYMESH_PATH
python setup.py install
python -c "import pymesh; pymesh.test()"
# install meshplot
conda install -c conda-forge meshplot
# install libigl
conda install -c conda-forge igl
# download MSMS
MSMS_PATH="~/MSMS" # substitute with your own MSMS path
wget https://ccsb.scripps.edu/msms/download/933/ -O msms_i86_64Linux2_2.6.1.tar.gz
mkdir -p $MSMS_PATH # mark this directory as your $MSMS_bin for later use
tar zxvf msms_i86_64Linux2_2.6.1.tar.gz -C $MSMS_PATH
# install PyTorch 1.10.0 (e.g., with CUDA 11.3)
conda install pytorch==1.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
# install Manifold
pip install -e .
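As a quick check that the Manifold environment is complete, the following minimal sketch imports the mesh-processing and PyTorch packages installed above:

```python
# Quick import check for the Manifold environment (run after `conda activate Manifold`).
import pymesh        # surface mesh processing
import igl           # libigl bindings
import torch
import torch_scatter

print("PyTorch:", torch.__version__, "| torch-scatter:", torch_scatter.__version__)
print("PyMesh and libigl imported successfully")
```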
The data used for training and evaluating the model are organized in the nucleusdiff_data_and_checkpoint Google Drive folder.

- To train the model from scratch, you need to download the preprocessed lmdb file and the split file: `crossdocked_v1.1_rmsd1.0_pocket10_processed_w_manifold_data_version.lmdb` and `crossdocked_pocket10_pose_w_manifold_data_split.pt`.
- To evaluate the model on the test set, you need to download and unzip `test_set.zip`. It includes the original PDB files that will be used in Vina docking.
- If you want to process the dataset from scratch, you need to download CrossDocked2020 v1.1 from here, save it into `./data/CrossDocked2020`, and run the scripts in `./crossdock_data_preparation`:
- `clean_crossdocked.py` filters the original dataset and keeps the pairs with RMSD < 1 Å. It generates an `index.pkl` file and creates a new directory containing the filtered data (corresponding to `crossdocked_v1.1_rmsd1.0.tar.gz` in the drive). You don't need these files if you have downloaded the `.lmdb` file.
python ./crossdock_data_preparation/step1_clean_crossdocked.py \
--source "./data/CrossDocked2020" \
--dest "./data/crossdocked_v1.1_rmsd1.0" \
--rmsd_thr 1.0
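To check the filtering result, you can peek at the generated index file. This is a minimal sketch; the location under the `--dest` directory and the record format are assumptions, so only the type and size are reported:

```python
# Peek at the index file produced by the filtering step; its record format is not assumed.
import pickle

index_path = "./data/crossdocked_v1.1_rmsd1.0/index.pkl"  # assumed location under the --dest directory
with open(index_path, "rb") as f:
    index = pickle.load(f)

print("Type:", type(index))
if hasattr(index, "__len__"):
    print("Number of entries:", len(index))
```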
- `extract_pockets.py` clips the original protein file to a 10 Å region around the binding molecule, e.g.:
python ./crossdock_data_preparation/step2_extract_pockets.py \
--source "./data/crossdocked_v1.1_rmsd1.0" \
--dest "./data/crossdocked_v1.1_rmsd1.0_pocket10"-
- `split_pl_dataset.py` splits the training and test sets. We use the same split file `split_by_name.pt` as AR and Pocket2Mol, which can also be downloaded from the Google Drive data folder.
python ./crossdock_data_preparation/step3_split_pl_dataset.py \
--path "./data/crossdocked_v1.1_rmsd1.0_pocket10" \
--dest "./data/crossdocked_pocket10_pose_split.pt" \
--fixed_split "./data/split_by_name.pt"- switch conda virtual environments
To prepare the manifold (surface mesh) data:

- Switch conda virtual environments:

source activate Manifold

- Prepare the input for MSMS:
python step1_convert_npz_to_xyzrn.py \
--crossdock_source [path/to/crossdock_pocket10_auxdata/] \
--out_root "./data/crossdocked_pocket10_mesh"- execute MSMS to generate molecular surface
python step2_compute_msms.py \
--data_root "./data/crossdocked_pocket10_mesh" \
--msms-bin [path/to/MSMS/dir]/msms.x86_64Linux2.2.6.1

- Refine the surface mesh:
python step3_refine_mesh.py \
--data_root "./data/crossdocked_pocket10_mesh"python ./datasets/pl_pair_dataset.py \
--data_root "./data/crossdocked_v1.1_rmsd1.0_pocket10"python train.py \
To train the model:

python train.py \
--lr 0.001 \
--device "cuda:0" \
--wandb_project_name "nucleusdiff_train" \
--loss_mesh_constained_weight 1

Notice: our pretrained models are organized in the nucleusdiff_data_and_checkpoint Google Drive folder.
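If you use a downloaded checkpoint instead of training from scratch, a quick sanity check is to load it on CPU and list its top-level keys. This sketch assumes the checkpoint is saved as `./checkpoints/nucleusdiff_pretrained_model.pt` (the path used in the example further below) and makes no assumption about its internal layout:

```python
# Inspect a downloaded NucleusDiff checkpoint without assuming its internal layout.
import torch

ckpt = torch.load("./checkpoints/nucleusdiff_pretrained_model.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys()))
else:
    print("Loaded object of type:", type(ckpt))
```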
To sample molecules for pockets in the CrossDocked test set:

python sample_for_crossdock.py \
--ckpt_path "./logs_diffusion/nucleusdiff_train" \
--ckpt_it 100000 \
--cuda_device 0 \
--data_id 0

You can also speed up sampling with multiple GPUs, e.g.:
python sample_for_crossdock.py \
--ckpt_path "./logs_diffusion/nucleusdiff_train" \
--ckpt_it 100000 \
--cuda_device 0 \
--data_id 0
python sample_for_crossdock.py \
--ckpt_path "./logs_diffusion/nucleusdiff_train" \
--ckpt_it 100000 \
--cuda_device 1 \
--data_id 1
python sample_for_crossdock.py \
--ckpt_path "./logs_diffusion/nucleusdiff_train" \
--ckpt_it 100000 \
--cuda_device 2 \
--data_id 2
python sample_for_crossdock.py \
--ckpt_path "./logs_diffusion/nucleusdiff_train" \
--ckpt_it 100000 \
--cuda_device 3 \
--data_id 3
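If you prefer to launch the four jobs above programmatically, here is a minimal sketch using subprocess; it assumes four visible GPUs and reuses the same checkpoint and flags as the commands above:

```python
# Launch the four sampling jobs above in parallel, one job per GPU.
import subprocess

procs = []
for gpu_id in range(4):
    cmd = [
        "python", "sample_for_crossdock.py",
        "--ckpt_path", "./logs_diffusion/nucleusdiff_train",
        "--ckpt_it", "100000",
        "--cuda_device", str(gpu_id),
        "--data_id", str(gpu_id),
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```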
--sample_path "./result_output" \
--eval_step -1 \
--protein_root "./data/test_set" \
--docking_mode "vina_dock"python ./evaluation/evaluate_for_crossdock_on_collision_metrics.py \
--sample_path "./result_output" \
--eval_step -1

If you want to process the real-world dataset from scratch, you need to download real_world.zip from nucleusdiff_data_and_checkpoint, save it into ./data, and run the scripts in ./covid_19_data_preparation:
python ./covid_19_data_preparation/extract_pockets_for_real_world.py \
--source "./data/real_world" \
--dest "./real_world_test_extract_pockets"python sample_for_covid_19.py \
--checkpoint [path/to/nucleusdiff/checkpoint] \
--pdb_path "./real_world_test_extract_pockets/CDK2/cdk2_ligand_pocket10.pdb" \
--result_path "./read_world_cdk2_test" \
--sample_num_atoms "real_world_testing" \
--inference_num_atoms 30

To evaluate the generated molecules on the general metrics (with Vina docking):

python ./evaluation/evaluate_for_covid_19_on_general_metrics.py \
--sample_path "./read_world_cdk2_test" \
--protein_root "./real_world/cdk2_processed.pdb" \
--ligand_filename "CDK2" \
--docking_mode "vina_dock"python ./evaluation/evaluate_for_covid_19_on_collision_metrics.py \
--sample_path "./read_world_cdk2_test" \
--model "nucleusdiff_train" \
--target "cdk2_test"Use sample_for_specific_protein.py to generate ligands for an arbitrary single protein pocket PDB.
- Prepare a pocket PDB centered at the binding site (e.g., 10 Å around the ligand or binding residues). You may reuse the script from 4.1: ./covid_19_data_preparation/extract_pockets_for_real_world.py.
- Example pocket file: ./specific_protein/3cl_ligand_pocket10.pdb.
python sample_for_specific_protein.py \
--checkpoint ./checkpoints/nucleusdiff_pretrained_model.pt \
--pdb_path ./specific_protein/3cl_ligand_pocket10.pdb \
--result_path ./results_specific_protein \
--sample_num_atoms real_world_testing \
--inference_num_atoms 30 \
--num_samples 1000 \
--num_steps 1000 \
--device cuda:0

Key arguments:
- `--checkpoint`: path to a NucleusDiff checkpoint (.pt).
- `--pdb_path`: pocket PDB for your target protein.
- `--result_path`: output directory.
- `--sample_num_atoms`: set to `real_world_testing` to use a fixed atom count.
- `--inference_num_atoms`: atoms per generated ligand when using `real_world_testing`.
- `--num_samples`: number of ligands to generate.
- `--num_steps`: diffusion steps (trade-off between quality and speed).
- `--device`: GPU device, e.g., `cuda:0`.
Outputs:

- `${result_path}/sample_{test_time}.pt`: raw tensors and sampling trajectories.
- `${result_path}/sdf/*.sdf`: reconstructed molecules in SDF format.
Run python sample_for_specific_protein.py --help for the complete list of options and defaults.
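To take a quick look at the generated ligands, the SDF outputs can be read back with RDKit. This sketch assumes the `--result_path` used in the example above (`./results_specific_protein`):

```python
# Print SMILES for the reconstructed ligands written to <result_path>/sdf/.
import glob
from rdkit import Chem

for sdf_path in sorted(glob.glob("./results_specific_protein/sdf/*.sdf")):
    for mol in Chem.SDMolSupplier(sdf_path):
        if mol is not None:
            print(sdf_path, Chem.MolToSmiles(mol))
```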
Feel free to cite this work if you find it useful!
@article{liu2025manifold,
title={Manifold-constrained nucleus-level denoising diffusion model for structure-based drug design},
author={Liu, Shengchao and Yan, Liang and Du, Weitao and Liu, Weiyang and Li, Zhuoxinran and Guo, Hongyu and Borgs, Christian and Chayes, Jennifer and Anandkumar, Anima},
journal={Proceedings of the National Academy of Sciences},
volume={122},
number={41},
pages={e2415666122},
year={2025},
publisher={National Academy of Sciences}
}