NucleusDiff: Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design
This repository is a copy of the official implementation of Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design (arXiv 2024).
Authors: Shengchao Liu*, Divin Yan*, Weitao Du, Weiyang Liu, Zhuoxinran Li, Hongyu Guo, Christian Borgs*, Jennifer Chayes*, Anima Anandkumar*
The code has been tested in the following environment:
Package | Version |
---|---|
Python | 3.8.13 |
PyTorch | 1.12.1 |
CUDA | 11.0 |
PyTorch Geometric | 2.5.2 |
RDKit | 2021.03.1b1 |
Install via Conda and Pip:
conda create -n "nucleusdiff" python=3.8.13
source activate nucleusdiff
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install torch_geometric
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/pyg_lib-0.3.1%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_cluster-1.6.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_scatter-2.1.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_sparse-0.6.16%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_spline_conv-1.2.1%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
conda install rdkit/label/nightly::rdkit
conda install openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge
pip install wandb
pip install pytorch-lightning==2.1.3
pip install matplotlib
pip install numpy==1.23
pip install accelerate
pip install transformers
# For Vina Docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
The code should work with PyTorch >= 1.9.0 and PyG >= 2.0; you can adjust the package versions to fit your setup.
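Optionally, you can verify the core dependencies before moving on. The following is a minimal sanity-check script (save it, e.g., as check_env.py; the file name is arbitrary) that only imports the packages installed above and prints their versions:

# check_env.py -- optional sanity check for the nucleusdiff environment
import torch
import torch_geometric
from rdkit import rdBase
from vina import Vina  # AutoDock Vina Python bindings (vina==1.2.2)

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("PyTorch Geometric:", torch_geometric.__version__)
print("RDKit:", rdBase.rdkitVersion)
print("Vina bindings imported:", Vina is not None)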
1.2 Dependencies for preprocessing the CrossDocked manifold data (this separate environment is only needed if you want to process the manifold dataset from scratch)
# We recommend using conda for environment management
conda create -n Manifold python=3.7.3
conda activate Manifold
pip install -r ./crossdock_manifold_data_preparation/requirements.txt
# install PyMesh for surface mesh processing
PYMESH_PATH="$HOME/PyMesh" # substitute with your own PyMesh path
git clone https://github.com/PyMesh/PyMesh.git $PYMESH_PATH
cd $PYMESH_PATH
git submodule update --init
apt-get update
# make sure you have these libraries installed before building PyMesh
apt-get install cmake libgmp-dev libmpfr-dev libgmpxx4ldbl libboost-dev libboost-thread-dev libopenmpi-dev
cd $PYMESH_PATH/third_party
python build.py all # build third party dependencies
cd $PYMESH_PATH
mkdir build
cd build
cmake ..
make -j # if the build fails, check for missing third-party dependencies
cd $PYMESH_PATH
python setup.py install
python -c "import pymesh; pymesh.test()"
# install meshplot
conda install -c conda-forge meshplot
# install libigl
conda install -c conda-forge igl
# download MSMS
MSMS_PATH="$HOME/MSMS" # substitute with your own MSMS path
wget https://ccsb.scripps.edu/msms/download/933/ -O msms_i86_64Linux2_2.6.1.tar.gz
mkdir -p $MSMS_PATH # mark this directory as your $MSMS_bin for later use
tar zxvf msms_i86_64Linux2_2.6.1.tar.gz -C $MSMS_PATH
# install PyTorch 1.10.0 (e.g., with CUDA 11.3)
conda install pytorch==1.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
# install Manifold
pip install -e .
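Beyond pymesh.test(), you can confirm that the PyMesh build is usable from Python with a tiny hand-built mesh. This is an optional sketch; the tetrahedron is an arbitrary example and has nothing to do with the molecular surfaces used later:

# pymesh_smoke_test.py -- optional check of the PyMesh installation
import numpy as np
import pymesh

# a single tetrahedron: 4 vertices, 4 triangular faces
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2],
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])

mesh = pymesh.form_mesh(vertices, faces)
mesh.add_attribute("face_area")            # built-in per-face attribute
areas = mesh.get_attribute("face_area")

print("vertices:", mesh.num_vertices, "faces:", mesh.num_faces)
print("total surface area:", float(areas.sum()))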
- The data used for training / evaluating the model are organized in the nucleusdiff_data_and_checkpoint Google Drive folder.
- To train the model from scratch, you need to download the preprocessed lmdb file and the split file: crossdocked_v1.1_rmsd1.0_pocket10_processed_w_manifold_data_version.lmdb and crossdocked_pocket10_pose_w_manifold_data_split.pt.
- To evaluate the model on the test set, you need to download and unzip test_set.zip. It includes the original PDB files that will be used in Vina docking.
- If you want to process the dataset from scratch, you need to download CrossDocked2020 v1.1 from here, save it into ./data/CrossDocked2020, and run the scripts in ./crossdock_data_preparation:
- clean_crossdocked.py will filter the original dataset and keep the poses with RMSD < 1 Å. It will generate an index.pkl file and create a new directory containing the original filtered data (corresponding to crossdocked_v1.1_rmsd1.0.tar.gz in the drive). You don't need these files if you have already downloaded the .lmdb file.
python ./crossdock_data_preparation/step1_clean_crossdocked.py --source "./data/CrossDocked2020" --dest "./data/crossdocked_v1.1_rmsd1.0" --rmsd_thr 1.0
- extract_pockets.py will clip each original protein file to a 10 Å region around the binding molecule (a minimal standalone sketch of this pocket-clipping idea is shown right after these preparation steps). E.g.
python ./crossdock_data_preparation/step2_extract_pockets.py --source "./data/crossdocked_v1.1_rmsd1.0" --dest "./data/crossdocked_v1.1_rmsd1.0_pocket10"
- split_pl_dataset.py will split the training and test sets. We use the same split split_by_name.pt as AR and Pocket2Mol, which can also be downloaded from the Google Drive data folder.
python ./crossdock_data_preparation/step3_split_pl_dataset.py --path "./data/crossdocked_v1.1_rmsd1.0_pocket10" --dest "./data/crossdocked_pocket10_pose_split.pt" --fixed_split "./data/split_by_name.pt"
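For reference, the pocket-clipping idea behind extract_pockets.py boils down to keeping every protein residue that has at least one atom within 10 Å of any ligand atom. The sketch below is not the repository's implementation: it uses Biopython (an extra dependency) and placeholder file names purely to illustrate the idea.

# pocket_clip_sketch.py -- illustration only; not the repo's extract_pockets.py
# assumes Biopython is installed; PROTEIN_PDB and LIGAND_SDF are placeholder paths
from Bio.PDB import PDBParser, PDBIO, Select, NeighborSearch
from rdkit import Chem

PROTEIN_PDB = "protein.pdb"   # placeholder
LIGAND_SDF = "ligand.sdf"     # placeholder
RADIUS = 10.0                 # pocket radius in angstroms

# ligand atom coordinates from the SDF
ligand = next(iter(Chem.SDMolSupplier(LIGAND_SDF, removeHs=False)))
ligand_coords = ligand.GetConformer().GetPositions()

# index all protein atoms for fast radius queries
structure = PDBParser(QUIET=True).get_structure("protein", PROTEIN_PDB)
search = NeighborSearch(list(structure.get_atoms()))

# collect residues with any atom within RADIUS of any ligand atom
pocket_residues = set()
for xyz in ligand_coords:
    for atom in search.search(xyz, RADIUS):
        pocket_residues.add(atom.get_parent())

class PocketSelect(Select):
    def accept_residue(self, residue):
        return residue in pocket_residues

io = PDBIO()
io.set_structure(structure)
io.save("pocket10.pdb", select=PocketSelect())
print("kept", len(pocket_residues), "residues")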
- switch to the Manifold conda environment
source activate Manifold
- prepare input for MSMS
python step1_convert_npz_to_xyzrn.py --crossdock_source [path/to/crossdock_pocket10_auxdata/] --out_root "./data/crossdocked_pocket10_mesh"
- execute MSMS to generate molecular surface
python step2_compute_msms.py --data_root "./data/crossdocked_pocket10_mesh" --msms-bin [path/to/MSMS/dir]/msms.x86_64Linux2.2.6.1
- refine surface mesh
python step3_refine_mesh.py --data_root "./data/crossdocked_pocket10_mesh"
- finally, run pl_pair_dataset.py to build the processed dataset used for training:
python ./datasets/pl_pair_dataset.py --data_root "./data/crossdocked_v1.1_rmsd1.0_pocket10"
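Before training, you can optionally sanity-check the processed files. The snippet below is a sketch under two assumptions: the split file is a torch-serialized mapping from split names to index lists, and the dataset is a single-file LMDB database (the key layout inside the LMDB is repository-specific, so only entry counts are printed). The paths assume both files sit under ./data.

# inspect_processed_data.py -- optional sketch; file layout assumptions noted above
import lmdb
import torch

LMDB_PATH = "./data/crossdocked_v1.1_rmsd1.0_pocket10_processed_w_manifold_data_version.lmdb"
SPLIT_PATH = "./data/crossdocked_pocket10_pose_w_manifold_data_split.pt"

# count entries in the processed LMDB (read-only, no locking)
# subdir=False assumes a single-file LMDB; drop it if the path is a directory
env = lmdb.open(LMDB_PATH, readonly=True, lock=False, subdir=False)
with env.begin() as txn:
    print("lmdb entries:", txn.stat()["entries"])
env.close()

# print the size of each split (assumed dict of split name -> indices)
split = torch.load(SPLIT_PATH)
for name, indices in split.items():
    print(name, len(indices))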
Train NucleusDiff from scratch (run this in the nucleusdiff environment):
python train.py --lr 0.001 --device "cuda:0" --wandb_project_name "nucleusdiff_train" --loss_mesh_constained_weight 1
Notice: our pretrained models are organized in the nucleusdiff_data_and_checkpoint Google Drive folder.
Sample molecules for the CrossDocked test set with a trained checkpoint:
python sample_for_crossdock.py --ckpt_path "./logs_diffusion/nucleusdiff_train" --ckpt_it 100000 --cuda_device 0 --data_id 0
You can also speed up sampling with multiple GPUs, e.g.:
python sample_for_crossdock.py --ckpt_path "./logs_diffusion/nucleusdiff_train" --ckpt_it 100000 --cuda_device 0 --data_id 0
python sample_for_crossdock.py --ckpt_path "./logs_diffusion/nucleusdiff_train" --ckpt_it 100000 --cuda_device 1 --data_id 1
python sample_for_crossdock.py --ckpt_path "./logs_diffusion/nucleusdiff_train" --ckpt_it 100000 --cuda_device 2 --data_id 2
python sample_for_crossdock.py --ckpt_path "./logs_diffusion/nucleusdiff_train" --ckpt_it 100000 --cuda_device 3 --data_id 3
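If you prefer to script the multi-GPU launch rather than typing one command per GPU, a small launcher is enough. This sketch only re-issues the sample_for_crossdock.py command shown above, running NUM_GPUS jobs at a time with one GPU per job; extend DATA_IDS to however many test targets you want to sample.

# launch_sampling.py -- sketch of a round-robin multi-GPU sampling launcher
import subprocess

NUM_GPUS = 4
DATA_IDS = list(range(4))                 # extend as needed
CKPT_PATH = "./logs_diffusion/nucleusdiff_train"

for start in range(0, len(DATA_IDS), NUM_GPUS):
    wave = DATA_IDS[start:start + NUM_GPUS]
    procs = []
    for gpu, data_id in enumerate(wave):
        cmd = ["python", "sample_for_crossdock.py",
               "--ckpt_path", CKPT_PATH,
               "--ckpt_it", "100000",
               "--cuda_device", str(gpu),
               "--data_id", str(data_id)]
        procs.append(subprocess.Popen(cmd))
    # wait for this wave to finish before launching the next one
    for p in procs:
        p.wait()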
Evaluate the sampled molecules, with Vina docking against the original test-set proteins:
python ./evaluation/evaluate_for_crossdock_on_collision_metrics.py --sample_path "./result_output" --eval_step -1 --protein_root "./data/test_set" --docking_mode "vina_dock"
or without docking:
python ./evaluation/evaluate_for_crossdock_on_collision_metrics.py --sample_path "./result_output" --eval_step -1
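Independently of the evaluation scripts above, basic RDKit-only properties of the generated molecules can be computed directly. The sketch below assumes the sampled molecules have been exported as SDF files under ./result_output; that path pattern is hypothetical, so adapt it to your actual result layout.

# quick_rdkit_metrics.py -- sketch; the SDF layout under ./result_output is an assumption
import glob
from rdkit import Chem
from rdkit.Chem import QED, Crippen, Descriptors

for sdf_path in glob.glob("./result_output/**/*.sdf", recursive=True):
    for mol in Chem.SDMolSupplier(sdf_path):
        if mol is None:                   # skip molecules RDKit cannot sanitize
            continue
        print(sdf_path,
              "QED=%.3f" % QED.qed(mol),
              "logP=%.2f" % Crippen.MolLogP(mol),
              "MW=%.1f" % Descriptors.MolWt(mol))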
If you want to process the real-world (COVID-19) dataset from scratch, you need to download real_world.zip from nucleusdiff_data_and_checkpoint, save it into ./data, and run the scripts in ./covid_19_data_preparation:
- extract the binding pockets for the real-world targets:
python ./covid_19_data_preparation/extract_pockets_for_real_world.py --source "./data/real_world" --dest "./real_world_test_extract_pockets"
- sample molecules for a chosen pocket (CDK2 in this example) with a trained NucleusDiff checkpoint:
python sample_for_covid_19.py --checkpoint [path/to/nucleusdiff/checkpoint] --pdb_path "./real_world_test_extract_pockets/CDK2/cdk2_ligand_pocket10.pdb" --result_path "./read_world_cdk2_test" --sample_num_atoms "real_world_testing" --inference_num_atoms 30
- evaluate the generated molecules with Vina docking and on the collision metrics:
python ./evaluation/evaluate_for_covid_19_on_general_metrics.py --sample_path "./read_world_cdk2_test" --protein_root "./real_world/cdk2_processed.pdb" --ligand_filename "CDK2" --docking_mode "vina_dock"
python ./evaluation/evaluate_for_covid_19_on_collision_metrics.py --sample_path "./read_world_cdk2_test" --model "nucleusdiff_train" --target "cdk2_test"
Feel free to cite this work if you find it useful!
@article{liu2024nucleusdiff,
title={Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design},
author={Shengchao Liu and Divin Yan and Weitao Du and Weiyang Liu and Zhuoxinran Li and Hongyu Guo and Christian Borgs and Jennifer Chayes and Anima Anandkumar},
journal={arXiv preprint},
year={2024}
}