Dataset and Evaluation Preparation for ChatDrug

First, create the data folder and change into it:

mkdir -p data
cd data

Then follow the steps below to prepare the datasets and evaluation files.

Small Molecule Editing

  • For the small molecule editing dataset, please check small_molecule_editing.txt (credit to the MoleculeSTM paper).
  • For the retrieval database, please use the ZINC250K dataset from here.

Peptide Editing

  • Both the editing and retrieval datasets can be found in this repo.
  • Most of the data is already provided in the peptide folder; you only need to download Data_S3.csv from this link.
  • To run the data preprocessing yourself, use the following commands:
cd peptide
python preprocess_step_1_data_extraction.py
python preprocess_step_2_single_prop.py
python preprocess_step_3_multi_prop.py
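
The three preprocessing commands above can also be driven from a small Python wrapper that runs them in the listed order and stops at the first failure. This is only a minimal sketch: the helper names `build_commands` and `run_pipeline` are hypothetical and not part of ChatDrug, and it assumes the scripts take no command-line arguments, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the commands above; everything else is illustrative.
PREPROCESS_SCRIPTS = [
    "preprocess_step_1_data_extraction.py",
    "preprocess_step_2_single_prop.py",
    "preprocess_step_3_multi_prop.py",
]


def build_commands(python_exe=sys.executable):
    """Return one command (as an argument list) per preprocessing script, in order."""
    return [[python_exe, script] for script in PREPROCESS_SCRIPTS]


def run_pipeline(peptide_dir="peptide", runner=subprocess.run):
    """Run each script inside peptide_dir; check=True raises on the first failure."""
    workdir = Path(peptide_dir)
    for command in build_commands():
        print("Running:", " ".join(command))
        runner(command, cwd=workdir, check=True)
```

Calling `run_pipeline()` from inside the data folder is then equivalent to the `cd peptide` and `python ...` commands above. The `runner` parameter exists only so the function can be exercised without actually launching the scripts.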

Protein Editing

  • Download the dataset from this Google Drive.
  • Unzip it into the protein folder.
  • The archive includes both the editing and retrieval datasets.
  • For evaluation, please download pytorch_model_ss3.bin from this link. Credit to ProteinDT.
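
The unzip step above can be done with any archive tool. As a rough sketch, it could look like this in Python using the standard `zipfile` module. The function name `extract_archive` is hypothetical, and the name of the downloaded archive is not given here, so the path must be supplied by the caller.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="protein"):
    """Extract a zip archive into target_dir and return the archive's member names."""
    target = Path(target_dir)
    # Create the target folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

If the downloaded archive already contains a top-level protein folder (not stated here, so worth checking the returned member names), extract into the data folder itself by passing `target_dir="."` to avoid a nested protein/protein directory.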
After completing the steps above, the data folder should have the following structure:

.
├── peptide
│   ├── class1_pseudosequences.csv
│   ├── Data_S3.csv
│   ├── models_class1_presentation
│   │   ├── 10755300.stderr
│   │   .
│   │   .
│   │   .
│   │   └── train_data.csv.bz2
│   ├── peptide_editing.json
│   ├── peptide_editing_threshold.json
│   ├── preprocess_step_1_data_extraction.py
│   ├── preprocess_step_2_single_prop.py
│   ├── preprocess_step_3_multi_prop.py
│   └── selected_alleles.txt
├── protein
│   ├── downstream_datasets
│   │   └── secondary_structure
│   │       ├── secondary_structure_casp12.lmdb
│   │       │   ├── data.mdb
│   │       │   └── lock.mdb
│   │       ├── secondary_structure_cb513.lmdb
│   │       │   ├── data.mdb
│   │       │   └── lock.mdb
│   │       ├── secondary_structure_train.lmdb
│   │       │   ├── data.mdb
│   │       │   └── lock.mdb
│   │       ├── secondary_structure_ts115.lmdb
│   │       │   ├── data.mdb
│   │       │   └── lock.mdb
│   │       └── secondary_structure_valid.lmdb
│   │           ├── data.mdb
│   │           └── lock.mdb
│   ├── pytorch_model_ss3.bin
│   └── pytorch_model_ss8.bin
├── README.md
└── small_molecule
    ├── 250k_rndm_zinc_drugs_clean_3.csv
    └── small_molecule_editing.txt
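
To confirm that the downloads landed in the right place, the layout can be checked against the tree above. The following is a minimal sketch, not part of ChatDrug: the helper `find_missing` is hypothetical, and `EXPECTED_PATHS` covers only a representative subset of the files shown in the tree.

```python
from pathlib import Path

# Relative paths copied from the tree above. The contents of
# models_class1_presentation and the other .lmdb folders are omitted for brevity.
EXPECTED_PATHS = [
    "peptide/Data_S3.csv",
    "peptide/peptide_editing.json",
    "protein/downstream_datasets/secondary_structure/secondary_structure_train.lmdb/data.mdb",
    "protein/pytorch_model_ss3.bin",
    "small_molecule/250k_rndm_zinc_drugs_clean_3.csv",
    "small_molecule/small_molecule_editing.txt",
]


def find_missing(data_root=".", expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Running `print(find_missing("."))` from inside the data folder prints an empty list when every listed file is present, and otherwise lists the paths that still need to be downloaded or generated.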