Code from "Exploring optimal transport-based multi-grained alignments for text-molecule retrieval" (IEEE BIBM 2024)
To train the model, use the following command:
bash train.sh ${CUDA_DEVICE}

The project consists of the following files:
- data/: We use the ChEBI-20 dataset from text2mol for the main experiments. For the training, validation, and test sets, we discard invalid molecules that have no chemical bonds. Additionally, we add CanonicalSMILES and molecule names from PubChem for these three sets.
  - graph_data/: unzip mol_graphs.zip from text2mol
  - token_embedding_dict.npy: from text2mol
  - training.csv: processed by preprocess.py based on training.txt from text2mol
  - val.csv: processed by preprocess.py based on val.txt from text2mol
  - test.csv: processed by preprocess.py based on test.txt from text2mol
  - preprocess.py: run python3 preprocess.py
- allenai_scibert_scivocab_uncased/: SciBERT path.
- config.json
- train.sh
- main.py
- modeling.py
- dataloader.py
- chemutils.py
- utils.py
- requirements.txt
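The filtering step described for data/ (discarding molecules without any chemical bonds) could look roughly like the sketch below. This is not the repository's preprocess.py: the names `drop_bondless`, `rdkit_bond_count`, and the `smiles` column key are hypothetical, and it assumes each record carries a SMILES string. The bond counter is passed in as a function so the filter itself has no dependency on a chemistry toolkit.

```python
from typing import Callable, Dict, Iterable, Iterator


def rdkit_bond_count(smiles: str) -> int:
    """Count bonds using RDKit; treat unparsable SMILES as having 0 bonds.

    RDKit is imported lazily so the rest of this sketch runs without it.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return 0 if mol is None else mol.GetNumBonds()


def drop_bondless(
    rows: Iterable[Dict[str, str]],
    bond_count: Callable[[str], int],
    smiles_key: str = "smiles",  # hypothetical column name
) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose molecule has at least one bond."""
    for row in rows:
        if bond_count(row[smiles_key]) > 0:
            yield row
```

With RDKit installed, `list(drop_bondless(rows, rdkit_bond_count))` would keep a record such as ethanol (`CCO`) and drop a lone ion such as `[Na+]`, since the latter parses to a molecule with zero bonds.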