This project aims to predict a molecule's biological activity, specifically its Mechanism of Action (MoA), using various machine learning algorithms. The MoA of a drug refers to the biological process by which it produces its therapeutic effects. This work is based on a dataset provided by the Connectivity Map, a project of the Broad Institute of MIT and Harvard, in partnership with the Laboratory for Innovation Science at Harvard (LISH) and the NIH.
The dataset, sourced from the Laboratory for Innovation Science at Harvard via a Kaggle competition, combines gene expression and cell viability data. It provides insights into the activity of genes and the responses of cells to various drugs across 100 different cell types.
- 23,814 rows and 875 features
- Discrete attributes: 'cp_type', 'cp_time', 'cp_dose'
We approached this as a multi-label classification problem, implementing and comparing several machine learning models:
- Naive Bayes (Baseline)
- Logistic Regression
- Support Vector Machines (SVM)
- Decision Trees
- Random Forest
- Artificial Neural Networks (ANN)
- K-Nearest Neighbors (KNN)
- Convolutional Neural Networks (CNN)
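
As a rough illustration of the multi-label framing, the sketch below fits one of the listed models (Logistic Regression) with one binary classifier per label. It is not the project's actual training code: the data is synthetic, and the use of scikit-learn, the `OneVsRestClassifier` wrapper, and all sizes and parameters are assumptions made for the example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real data (assumption): 500 samples,
# 20 numeric features, 5 binary labels per sample.
X, Y = make_multilabel_classification(
    n_samples=500, n_features=20, n_classes=5, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Multi-label classification as independent binary problems:
# one logistic regression is trained per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

# One probability per (sample, label) pair, which is the form that
# a log loss evaluation expects.
proba = clf.predict_proba(X_test)
print(proba.shape)
```

Any of the other listed models that expose probability estimates could be swapped in for `LogisticRegression` in the same wrapper.
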

Data preprocessing included:
- One-hot encoding for discrete attributes
- Feature scaling and standardization
- Principal Component Analysis (PCA) for dimensionality reduction
- Sampling of the dataset to improve computational efficiency
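
The preprocessing steps above can be sketched as a single scikit-learn pipeline. This is a minimal example under stated assumptions, not the project's own code: the toy values, the `g-` column names, the 50% sampling fraction, and the choice of 5 principal components are all invented for illustration. Only the three discrete column names come from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n_rows = 200

# Toy frame (assumption): three discrete attributes plus ten numeric columns.
discrete = pd.DataFrame({
    "cp_type": rng.choice(["trt_cp", "ctl_vehicle"], size=n_rows),
    "cp_time": rng.choice([24, 48, 72], size=n_rows),
    "cp_dose": rng.choice(["D1", "D2"], size=n_rows),
})
numeric = pd.DataFrame(
    rng.normal(size=(n_rows, 10)),
    columns=[f"g-{i}" for i in range(10)],
)
df = pd.concat([discrete, numeric], axis=1)

# Sampling: work on a random subset to reduce computation.
sample = df.sample(frac=0.5, random_state=0)

discrete_cols = ["cp_type", "cp_time", "cp_dose"]
numeric_cols = [c for c in df.columns if c not in discrete_cols]

# Numeric columns are standardized first, then reduced with PCA.
numeric_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
])

# Discrete columns are one-hot encoded; sparse_threshold=0.0 forces
# a dense array as the combined output.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), discrete_cols),
        ("numeric", numeric_pipeline, numeric_cols),
    ],
    sparse_threshold=0.0,
)

X = preprocess.fit_transform(sample)
print(X.shape)
```

Scaling before PCA matters because PCA is sensitive to feature variance: unscaled features with large ranges would otherwise dominate the leading components.
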
Model performance was evaluated using log loss (lower is better). Key findings:
- CNN demonstrated the best performance on unseen data with a testing loss of 0.01657
- Logistic Regression, SVM, and ANN showed strong performance
- Random Forest outperformed individual Decision Trees
- Naive Bayes, Decision Trees, and KNN served as baseline models with comparatively poor results
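
For reference, log loss for a multi-label problem can be computed as in the sketch below. It assumes the loss is averaged over every (sample, label) pair and that predictions are clipped away from 0 and 1. The source does not state the exact averaging used, and the function name `mean_columnwise_log_loss` is hypothetical.

```python
import numpy as np


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over all samples and labels.

    y_true: array of 0/1 labels, shape (n_samples, n_labels).
    y_pred: predicted probabilities with the same shape.
    eps: clipping bound that keeps log() finite (assumed value).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(losses.mean())


# Tiny made-up example: two samples, two labels.
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.8]]
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 5))
```

Confident wrong predictions are penalized heavily under this metric, so well-calibrated probabilities matter as much as correct rankings.
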
This study contributes novel algorithms for MoA prediction and provides a clear rationale for parameter selections, enhancing the interpretability and applicability of the proposed models.

Directions for future work:
- Implement additional deep learning models
- Further optimize hyperparameters
- Explore advanced feature engineering techniques