GitHub - birchgit/SingleCell_type: Code for the Systems Biology: Computational Analysis and Interpretation of High-throughput Data course at HU Berlin.

Project Overview

This project compares deep neural network (DNN) and XGBoost models for classifying cell types from single-cell RNA sequencing (scRNA-seq) data. We evaluate model performance on two datasets of different sizes (50K and 60K cells) under three preprocessing strategies:

Raw features
PCA-reduced features
Randomly undersampled (RUS) data

Workflow Summary

Data Selection
Preprocessing & Data Reduction
Data Splitting
Model Training (DNN, XGBoost)
Evaluation & Metrics
Results & Visualization

Project Workflow

1. Data Selection

Selected two large scRNA-seq datasets (50K+ cells each) from the Cellxgene database.
Chose datasets with a diverse range of cell types and class imbalance to simulate realistic classification challenges.

2. Preprocessing and Data Reduction

Normalization: Applied log normalization and scaling.
Filtering: Removed low-quality cells with:
- High mitochondrial gene expression
- Low gene expression
Annotation: Annotated cell types for supervised learning.
PCA: Principal Component Analysis reduced the feature space from thousands of genes to a lower-dimensional representation that captures most of the variance.
RUS: Random undersampling reduced the number of majority-class samples to address class imbalance while preserving minority cell types.
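
The filtering and normalization steps can be sketched with NumPy as follows. This is a minimal illustration, not the notebooks' actual code: the function names, the thresholds, the `target_sum` of 10,000, and the reading of "low gene expression" as "few detected genes" are all assumptions. Filtering is applied to raw counts before normalization here, which is the usual order but is not stated in this README.

```python
import numpy as np

def qc_filter(counts, is_mito, max_mito_frac=0.2, min_genes=200):
    """Return a boolean mask of cells that pass both QC checks.

    counts  : (n_cells, n_genes) raw count matrix
    is_mito : (n_genes,) boolean mask marking mitochondrial genes
    Both thresholds are illustrative defaults (assumptions).
    """
    total = counts.sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    genes_detected = (counts > 0).sum(axis=1)
    return (mito_frac <= max_mito_frac) & (genes_detected >= min_genes)

def log_normalize(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log(1 + x)."""
    total = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / np.maximum(total, 1) * target_sum)

def scale(x):
    """Standardize each gene to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by 0
    return (x - mean) / std

# Toy data: 6 cells x 5 genes; the last gene plays the mitochondrial role.
counts = np.array([
    [10, 5, 3, 2, 1],
    [8, 0, 4, 6, 2],
    [1, 0, 0, 0, 9],   # mostly mitochondrial reads -> removed
    [3, 0, 0, 0, 0],   # only one gene detected     -> removed
    [7, 2, 5, 1, 1],
    [4, 6, 2, 3, 0],
], dtype=float)
is_mito = np.array([False, False, False, False, True])

keep = qc_filter(counts, is_mito, max_mito_frac=0.5, min_genes=3)
x = scale(log_normalize(counts[keep]))
print(keep, x.shape)
```

Libraries such as Scanpy provide equivalent steps; whether the notebooks use them is not stated here.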

Each preprocessing strategy (Raw, PCA, RUS) was used to create distinct input datasets for model evaluation.
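
As a rough sketch of how the three input versions could be derived from one preprocessed matrix, the example below builds a raw, a PCA, and a RUS variant with NumPy. The helper names, the number of components, and the choice to undersample every class down to the smallest class size are assumptions made for illustration; the project may use a different sampling ratio or library.

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project x onto its top principal components using an SVD."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def random_undersample(x, y, rng):
    """Downsample every class to the size of the smallest class (assumption)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # restore the original cell order
    return x[keep], y[keep]

# Toy, imbalanced data: 100 cells x 20 genes, three cell types.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))
y = np.array(["T cell"] * 70 + ["B cell"] * 20 + ["NK cell"] * 10)

variants = {
    "raw": (x, y),
    "pca": (pca_reduce(x, n_components=5), y),
    "rus": random_undersample(x, y, rng),
}
for name, (features, labels) in variants.items():
    print(name, features.shape, len(labels))
```

Note that PCA changes the number of columns (features) while RUS changes the number of rows (cells), so the two strategies address different problems.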

3. Data Splitting

Split ratio: 80% training / 20% testing.
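
A minimal sketch of an 80/20 split using only the standard library is shown below. It stratifies by cell type so that rare classes appear in both partitions; stratification, the fixed seed, and the function name are assumptions, since this README only states the ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with about test_fraction of each class in test."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # At least one test sample per class; a class with a single cell
        # therefore ends up only in the test set.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["T cell"] * 70 + ["B cell"] * 20 + ["NK cell"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```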

4. Model Training

Trained two types of models:
- Deep Neural Networks (DNN)
- XGBoost classifiers
Each model was trained on all three dataset versions (raw, PCA, RUS).

5. Evaluation

Evaluated using:
- Accuracy
- F1 Score
- Confusion Matrix
Benchmarked computational performance for each model-dataset combination.
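
The three metrics listed above can be computed from predictions as in the following sketch. It uses plain Python for clarity; the function names are illustrative, and macro averaging of the F1 score is an assumption because this README does not say which averaging was used.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, imbalanced example: the single NK cell is misclassified.
classes = ["B cell", "NK cell", "T cell"]
y_true = ["T cell"] * 6 + ["B cell"] * 3 + ["NK cell"] * 1
y_pred = ["T cell"] * 5 + ["B cell"] * 1 + ["B cell"] * 3 + ["T cell"] * 1

cm = confusion_matrix(y_true, y_pred, classes)
acc = accuracy(cm)
f1 = macro_f1(cm)
print(cm, acc, round(f1, 3))
```

In this toy case accuracy is 0.8 while macro F1 is about 0.56, which shows why accuracy alone can hide poor performance on rare cell types.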

6. Results & Visualization

Performance metrics were visualized using matplotlib.
Appendix figures demonstrate model scalability and resource consumption.
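
Resource measurements of the kind summarized in such figures could be collected with a small helper like the one below. This is only a sketch: the `benchmark` and `dummy_training` names are hypothetical, and the README does not state which quantities were actually recorded. `tracemalloc` tracks Python-level allocations only, so memory used by native libraries or a GPU would need a different tool.

```python
import time
import tracemalloc

def benchmark(fn, *args, **kwargs):
    """Run fn once; return (result, wall-clock seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak

def dummy_training(n):
    """Stand-in for a model-fitting call (hypothetical workload)."""
    return sum(i * i for i in range(n))

result, seconds, peak_bytes = benchmark(dummy_training, 100_000)
print(f"{seconds:.4f} s, peak {peak_bytes} bytes")
```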

Repository Structure

.
├── README.md
├── SC_DNN_50K_pca.ipynb
├── SC_DNN_50K_raw.ipynb
├── SC_DNN_50K_rus.ipynb
├── SC_DNN_60K_pca.ipynb
├── SC_DNN_60K_raw.ipynb
├── SC_DNN_60K_rus.ipynb
├── SC_XGBoost.ipynb
└── appendix_figure_A_1.png

