Skip to content

Code for the Systems Biology: Computational Analysis and Interpretation of High-throughput Data course at HU Berlin.

Notifications You must be signed in to change notification settings

birchgit/SingleCell_type

Repository files navigation

Project Overview

This project compares deep learning (DNN) and XGBoost models for classifying cell types using single-cell RNA sequencing (scRNA-seq) data. We evaluate model performance on datasets of varying sizes (50K, 60K cells), using three preprocessing strategies:

  • Raw features
  • PCA-reduced features
  • Randomly undersampled (RUS) features

Workflow Summary

  1. Data Selection
  2. Preprocessing & Data Reduction
  3. Data Splitting
  4. Model Training (DNN, XGBoost)
  5. Evaluation & Metrics
  6. Results & Visualization

Project Workflow

1. Data Selection

  • Selected two large scRNA-seq datasets (50K+ cells each) from the Cellxgene database.
  • Chose datasets with a diverse range of cell types and class imbalance to simulate realistic classification challenges.

2. Preprocessing and Data Reduction

  • Normalization: Applied log normalization and scaling.
  • Filtering: Removed low-quality cells with:
    • High mitochondrial gene expression
    • Low gene expression
  • Annotation: Annotated cell types for supervised learning.
  • PCA (Principal Component Analysis) was used to reduce the number of features from thousands of genes to a lower-dimensional representation, capturing the most variance.
  • Random Undersampling (RUS) was applied to reduce the number of majority class samples, addressing class imbalance while preserving minority cell types.

Each preprocessing strategy (Raw, PCA, RUS) was used to create distinct input datasets for model evaluation.

3. Data Splitting

  • Split ratio: 80% training / 20% testing.

4. Model Training

  • Trained two types of models:
    • Deep Neural Networks (DNN)
    • XGBoost classifiers
  • Each model was trained on all 3 dataset versions (raw, PCA, RUS).

5. Evaluation

  • Evaluated using:
    • Accuracy
    • F1 Score
    • Confusion Matrix
  • Benchmarked computational performance for each model-dataset combination.

6. Results & Visualization

  • Performance metrics visualized using matplotlib.
  • Appendix figures demonstrate model scalability and resource consumption.

Repository Structure

.
├── notebooks/
│   ├── DNN_50K_Raw.ipynb
│   ├── DNN_50K_PCA.ipynb
│   ├── DNN_50K_RUS.ipynb
│   ├── DNN_60K_Raw.ipynb
│   ├── DNN_60K_PCA.ipynb
│   ├── DNN_60K_RUS.ipynb
│   └── XGBoost.ipynb
├── figures/
│   └── appendix_figure_A_1.png
├── README.md

About

Code for the Systems Biology: Computational Analysis and Interpretation of High-throughput Data course at HU Berlin.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •