scVI / CITE-seq — Variational Autoencoders for Single-Cell Data

This repository contains a minimal, self-contained implementation of variational autoencoder models for single-cell data, inspired by scVI and its multimodal extension for CITE-seq (RNA + protein).

The code is written in PyTorch and follows a clear separation between data handling, model definition, probabilistic modeling, training, and evaluation.

Context

This project was developed as part of the course:

Introduction to Probabilistic Graphical Models
Pierre Latouche — Pierre-Alexandre Mattei
MVA (Master Mathématiques, Vision, Apprentissage)

Course webpage:
https://lmbp.uca.fr/~latouche/mva/IntroductiontoProbabilisticGraphicalModelsMVA.html

The implementation aims to connect:

probabilistic graphical modeling concepts (latent variables, ELBO),
variational inference,
deep generative models applied to real biological data.

Models implemented

SCVI
RNA-only variational autoencoder with:
- latent biological variable z
- latent library size variable l
- batch-specific priors
- ZINB likelihood for gene expression
CiteVI
Multimodal extension for CITE-seq data:
- shared latent variable z
- RNA modeled with ZINB
- proteins (ADT) modeled with a negative binomial distribution

Both models are trained by maximizing an evidence lower bound (ELBO).

Repository structure

src/
├── data.py              # dataset definition and data loaders
├── building_blocks.py   # neural network components (encoders / decoders)
├── distributions.py     # KL terms, likelihoods, ELBOs
├── models.py            # SCVI and CiteVI models
├── training.py          # training loops
└── evaluation.py        # diagnostics and visualization

notebooks/
├── scVI.ipynb
└── citeVI.ipynb

data/
├── README.md            # data download instructions
├── cortex_rna.txt       # RNA-only dataset (not versioned)
└── pbmc_citeseq/        # CITE-seq dataset (not versioned)

Raw datasets are not tracked in the repository.

Data

Cortex scRNA-seq (RNA-only): Used as a benchmark dataset for scVI-style models.
PBMC 2019 CITE-seq (RNA + protein): Human PBMC dataset combining transcriptomic and surface protein measurements.

Detailed download instructions are provided in data/README.md.

Usage

Training and evaluation are demonstrated in the notebooks located in the notebooks/ directory.

Typical workflow:

Download the data (see data/README.md)
Run the corresponding notebook
Train the model using the provided training loop
Evaluate the learned latent space via reconstruction metrics and t-SNE

Scope and limitations

This repository is intended for:

educational purposes,
understanding variational inference in practice,
illustrating probabilistic modeling choices for single-cell data.

It is not meant to be a production-ready alternative to scvi-tools.

References

Lopez R. et al., Deep generative modeling for single-cell transcriptomics
Stuart T. et al., Comprehensive Integration of Single-Cell Data, Cell (2019)
Kingma & Welling, Auto-Encoding Variational Bayes

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
notebooks		notebooks
src		src
.gitignore		.gitignore
Intro_to_PGM__Poster___Paysage_.pdf		Intro_to_PGM__Poster___Paysage_.pdf
README.md		README.md
VAE_for_sc_data_report.pdf		VAE_for_sc_data_report.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scVI / CITE-seq — Variational Autoencoders for Single-Cell Data

Context

Models implemented

Repository structure

Data

Usage

Scope and limitations

References

About

Uh oh!

Releases

Packages

Languages

valbad/scvi-citevi-demo

Folders and files

Latest commit

History

Repository files navigation

scVI / CITE-seq — Variational Autoencoders for Single-Cell Data

Context

Models implemented

Repository structure

Data

Usage

Scope and limitations

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages