This repository implements a benchmark of augmentations for single-cell RNA-seq data with contrastive learning.
We evaluate augmentations across three common model architectures.
All three architectures share the same encoder; they differ in details such as the following (a rough sketch follows the list):
- the employed loss function
- usage of a memory bank
- implementation of nearest-neighbor embeddings
- usage of a projector
- usage of a predictor
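The sketch below illustrates how a shared encoder can be combined with optional projector and predictor heads. The layer sizes, class names, and head configurations are assumptions for illustration only and are not taken from this repository's model folder.

```python
# Illustrative only: layer sizes, class names, and head configurations are
# assumptions, not the exact modules defined in the model folder.
import torch
from torch import nn


class SharedEncoder(nn.Module):
    """MLP encoder shared by all three architectures."""

    def __init__(self, n_genes: int = 2000, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ContrastiveModel(nn.Module):
    """Shared encoder plus optional projector/predictor heads."""

    def __init__(self, encoder: nn.Module, latent_dim: int = 64,
                 use_projector: bool = True, use_predictor: bool = False):
        super().__init__()
        self.encoder = encoder
        self.projector = (
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            if use_projector
            else nn.Identity()
        )
        self.predictor = (
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            if use_predictor
            else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.predictor(self.projector(self.encoder(x)))


# A SimCLR-style variant would use only the projector, while a BYOL/SimSiam-style
# variant adds the predictor; memory banks and nearest-neighbor lookups would
# operate on the embeddings produced here.
model = ContrastiveModel(SharedEncoder(), use_projector=True, use_predictor=True)
print(model(torch.randn(8, 2000)).shape)  # torch.Size([8, 64])
```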
The goal of this project is to encourage further research on contrastive self-supervised learning for cell representation learning. Our work shows that current methods can correct for batch effects, improving performance on downstream tasks.
This work is a first step towards a wider application of contrastive learning (CL) to cell representation learning. Due to computational constraints, some parameters and architectural choices were not explored, and the presented models have many parameters (and hyperparameters) that could still be improved upon; we leave this to future work.
To install a conda / miniconda / mamba environment for reproducibility, run
conda create --name <env> --file requirements.txt
We use hydra to schedule experiments (see the conf folder) and lightly to define the neural networks (see the model folder). Model training is performed with pytorch-lightning using the Adam optimizer and a constant learning rate of 1e-4.
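As a rough sketch of that training setup (Adam, constant learning rate of 1e-4), the following pytorch-lightning module illustrates the idea; the class and attribute names are illustrative and not those used in the model folder.

```python
# Minimal pytorch-lightning sketch of the training setup described above
# (Adam optimizer, constant learning rate 1e-4); names are illustrative.
import pytorch_lightning as pl
import torch


class ContrastiveModule(pl.LightningModule):
    def __init__(self, backbone: torch.nn.Module, criterion: torch.nn.Module):
        super().__init__()
        self.backbone = backbone
        self.criterion = criterion

    def training_step(self, batch, batch_idx):
        # Each batch is assumed to contain two augmented views of the same cells.
        view_a, view_b = batch
        loss = self.criterion(self.backbone(view_a), self.backbone(view_b))
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Constant learning rate, no scheduler.
        return torch.optim.Adam(self.parameters(), lr=1e-4)
```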
To schedule experiments from the conf folder, set data_path in the corresponding file of the conf/data directory. The models and augmentations, as well as the dataset, are defined in the experiment yaml file.
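For illustration, a composed configuration might look roughly like the following when built with OmegaConf (hydra's configuration backend); all keys and values shown are hypothetical and should be checked against the yaml files in conf.

```python
# Hypothetical illustration of a composed experiment configuration; the actual
# keys are defined by the yaml files in conf/ and conf/data/ and may differ.
from omegaconf import OmegaConf

config = OmegaConf.create({
    "data": {"data_path": "/path/to/dataset.h5ad"},  # set in the conf/data file
    "model": {"name": "simclr", "latent_dim": 64},   # illustrative model entry
    "augmentation": {                                # augmentations for each view
        "mask": {"apply_prob": 0.5},
        "gauss_noise": {"apply_prob": 0.5},
    },
})
print(OmegaConf.to_yaml(config))
```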
To train the model(s), run
python main.py --multirun +experiment=<experiment_name>
To schedule multiple runs with slurm, use
python main.py --multirun +experiment=<experiment_name> +cluster=slurm
This work evaluates various single-cell augmentations. To use the augmentations in another project:
from main import load_data
train_dataset, val_dataset, adata = load_data(config)
where config must be a dictionary (e.g. loaded from a .yaml file) with the augmentation settings stored under config["augmentation"].
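A hedged usage sketch is shown below; the augmentation names, parameters, and additional keys are hypothetical, and the entries load_data actually expects are defined by the files in conf.

```python
# Hypothetical config; consult conf/ for the entries load_data actually expects.
from torch.utils.data import DataLoader

from main import load_data

config = {
    "augmentation": {
        "mask": {"apply_prob": 0.5},         # hypothetical augmentation entry
        "gauss_noise": {"apply_prob": 0.5},  # hypothetical augmentation entry
    },
    # plus any dataset-related entries required by load_data (e.g. the data path)
}

train_dataset, val_dataset, adata = load_data(config)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
```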