bigbio/hvantk

hvantk

Hail-based toolkit for multiomics variant annotation and analysis.

hvantk is a modular toolkit that uses Hail to annotate and analyze variants, genes, proteins, and expression data from heterogeneous omics sources. The library enables multiomics integration to improve the interpretation of genetic variants.

Core Capabilities:

  • Variant annotations (ClinVar, dbNSFP, gnomAD, CCR scores)
  • Gene annotations (Ensembl, GeVIR, gene constraints)
  • Protein annotations (INSIDER protein-protein interactions)
  • Expression data (bulk & single-cell RNA-seq from UCSC, GTEx)
  • Joint genotyping workflows (GVCF combining, QC, format conversion)
  • Recipe-based batch processing

Installation

Using Poetry

git clone https://github.com/bigbio/hvantk
cd hvantk
poetry install
poetry shell

Using pip

git clone https://github.com/bigbio/hvantk
cd hvantk
pip install -e .

Prerequisites: Python ≥3.10, Hail

Core Workflows

HGC: Joint Genotyping Pipeline

High-performance joint genotyping for large cohorts. Combines thousands of GVCF files with integrated QC.

# End-to-end pipeline
hvantk hgc pipeline -i /data/gvcfs -o /output

# Or run individual steps
hvantk hgc gvcf-combine -g /data/gvcfs -o cohort.vds
hvantk hgc vds2mt -i cohort.vds -o cohort.mt
hvantk hgc qc-report -i cohort.mt -o qc_report.html

📖 Full HGC Documentation
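
The individual steps above feed into one another: the VDS written by gvcf-combine is the input to vds2mt, and the resulting MatrixTable is the input to qc-report. A minimal sketch of a driver script that chains them is shown here. It uses only the subcommands and flags listed above; the helper names `hgc_step_commands` and `run_steps` are hypothetical and are not part of hvantk. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def hgc_step_commands(gvcf_dir, vds_path, mt_path, report_path):
    """Return the three individual HGC steps as argument lists, in run order.

    Hypothetical helper: it only assembles the commands shown in this README.
    """
    return [
        ["hvantk", "hgc", "gvcf-combine", "-g", gvcf_dir, "-o", vds_path],
        ["hvantk", "hgc", "vds2mt", "-i", vds_path, "-o", mt_path],
        ["hvantk", "hgc", "qc-report", "-i", mt_path, "-o", report_path],
    ]


def run_steps(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing step
            subprocess.run(cmd, check=True)


commands = hgc_step_commands("/data/gvcfs", "cohort.vds", "cohort.mt", "qc_report.html")
run_steps(commands, dry_run=True)
```

Passing dry_run=False runs the steps for real, which requires hvantk to be installed and on the PATH.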

PSROC: Variant Score Evaluation

Evaluate pathogenicity prediction scores (CADD, REVEL, MetaLR) using ClinVar truth labels. Generate ROC curves and performance metrics.

# Run ROC analysis
hvantk psroc \
  --genes-file genes.txt \
  --clinvar-ht clinvar.ht \
  --dbnsfp-ht dbnsfp.ht \
  --scores "CADD_phred,REVEL_score" \
  --output-dir results/

📖 PSROC Documentation | Example
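
To make the evaluation concrete, the following is a generic, pure-Python sketch of how a ROC curve and its area (AUC) can be computed from prediction scores and binary truth labels. It is not hvantk's implementation, which may differ. The function names and the toy scores and labels are made up for illustration, and the sketch assumes that a higher score means "more likely pathogenic".

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low.

    labels: 1 = pathogenic, 0 = benign (assumed encoding).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all variants tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, invented for illustration only
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(f"AUC = {auc(points):.3f}")
```

For this toy input, 8 of the 9 pathogenic/benign pairs are ranked correctly, so the AUC is 8/9. Comparing scores such as CADD_phred and REVEL_score amounts to computing this curve once per score over the same labeled variants.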

Annotation Tables

Create Hail Tables from public databases (ClinVar, gnomAD, Ensembl).

# Single table
hvantk mktable clinvar --raw-input clinvar.vcf.bgz --output-ht clinvar.ht

# Batch processing
hvantk mktable-batch --recipe tables_recipe.json

📖 Tables Guide

Expression Matrices

Build Hail MatrixTables from bulk and single-cell expression data.

# UCSC Cell Browser data
hvantk mkmatrix ucsc -e expr.tsv.bgz -m metadata.tsv -o ucsc.mt

# Batch processing
hvantk mkmatrix-batch --recipe matrices_recipe.json

📖 Expression Guide

Data Downloaders

Download curated datasets from public repositories.

hvantk ucsc-downloader --dataset adultPancreas --output-dir data/ucsc

📖 Data Sources

Quick Start

# Download and process expression data
hvantk ucsc-downloader --dataset adultPancreas --output-dir data/ucsc
hvantk mkmatrix ucsc -e data/ucsc/exprMatrix.tsv.bgz -m data/ucsc/meta.tsv -o data/ucsc/adultPancreas.mt

# Build annotation tables
hvantk mktable clinvar --raw-input clinvar.vcf.bgz --output-ht clinvar.ht --ref-genome GRCh38

# Or use batch processing with recipes (see examples/recipes/)
hvantk mktable-batch --recipe recipe.json

Documentation

Citation

If you use hvantk in your research, please cite:

@software{hvantk2024,
  title = {hvantk: Hail-based toolkit for multi-omics variant annotation and analysis},
  author = {Perez-Riverol, Yasset and Audain, Enrique},
  year = {2024},
  url = {https://github.com/bigbio/hvantk}
}

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed information on:

  • Development workflow and setup
  • Adding new data sources
  • Code style guidelines
  • Testing requirements
  • Pull request process

Developer quick start:

poetry install
pytest -q
hvantk --help

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

  • Built on Hail for distributed genomic data processing
  • Integrates data from ClinVar, gnomAD, Ensembl, UCSC, and other public resources
