Hail-based toolkit for multiomics variant annotation and analysis.
hvantk is a modular toolkit that uses Hail to annotate and analyze variants, genes, proteins, and expression data from heterogeneous omics sources. The library enables multiomics integration to improve the interpretation of genetic variants.
Core Capabilities:
- Variant annotations (ClinVar, dbNSFP, gnomAD, CCR scores)
- Gene annotations (Ensembl, GeVIR, gene constraints)
- Protein annotations (INSIDER protein-protein interactions)
- Expression data (bulk & single-cell RNA-seq from UCSC, GTEx)
- Joint genotyping workflows (GVCF combining, QC, format conversion)
- Recipe-based batch processing
git clone https://github.com/bigbio/hvantk
cd hvantk
poetry install
poetry shell
# Alternative: install with pip
git clone https://github.com/bigbio/hvantk
cd hvantk
pip install -e .
Prerequisites: Python ≥3.10, Hail
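As a quick sanity check before installing, the stated prerequisites can be verified from Python. This is a minimal sketch, not part of hvantk: the function name `check_prerequisites` is hypothetical, and the check only confirms that a `hail` package is importable, not that it is a compatible version.

```python
import importlib.util
import sys

# Minimum interpreter version, taken from the prerequisites stated above.
MIN_PYTHON = (3, 10)


def check_prerequisites() -> dict:
    """Report whether each stated prerequisite appears to be met.

    Hypothetical helper for illustration; hvantk does not ship this function.
    """
    return {
        "python": sys.version_info >= MIN_PYTHON,
        # find_spec returns None when no top-level package named "hail" is found.
        "hail": importlib.util.find_spec("hail") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Run it with the same interpreter you plan to install hvantk into, since both checks are specific to the active environment.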
High-performance joint genotyping for large cohorts. Combines thousands of GVCF files with integrated QC.
# End-to-end pipeline
hvantk hgc pipeline -i /data/gvcfs -o /output
# Or run individual steps
hvantk hgc gvcf-combine -g /data/gvcfs -o cohort.vds
hvantk hgc vds2mt -i cohort.vds -o cohort.mt
hvantk hgc qc-report -i cohort.mt -o qc_report.html
Evaluate pathogenicity prediction scores (CADD, REVEL, MetaLR) using ClinVar truth labels. Generate ROC curves and performance metrics.
# Run ROC analysis
hvantk psroc \
--genes-file genes.txt \
--clinvar-ht clinvar.ht \
--dbnsfp-ht dbnsfp.ht \
--scores "CADD_phred,REVEL_score" \
--output-dir results/
📖 PSROC Documentation | Example
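To make the evaluation concrete, the sketch below shows how a ROC curve and its area (AUC) can be computed from prediction scores and binary truth labels. It is a plain-Python illustration of the general technique, not hvantk's implementation. The function names are hypothetical, and it assumes labels have already been binarized (1 = pathogenic, 0 = benign) and that higher scores indicate pathogenicity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_curve(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Return ROC points, one per distinct score threshold.

    Assumes labels are 1 (pathogenic) or 0 (benign) and that a higher
    score means "more likely pathogenic".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one pathogenic and one benign variant")

    # Walk the variants from highest to lowest score, lowering the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all variants tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Toy values invented for illustration; not real CADD or REVEL scores.
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 0]
    print(f"AUC = {auc(roc_curve(toy_scores, toy_labels)):.3f}")
```

An AUC of 1.0 means the score ranks every pathogenic variant above every benign one, while 0.5 is no better than chance. Comparing AUCs across scores on the same variant set is the usual way to rank predictors.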
Create Hail Tables from public databases (ClinVar, gnomAD, Ensembl).
# Single table
hvantk mktable clinvar --raw-input clinvar.vcf.bgz --output-ht clinvar.ht
# Batch processing
hvantk mktable-batch --recipe tables_recipe.json
Build Hail MatrixTables from bulk and single-cell expression data.
# UCSC Cell Browser data
hvantk mkmatrix ucsc -e expr.tsv.bgz -m metadata.tsv -o ucsc.mt
# Batch processing
hvantk mkmatrix-batch --recipe matrices_recipe.json
Download curated datasets from public repositories.
hvantk ucsc-downloader --dataset adultPancreas --output-dir data/ucsc
# Download and process expression data
hvantk ucsc-downloader --dataset adultPancreas --output-dir data/ucsc
hvantk mkmatrix ucsc -e data/ucsc/exprMatrix.tsv.bgz -m data/ucsc/meta.tsv -o data/ucsc/adultPancreas.mt
# Build annotation tables
hvantk mktable clinvar --raw-input clinvar.vcf.bgz --output-ht clinvar.ht --ref-genome GRCh38
# Or use batch processing with recipes (see examples/recipes/)
hvantk mktable-batch --recipe recipe.json
- Usage Guide - Examples and recipes
- HGC Tool - Joint genotyping pipeline
- PSROC Tool - Variant score evaluation
- Data Sources - Available annotations
- Architecture - Design and extension points
- Full Index - Complete documentation
If you use hvantk in your research, please cite:
@software{hvantk2024,
title = {hvantk: Hail-based toolkit for multi-omics variant annotation and analysis},
author = {Perez-Riverol, Yasset and Audain, Enrique},
year = {2024},
url = {https://github.com/bigbio/hvantk}
}
We welcome contributions! Please see CONTRIBUTING.md for detailed information on:
- Development workflow and setup
- Adding new data sources
- Code style guidelines
- Testing requirements
- Pull request process
Developer quick start:
poetry install
pytest -q
hvantk --help
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Questions: Open a discussion on GitHub
- Documentation: docs/
- Built on Hail for distributed genomic data processing
- Integrates data from ClinVar, gnomAD, Ensembl, UCSC, and other public resources