-
Annotation is the process by which pertinent information about raw DNA sequences is added to genome databases. Multiple software applications have been developed to annotate genetic variants that can be derived automatically from diverse genomes (e.g., ANNOVAR, SnpEff). Existing tools have two main shortcomings. First, users must download the software and its large build files. Second, they scale poorly: because current tools are mainly sequential, or parallel only at the node level (requiring a large machine with many cores and a large main memory), annotating large numbers of patients is tedious and takes a significant amount of time.
-
The pay-as-you-go model of cloud computing is well suited for genomic analysis: it eliminates the maintenance effort required for a high-performance computing (HPC) facility while offering elastic scalability.
-
In this project, we developed a cloud-based annotation engine that annotates input datasets (e.g., VCF, mVCF files) in the cloud using distributed algorithms.
-
Version 1.0
-
Install the Google Cloud SDK, including the gcloud tool.
-
Set up the gcloud tool.
gcloud init
-
Authenticate.
gcloud auth application-default login
-
Clone this repo.
git clone https://github.com/StanfordBioinformatics/AnnotationHive.git
-
Install Maven.
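Before continuing, it can help to confirm that the tools named in the steps above are on your PATH. The following is a minimal sketch, not part of AnnotationHive: the helper name missing_tools is hypothetical, and it assumes the usual command names gcloud, git, and mvn.

```shell
#!/bin/sh
# Hypothetical helper (not part of AnnotationHive).
# Prints each command name that is not found on PATH, one per line,
# and returns non-zero if at least one is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Check the tools named in the installation steps.
if missing=$(missing_tools gcloud git mvn); then
    echo "All prerequisites found."
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing from PATH:" $missing
fi
```

If a tool is reported as missing, install it and rerun the check before moving on.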
-
Create a container.
docker run -it annotationhive/annotationhive_public:v1.6 bash
-
Authenticate inside the container.
gcloud auth application-default login
-
Set your GCP project.
gcloud config set project <PROJECT-ID>
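A common slip at this step is leaving the <PROJECT-ID> placeholder in the command. The sketch below checks a value against the GCP project ID format (6 to 30 characters; lowercase letters, digits, or hyphens; starting with a letter and not ending with a hyphen) before it is used. The function name is_valid_project_id and the PROJECT_ID variable are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of AnnotationHive).
# Returns 0 if the argument matches the GCP project ID format:
# 6-30 characters, lowercase letters, digits, or hyphens,
# starting with a letter and not ending with a hyphen.
is_valid_project_id() {
    printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# PROJECT_ID is an assumed variable; export it with your own project ID.
# It falls back to the literal placeholder so the check below fails loudly.
PROJECT_ID="${PROJECT_ID:-<PROJECT-ID>}"

if is_valid_project_id "$PROJECT_ID"; then
    echo "Run: gcloud config set project $PROJECT_ID"
else
    echo "Replace the placeholder with a real project ID: $PROJECT_ID" >&2
fi
```

The check only validates the format; it does not confirm that the project exists or that you have access to it.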
Section 1: Import VCF/mVCF/Annotation Files
This section explains how to import VCF, mVCF and annotation files to BigQuery.
Section 2: List Available Public Annotation Datasets
This section demonstrates how to list AnnotationHive's public annotation datasets.
Section 3: Variant-based Annotation
This section explains how to annotate a VCF/mVCF table against any number of variant-based annotation datasets.
Section 4: Interval-based Annotation
This section explains how to annotate a VCF/mVCF table against any number of interval-based annotation datasets.
Section 5: Variant-based and Interval-based Annotation
This section explains how to annotate a VCF/mVCF table against a combination of variant-based and interval-based annotation datasets.
Section 6: Gene-based Annotation
This section demonstrates how to run our gene-based annotation process for a VCF/mVCF table.
Section 7: Sample Experiments
This section presents several experiments on scalability and the cost of the system.
Section 8: Export Annotated VCF Table
This section explains how to export an annotated VCF file.
Section 9: Annotate a Small Number of Regions/Variants
This section explains how to annotate a small number of regions/variants.
Section 10: Import Private Annotation Datasets
This section explains how to import private annotation datasets.