AnnotationHive

  • Annotation is the process by which pertinent information about raw DNA sequences is added to genome databases. Multiple software applications (e.g., ANNOVAR, SnpEff) have been developed to automatically annotate genetic variants derived from diverse genomes. Existing tools have two shortcomings. First, users must download the software along with large build files. Second, they scale poorly: because current tools are mainly sequential, or parallel only at the node level (requiring a large machine with many cores and a large main memory), annotating large numbers of patients is tedious and takes a significant amount of time.

  • The pay-as-you-go model of cloud computing, which eliminates the maintenance effort required for a high performance computing (HPC) facility while simultaneously offering elastic scalability, is well suited for genomic analysis.

  • In this project, we developed a cloud-based annotation engine that annotates input datasets (e.g., VCF, mVCF files) in the cloud using distributed algorithms.

  • Version 1.0

Quickstart

  1. Install the Google Cloud SDK, including the gcloud tool.

  2. Set up the gcloud tool.

    gcloud init
    
  3. Authenticate.

    gcloud auth application-default login
    
  4. Clone this repo.

    git clone https://github.com/StanfordBioinformatics/AnnotationHive.git
    
  5. Install Maven.
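
Before building, it can help to confirm that the tools from the steps above are actually on your `PATH`. The following is a minimal sketch rather than part of AnnotationHive; the helper name `missing_tools` is hypothetical, and it assumes the Maven executable is installed as `mvn`.

```shell
#!/bin/sh
# Hypothetical helper (not part of AnnotationHive):
# print the name of each requested tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools used in the Quickstart steps (assumes Maven is installed as "mvn").
missing=$(missing_tools gcloud git mvn)

if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

If anything is reported as missing, revisit the corresponding step before continuing.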

Containerized Version

  1. Create a container.

    docker run -it annotationhive/annotationhive_public:v1.6 bash
    
  2. Authenticate.

    gcloud auth application-default login
    
  3. Set your GCP project.

    gcloud config set project <PROJECT-ID>
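
A mistyped project ID is an easy mistake at this step. The sketch below checks the ID's shape before it is passed to `gcloud config set project`. It assumes the usual Google Cloud rule for project IDs (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen) and does not handle legacy domain-scoped IDs. The function name `is_valid_project_id` and the example ID are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the argument
# has the shape of a GCP project ID under the assumed rule:
# 6-30 chars, lowercase letters/digits/hyphens, starts with a
# letter, does not end with a hyphen.
is_valid_project_id() {
    printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Placeholder value; replace with your own project ID.
PROJECT_ID="my-annotation-project"

if is_valid_project_id "$PROJECT_ID"; then
    echo "Project ID looks well-formed: $PROJECT_ID"
    # Uncomment to apply it:
    # gcloud config set project "$PROJECT_ID"
else
    echo "Project ID does not look valid: $PROJECT_ID" >&2
fi
```

This only checks the format; whether the project exists and is accessible is still determined by `gcloud` itself.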
    

This section explains how to import VCF, mVCF and annotation files to BigQuery.

This section demonstrates how to list AnnotationHive's public datasets.

This section explains how to annotate a VCF/mVCF table against any number of variant-based annotation datasets.

This section explains how to annotate a VCF/mVCF table against any number of interval-based annotation datasets.

This section explains how to run a combination of interval-based and variant-based annotation datasets.

This section demonstrates how to run our gene-based annotation process for a VCF/mVCF table.

This section presents several experiments on scalability and the cost of the system.

This section explains how to export an annotated VCF file.

This section explains how to annotate a small number of regions/variants.

This section explains how to import private annotation datasets.