Section 5: Variant-based and Interval-based Annotation

This section explains how to annotate an mVCF/VCF table against a combination of interval-based and variant-based annotation datasets.

mVCF/VCF

After successfully importing the mVCF/VCF files (refer to Section 1), you can run them against a number of annotation datasets (to list the available annotation datasets, refer to Section 2).

Two types of annotation datasets are available: 1) Variant annotation datasets that contain chrom, start, end, ref, alt, ..., and 2) Generic annotation datasets that contain chrom, start, end, .... The main difference is that Generic annotation datasets do not have any ref or alt fields.
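
To make the distinction concrete, the sketch below contrasts the two matching rules. It is an illustration only, not AnnotationHive's implementation: the Record layout, the function names, and the matching rules themselves (exact position and allele agreement for variant annotations, closed-interval overlap for generic annotations) are assumptions.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical record layout; ref/alt stay None for generic annotations."""
    chrom: str
    start: int
    end: int
    ref: Optional[str] = None
    alt: Optional[str] = None


def variant_match(variant: Record, annotation: Record) -> bool:
    # Assumed rule: a variant annotation applies only when the position
    # and both alleles agree.
    return (
        (variant.chrom, variant.start, variant.end, variant.ref, variant.alt)
        == (annotation.chrom, annotation.start, annotation.end,
            annotation.ref, annotation.alt)
    )


def interval_match(variant: Record, annotation: Record) -> bool:
    # Assumed rule: a generic annotation applies when the two closed
    # intervals overlap on the same chromosome; ref/alt are ignored.
    return (
        variant.chrom == annotation.chrom
        and variant.start <= annotation.end
        and annotation.start <= variant.end
    )


# Made-up coordinates, for illustration only.
snv = Record("17", 1000, 1000, "A", "G")
gene = Record("17", 900, 1500)                      # generic: no ref/alt
known_snp = Record("17", 1000, 1000, "A", "G")      # variant annotation
other_allele = Record("17", 1000, 1000, "A", "T")   # same site, different alt
```

Under these assumptions, snv picks up the gene annotation because the intervals overlap, matches known_snp exactly, and does not match other_allele because the alt allele differs.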

Here are the key parameters for interval-based annotation:

--bigQueryDatasetId: Specify the BigQuery dataset ID that will contain the output annotated VCF table.
--VCFTables: Specify the BigQuery address of the mVCF/VCF table.
--VCFCanonicalizeRefNames: Specify the prefix for the reference field in the VCF tables (e.g., "chr"). AnnotationHive automatically canonicalizes the VCF table by removing this prefix during its calculations.
--projectId: Specify the project ID that has access to the VCF tables.
--bucketAddrAnnotatedVCF: Specify the full address, including the bucket and file name, of the output VCF file (e.g., gs://mybucket/outputVCF.vcf).
--genericAnnotationTables: Specify the address of the Generic annotation tables (e.g., gbsc-gcp-project-cba:AnnotationHive.hg19_UCSC_RefGene).
--variantAnnotationTables: Specify the address of the Variant annotation tables (e.g., gbsc-gcp-project-cba:AnnotationHive.hg19_UCSC_snp144).
--createVCF: To obtain a VCF file, set this flag to true (the default value is false, in which case a table is created).
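
As a rough picture of what --VCFCanonicalizeRefNames describes, the following sketch strips a prefix such as "chr" from a reference name. The function name is hypothetical and the exact behavior of AnnotationHive (for example, case sensitivity) is not specified here, so treat this as an assumption-laden illustration.

```python
def canonicalize_ref_name(name: str, prefix: str = "chr") -> str:
    """Return the reference name without the given prefix, if it has one.

    Hypothetical helper: names that lack the prefix are returned unchanged.
    """
    if prefix and name.startswith(prefix):
        return name[len(prefix):]
    return name


# "chr17" and "17" end up with the same canonical name, so a VCF table
# and an annotation table that use different conventions can be compared.
canonical = [canonicalize_ref_name(n) for n in ("chr17", "17", "chrX")]
```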

AnnotationHive VCF Table

To run interval-based annotation on an mVCF/VCF file imported using AnnotationHive's API, modify and run the following command:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Google_Cloud_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=<Your_BigQuery_DatasetId>  --outputBigQueryTable=<Output_VCF_Table_Name> --variantAnnotationTables=<ProjectID>:<DatasetID>.<AnnotationTableID>:<Field1>:<Field2>:...:<FieldN> --genericAnnotationTables=<ProjectID>:<DatasetID>.<AnnotationTableID>:<Field1>:<Field2>:...:<FieldN>  --VCFTables=<ProjectID>:<DatasetID>.<VCF/mVCF_Table_ID> --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/<Staging_Address>/" -Pdataflow-runner
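
The annotation table arguments in the command above follow the pattern <ProjectID>:<DatasetID>.<AnnotationTableID>:<Field1>:...:<FieldN>. The sketch below splits such a string into its parts, which can be handy for checking an argument before launching a job. TableSpec and parse_table_spec are hypothetical names, and the parser assumes the project ID itself contains no colon.

```python
from typing import List, NamedTuple


class TableSpec(NamedTuple):
    """Hypothetical container for one annotation table argument."""
    project: str
    dataset: str
    table: str
    fields: List[str]


def parse_table_spec(spec: str) -> TableSpec:
    # Expected shape: project:dataset.table[:field1[:field2...]]
    parts = spec.split(":")
    if len(parts) < 2:
        raise ValueError(f"expected <ProjectID>:<DatasetID>.<TableID>, got {spec!r}")
    project, dataset_table, fields = parts[0], parts[1], parts[2:]
    dataset, dot, table = dataset_table.partition(".")
    if not (project and dot and dataset and table):
        raise ValueError(f"expected <ProjectID>:<DatasetID>.<TableID>, got {spec!r}")
    return TableSpec(project, dataset, table, fields)


# "my-project" is a placeholder project ID.
spec = parse_table_spec(
    "my-project:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid"
)
```

Here spec.table is "sample_variant_annotation_chr17" and spec.fields is ["alleleFreq", "dbsnpid"].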

For our test example:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_test_chr17 --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=<YOUR_Project_ID>:test.NA12877_chr17 --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging" -Pdataflow-runner
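
If you launch this command repeatedly (for example, once per chromosome), it may be convenient to assemble it in a script. The sketch below builds the same Maven invocation as an argument list, using only flags that appear in this section. The helper name and the "my-project" and "my-bucket" values are placeholders, not part of AnnotationHive.

```python
from typing import Dict, List

MAIN_CLASS = "com.google.cloud.genomics.cba.StartAnnotationHiveEngine"


def build_annotate_command(options: Dict[str, str]) -> List[str]:
    """Return the Maven invocation as an argument list (hypothetical helper).

    Passing a list to subprocess avoids shell quoting: the whole
    -Dexec.args value travels as a single argument.
    """
    flags = " ".join(f"--{name}={value}" for name, value in options.items())
    return [
        "mvn", "compile", "exec:java",
        f"-Dexec.mainClass={MAIN_CLASS}",
        f"-Dexec.args=BigQueryAnnotateVariants {flags}",
        "-Pdataflow-runner",
    ]


command = build_annotate_command({
    "projectId": "my-project",  # placeholder project ID
    "runner": "DataflowRunner",
    "bigQueryDatasetId": "test",
    "outputBigQueryTable": "annotate_variant_transcript_test_chr17",
    "genericAnnotationTables":
        "my-project:test.sample_transcript_annotation_chr17:name:name2",
    "variantAnnotationTables":
        "my-project:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid",
    "VCFTables": "my-project:test.NA12877_chr17",
    "stagingLocation": "gs://my-bucket/staging",  # placeholder bucket
})
# To launch the job from the project root:
#   import subprocess; subprocess.run(command, check=True)
```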

Google VCF Table

If the VCF table was imported using Google APIs, set --googleVCF=true. Here is a test example that uses the 1000 Genomes Project mVCF file imported by Google Genomics:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_Google_1000_test_chr17 --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=genomics-public-data:1000_genomes_phase_3.variants --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging --googleVCF=true" -Pdataflow-runner

If you want AnnotationHive to calculate the number of samples that have variants, set --numberSamples=true. Here is a test example for the 1000 Genomes Project mVCF file imported by Google Genomics, which contains a total of 2,504 samples:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_Google_1000_test_chr17_with_num_samples --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=genomics-public-data:1000_genomes_phase_3.variants --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging --googleVCF=true --numberSamples=true" -Pdataflow-runner
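
How AnnotationHive derives the sample count is not described in this section. One plausible reading, sketched below, is that a sample "has" a variant when at least one of its genotype alleles is non-reference. The genotype encoding (0 for the reference allele, positive integers for alternate alleles, -1 for a missing call), the function name, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List


def count_samples_with_variant(genotypes: Dict[str, List[int]]) -> int:
    """Count samples carrying at least one alternate allele.

    Assumed encoding: 0 = reference, >0 = alternate allele index, -1 = no call.
    """
    return sum(
        1 for alleles in genotypes.values()
        if any(allele > 0 for allele in alleles)
    )


# Made-up genotype calls for a single variant site.
calls = {
    "sample_a": [0, 1],    # heterozygous: counted
    "sample_b": [0, 0],    # homozygous reference: not counted
    "sample_c": [1, 1],    # homozygous alternate: counted
    "sample_d": [-1, -1],  # missing call: not counted
}
n_samples = count_samples_with_variant(calls)
```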
