Skip to content

Latest commit

 

History

History
48 lines (35 loc) · 5.01 KB

Variant-Interval-Annotation.md

File metadata and controls

48 lines (35 loc) · 5.01 KB

Section 5: Variant-based and Interval-based Annotation

This section explains how to run a combination of interval-based and variant-based annotation datasets.

mVCF/VCF

After successfully importing the mVCF/VCF files (refer to Section 1), they can be run them against a number of annotation datasets (to list annotation datasets refer to Section 2).

Two types of annotation datasets are available: 1) Variant annotation datasets that contain chrom, start, end, ref, alt, ..., and 2) Generic annotation datasets that contain chrom, start, end, .... The main difference is that Generic annotation datasets do not have any ref or alt fields.

Here are the key parameters for interval-based annotation:

  • --bigQueryDatasetId: Specify the BigQuery dataset ID that will contain the output annotated VCF table.
  • --VCFTables: Specify the BigQuery address of the mVCF/VCF table on BigQuery
  • --VCFCanonicalizeRefNames: Specify the prefix for the reference field in the VCF tables (e.g, "chr"). AnnotationHive automatically canonicalizes the VCF table by removing the prefix in its calculation.
  • --projectId: Specify the project ID that has access to the VCF tables.
  • --bucketAddrAnnotatedVCF: Specify the full bucket and name address to the output VCF File (e.g., gs://mybucket/outputVCF.vcf).
  • --genericAnnotationTables: Specify the address of the Generic annotation tables (e.g., gbsc-gcp-project-cba:AnnotationHive.hg19_UCSC_RefGene).
  • --variantAnnotationTables: Specify the address of the Variant annotation tables (e.g., gbsc-gcp-project-cba:AnnotationHive.hg19_UCSC_snp144).
  • --createVCF: If you wish to obtain a VCF file, then set this flag true (default value is false, and it creates a table).

AnnotationHive VCF Table

To run interval-based annotation, after importing the mVCF/VCF file using AnnotationHive's API, then modify and run the following command:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Google_Cloud_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=<Your_BigQuery_DatasetId>  --outputBigQueryTable=<Output_VCF_Table_Name> --variantAnnotationTables=<ProjectID>:<DatasetID>.<AnnotationTableID>:<Field1>:<Field2>:...:<FieldN> --genericAnnotationTables=<ProjectID>:<DatasetID>.<AnnotationTableID>:<Field1>:<Field2>:...:<FieldN>  --VCFTables=<ProjectID>:<DatasetID>.<VCF/mVCF_Table_ID> --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/<Staging_Address>/" -Pdataflow-runner

For our test example:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_test_chr17 --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=<YOUR_Project_ID>:test.NA12877_chr17 --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging" -Pdataflow-runner

Google VCF Table

If the VCF table was imported using Google APIs, then set --googleVCF=true. Here is a test example for the 1000 Genomes Project mVCF file imported by Google Genomics.

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<YOUR_Project_ID> --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_Google_1000_test_chr17 --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=genomics-public-data:1000_genomes_phase_3.variants --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging --googleVCF=true" -Pdataflow-runner

Now, if you want AnnotationHive to calculate the number of samples that have variants, then set --numberSamples=true. Here is a test example for the 1000 Genomes Project mVCF file imported by Google Genomics (Total number of samples in the mVCF file: 2,504). AnnotationHive will find the number of samples that have variants:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=YOUR_Project_ID --runner=DataflowRunner --bigQueryDatasetId=test  --outputBigQueryTable=annotate_variant_transcript_Google_1000_test_chr17_with_num_samples --genericAnnotationTables=<YOUR_Project_ID>:test.sample_transcript_annotation_chr17:name:name2  --variantAnnotationTables=<YOUR_Project_ID>:test.sample_variant_annotation_chr17:alleleFreq:dbsnpid --VCFTables=genomics-public-data:1000_genomes_phase_3.variants --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging --googleVCF=true --numberSamples=true" -Pdataflow-runner