44 commits
5d37f97
pulling over initial parquet changes
koncheto-broad Mar 12, 2024
f0a3f15
finalizing merge
koncheto-broad Oct 16, 2024
41f3a26
cherry picking in parquet changes
koncheto-broad Mar 12, 2024
3caef40
finishing cherry picking
koncheto-broad Mar 19, 2024
6e4fd32
small update
koncheto-broad Mar 20, 2024
3958d6b
finally getting the parquet copying part sorted out
koncheto-broad Mar 20, 2024
20eb9cb
suffering a file name clash and overriding the same file. Changing t…
koncheto-broad Mar 20, 2024
d90990d
suffering a file name clash and overriding the same file. Changing t…
koncheto-broad Mar 20, 2024
da7254b
no more build issues
koncheto-broad Oct 8, 2025
2d48b4a
Cherry picking in and modifying parquet code for header and ref data
RoriCremer Jun 7, 2024
0969d88
updating header and ref code
koncheto-broad Oct 9, 2025
2022582
Updating vet schema to match current
koncheto-broad Oct 14, 2025
6606187
pushing wdl changes
koncheto-broad Oct 14, 2025
5997158
dockstore
koncheto-broad Oct 14, 2025
7d04210
turning on ref ranges creation
koncheto-broad Oct 15, 2025
fa37d5b
updating stupid issue with changing bash variable name
koncheto-broad Oct 15, 2025
5358f1b
making directory names consistent
koncheto-broad Oct 16, 2025
ca6d213
Implementing the doc, updating tests, and building the new docker
koncheto-broad Oct 28, 2025
09f3891
consolidating changes into standard import genomes workflow
koncheto-broad Oct 28, 2025
b2e6532
Prepping for test run--more work to be done integrating parquet with …
koncheto-broad Oct 28, 2025
d4a6d76
normalizing 'done' types
koncheto-broad Oct 29, 2025
a27a7d7
fixing incorrect docker images
koncheto-broad Oct 29, 2025
1fdf830
getting rid of billing project argument to see if the bucket listing …
koncheto-broad Oct 29, 2025
6ad2531
Getting rid of billing project input instead of commenting it out
koncheto-broad Oct 29, 2025
90d5331
Fixing issue with code creating multiple trailing slashes, thus resul…
koncheto-broad Oct 29, 2025
6f9aad5
Fixing weird ref_ranges -> ref name remapping and fixing WDL error th…
koncheto-broad Oct 29, 2025
4d5ef6b
Updating tests and variants docker
koncheto-broad Oct 29, 2025
17aed13
updating the parquet file loading to better handle interruptions
koncheto-broad Nov 17, 2025
7a26988
bug fix
koncheto-broad Nov 18, 2025
24949b7
updating docker
koncheto-broad Nov 18, 2025
1dd6e2c
making this more idempotent
koncheto-broad Nov 18, 2025
c883dd2
Adding lifecycle rules so the parquet files auto-delete after 14 days
koncheto-broad Nov 21, 2025
934b83a
Adding useful documentation
koncheto-broad Dec 3, 2025
0e1e978
moving docs around
koncheto-broad Dec 8, 2025
fb79cf5
VS-1785 - Update Parquet branch to latest in ah_var_store (#9310)
gbggrant Jan 28, 2026
c0737c5
VS-1810 add sample id to parquet tracking table (#9333)
gbggrant Feb 19, 2026
4781a93
VS-1793 investigate failing tests (#9335)
gbggrant Feb 24, 2026
6832cbe
Restore accidentally removed CreateVariantIngestFiles code [VS-1819] …
mcovarr Feb 25, 2026
7bc6571
Set sample_info.is_loaded for Parquet ingest [VS-1780] (#9320)
gbggrant Feb 25, 2026
cf0005f
Prep for merge [VS-1800] (#9327)
mcovarr Feb 26, 2026
d4f0251
Fix Parquet exome NPE, compressed refs [VS-1799] [VS-1801] (#9329)
mcovarr Mar 5, 2026
f992427
VS-1794 parquet removal strategy (#9337)
gbggrant Mar 6, 2026
6f1c838
Parquet ploidy loading [VS-1809] (#9334)
mcovarr Mar 9, 2026
71a586d
VS-1779 clean up parquet files immediatement (#9338)
gbggrant Mar 10, 2026
84 changes: 82 additions & 2 deletions .dockstore.yml
@@ -211,17 +211,32 @@ workflows:
branches:
- master
- ah_var_store
- VS-1736
- vs_1800_prep_for_merge
tags:
- /.*/
- name: GvsBulkIngestGenomes
- name: BulkIngestWriteAPI
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl
filters:
branches:
- master
- ah_var_store
- vs_1799_fix_parquet_exome_npe
- vs_1809_parquet_ploidy
tags:
- /.*/
- name: BulkIngestParquet
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl
filters:
branches:
- master
- ah_var_store
- vs_1799_fix_parquet_exome_npe
- vs_1809_parquet_ploidy
tags:
- /.*/
- name: GvsPrepareRangesCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -343,9 +358,39 @@ workflows:
branches:
- master
- ah_var_store
- vs_1730_clusters_leaking
- vs_1800_prep_for_merge
tags:
- /.*/
- name: GvsQuickstartIntegrationParquet1809
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl
filters:
branches:
- master
- ah_var_store
- vs_1809_parquet_ploidy
tags:
- /.*/
- name: GvsQuickstartIntegrationWriteAPI1809
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl
filters:
branches:
- master
- ah_var_store
- vs_1809_parquet_ploidy
tags:
- /.*/
- name: GvsQuickstartIntegrationBGEOnlyParquet1809
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl
filters:
branches:
- master
- ah_var_store
- vs_1809_parquet_ploidy
tags:
- /.*/
- name: GvsIngestTieout
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsIngestTieout.wdl
@@ -462,6 +507,14 @@ workflows:
branches:
- master
- ah_var_store
- name: GvsMapUnmappedVIDs
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsMapUnmappedVIDs.wdl
filters:
branches:
- master
- ah_var_store
- vs_1757_dropped_duplicates
- name: MergePgenHierarchicalWdl
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/MergePgenHierarchical.wdl
@@ -481,3 +534,30 @@ workflows:
filters:
branches:
- ah_var_store
- name: SearchGVCFsAtSite
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/SearchGVCFsAtSite.wdl
filters:
branches:
- ah_var_store
- dst_2716_an_divergence_echo_foxtrot
- name: GvsCreateParticipantMappingTable
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsCreateParticipantMappingTable.wdl
filters:
branches:
- master
- ah_var_store
- vs_1757_dropped_duplicates
tags:
- /.*/
- name: GvsMapDroppedDuplicateVIDs
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl
filters:
branches:
- master
- ah_var_store
- vs_1757_dropped_duplicates
tags:
- /.*/
4 changes: 2 additions & 2 deletions .github/workflows/gatk-tests.yml
@@ -104,7 +104,7 @@ jobs:
#Google Cloud stuff
- id: 'gcloud-auth'
if: needs.check-secrets.outputs.google-credentials == 'true'
uses: google-github-actions/auth@v0
uses: google-github-actions/auth@v3
with:
credentials_json: ${{ secrets.GCP_CREDENTIALS }}
project_id: ${{ env.HELLBENDER_TEST_PROJECT }}
@@ -175,7 +175,7 @@ jobs:

#Google Cloud stuff
- id: 'gcloud-auth'
uses: google-github-actions/auth@v0
uses: google-github-actions/auth@v3
if: needs.check-secrets.outputs.google-credentials == 'true'
with:
credentials_json: ${{ secrets.GCP_CREDENTIALS }}
9 changes: 8 additions & 1 deletion build.gradle
@@ -74,7 +74,7 @@ final guavaVersion = System.getProperty('guava.version', '32.1.3-jre')
final log4j2Version = System.getProperty('log4j2Version', '2.24.1')
final testNGVersion = System.getProperty('testNGVersion', '7.7.0')
final googleCloudNioVersion = System.getProperty('googleCloudNioVersion','0.127.8')
final gklVersion = System.getProperty('gklVersion', '0.9.0')
final gklVersion = System.getProperty('gklVersion', '0.9.1')

final baseJarName = 'gatk'
final secondaryBaseJarName = 'hellbender'
@@ -389,6 +389,13 @@ dependencies {
// pgen jni
implementation('org.broadinstitute:pgenjni:1.0.1')

// parquet writing
implementation('org.apache.parquet:parquet-common:1.13.1')
implementation('org.apache.parquet:parquet-encoding:1.13.1')
implementation('org.apache.parquet:parquet-column:1.13.1')
implementation('org.apache.parquet:parquet-hadoop:1.13.1')
implementation 'org.apache.parquet:parquet-avro:1.13.1'

testUtilsImplementation sourceSets.main.output
testUtilsImplementation 'org.testng:testng:' + testNGVersion
testUtilsImplementation 'org.apache.hadoop:hadoop-minicluster:' + hadoopVersion
14 changes: 14 additions & 0 deletions build.log
@@ -0,0 +1,14 @@
Option -e requires an argument.
Usage: ./build_docker.sh: -e <GITHUB_TAG> [-psl]
where <GITHUB_TAG> is the github tag (or hash when -s is used) to use in building the docker image
(e.g. bash build_docker.sh -e 1.0.0.0-alpha1.2.1)
Optional arguments:
-s The GITHUB_TAG (-e parameter) is actually a github hash, not tag. git hashes cannot be pushed as latest, so -l is implied.
-l Do not also push the image to the 'latest' tag.
-u Do not run the unit tests.
-m Build the lite image (which does not contain the conda environment).
-d <STAGING_DIR> staging directory to grab code from repo and build the docker image. If unspecified, then use whatever is in current dir (do not go to the repo). NEVER SPECIFY YOUR WORKING DIR
-p (GATK4 developers only) push image to docker hub once complete. This will use the GITHUB_TAG in dockerhub as well.
Unless -l is specified, this will also push this image to the 'latest' tag.
-r (GATK4 developers only) Do not remove the unit test docker container. This is useful for debugging failing unit tests.
-t <PULL_REQUEST_NUMBER> (Travis CI only) The pull request number. This is only used during pull request builds on Travis CI.
87 changes: 87 additions & 0 deletions docs/gvs_ingest_refactoring.md
@@ -0,0 +1,87 @@
# GVS Ingest Package Refactoring

## Overview

The `org.broadinstitute.hellbender.tools.gvs.ingest` package has been refactored to support multiple output formats through a consistent interface-based architecture. This refactoring separates concerns by format type and introduces polymorphism for easier addition of new output formats.

## Package Structure

```
src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/
├── VetWriter.java (interface)
├── RefRangesWriter.java (interface)
├── SamplePloidyWriter.java (interface)
├── avro/
│ └── RefRangesAvroWriter.java
├── bq/
│ ├── AbstractBQWriter.java (abstract base)
│ ├── VetBQWriter.java
│ ├── RefRangesBQWriter.java
│ └── SamplePloidyBQWriter.java
├── parquet/
│ ├── AbstractParquetFileWriter.java (abstract base)
│ ├── VetParquetFileWriter.java
│ ├── RefRangesParquetFileWriter.java
│ ├── SamplePloidyParquetFileWriter.java
│ └── HeaderParquetFileWriter.java
└── tsv/
├── VetTsvWriter.java
└── RefRangesTsvWriter.java
```

## Key Design Elements

### Writer Interfaces

Three core interfaces define the write operations for GVS data types:

- **`VetWriter`** - Writes variant information (location, sample ID, variant context)
- **`RefRangesWriter`** - Writes reference ranges (location, length, state, compressed data)
- **`SamplePloidyWriter`** - Writes chromosome ploidy information

All interfaces extend `Closeable` and include a default `commitData()` no-op method that implementations can override.
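
As a rough sketch (not the actual GVS source), one of these interfaces might look like the following; the method name and parameters are illustrative only, and the real `RefRangesWriter` also carries compressed ref data:

```java
import java.io.Closeable;
import java.io.IOException;

// A sketch, not the actual GVS code: the writer interfaces described above
// extend Closeable and ship a default no-op commitData().
class WriterInterfaceSketch {

    // Hypothetical mirror of RefRangesWriter.
    interface RefRangesWriter extends Closeable {
        void writeRow(long location, int length, String state) throws IOException;

        // Streaming backends (e.g. the BigQuery Write API) override this to
        // flush and commit; file-based writers can rely on the no-op default.
        default void commitData() { }
    }

    // Toy in-memory implementation used only to exercise the interface.
    static class InMemoryRefRangesWriter implements RefRangesWriter {
        final StringBuilder rows = new StringBuilder();

        public void writeRow(long location, int length, String state) {
            rows.append(location).append('\t').append(length)
                .append('\t').append(state).append('\n');
        }

        public void close() { }
    }
}
```

The default `commitData()` keeps simple file-based writers from having to stub out a method that only streaming backends need.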

### Abstract Base Classes

Two abstract base classes provide shared infrastructure for their respective output formats:

- **`AbstractBQWriter`** - Handles BigQuery Write API operations via `PendingBQWriter`
- Provides `write(JSONObject)` for adding rows
- Implements `commitData()` to flush and commit write streams

- **`AbstractParquetFileWriter`** - Handles Parquet file operations
- Uses composition pattern with `ParquetWriteSupport`
- Provides `write(JSONObject)` for writing records
- Manages compression and schema configuration
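
The shared-base pattern can be sketched as follows; names here are invented, and a plain `Map` stands in for `JSONObject`:

```java
import java.io.Closeable;
import java.util.Map;

// Sketch of the shared-base pattern described above, not the real classes.
class BaseWriterSketch {

    // Analogue of the AbstractBQWriter role: owns the common write path so
    // subclasses only supply format-specific encoding.
    abstract static class AbstractRowWriter implements Closeable {
        private int rowsWritten = 0;

        // Shared entry point used by every format.
        final void write(Map<String, Object> row) {
            encode(row);
            rowsWritten++;
        }

        protected abstract void encode(Map<String, Object> row);

        int rowCount() { return rowsWritten; }

        public void close() { }
    }

    // One concrete format: renders each row as a single text line.
    static class LineRowWriter extends AbstractRowWriter {
        final StringBuilder out = new StringBuilder();

        protected void encode(Map<String, Object> row) {
            out.append(row.toString()).append('\n');
        }
    }
}
```

Keeping bookkeeping (row counts, commit/flush sequencing) in the base class is what lets `VetBQWriter`, `RefRangesBQWriter`, and `SamplePloidyBQWriter` stay thin.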

### Format-Specific Implementations

#### BigQuery (`bq/`)
- `VetBQWriter`, `RefRangesBQWriter`, `SamplePloidyBQWriter`
- Extend `AbstractBQWriter` and implement respective interfaces
- Write data directly to BigQuery via the Write API

#### Parquet (`parquet/`)
- `VetParquetFileWriter`, `RefRangesParquetFileWriter`, `SamplePloidyParquetFileWriter`
- Extend `AbstractParquetFileWriter` and implement respective interfaces
- Write data to local Parquet files for later bulk loading

#### TSV (`tsv/`)
- `VetTsvWriter`, `RefRangesTsvWriter`
- Direct implementations for tab-separated value output

#### Avro (`avro/`)
- `RefRangesAvroWriter`
- Legacy format support
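
Because every backend satisfies the same interface, the ingest tool can choose a format at runtime. A minimal illustration, with invented names (the real tool's selection mechanism may differ):

```java
// Illustrative only: how the interface-based design lets the ingest tool pick
// an output backend at runtime. The enum and factory are invented for this sketch.
class WriterFactorySketch {

    enum OutputType { TSV, PARQUET, BQ }

    // Single abstract method (close has a default), so lambdas can stand in
    // for the real format-specific classes in this sketch.
    interface VetWriter extends AutoCloseable {
        void writeRow(String encodedRow);

        @Override
        default void close() { }
    }

    // Each case plays the role of VetTsvWriter / VetParquetFileWriter / VetBQWriter.
    static VetWriter createVetWriter(OutputType type, StringBuilder sink) {
        switch (type) {
            case TSV:     return row -> sink.append(row).append('\n');
            case PARQUET: return row -> sink.append("parquet:").append(row);
            default:      return row -> sink.append("bq:").append(row);
        }
    }
}
```

Callers only ever see the `VetWriter` type, so adding a new backend means adding a new implementation and one factory case.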

## Benefits

1. **Separation of Concerns**: Output format logic is isolated in dedicated packages
2. **Extensibility**: New formats can be added by implementing the writer interfaces
3. **Code Reuse**: Abstract base classes eliminate duplication across similar implementations
4. **Type Safety**: Interface contracts ensure consistent APIs across formats
5. **Testability**: Each implementation can be tested independently

## Note

VCF header writing does not yet have a dedicated interface; that should be added when VS-1803 (support writing VCF header data with Parquet) is implemented.
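
A purely hypothetical shape for that future interface, mirroring the pattern of the existing ones (VS-1803 is not implemented, so this is speculation, not a spec):

```java
import java.io.Closeable;

// Purely hypothetical: sketches how a header-writer interface could mirror
// VetWriter and friends once VS-1803 lands. Nothing here exists in the repo.
interface VcfHeaderWriter extends Closeable {
    void writeHeaderLine(String headerLine);

    // Same default no-op as the other writer interfaces.
    default void commitData() { }
}
```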