VS-1736 - changing import to use parquet #9301

Draft
koncheto-broad wants to merge 44 commits into ah_var_store from VS-1736

@koncheto-broad
Collaborator

This pull request extends much older work that modified ingest to produce Parquet files (for Azure, at the time) instead of writing to BigQuery tables. As part of the VS-1736 spike, this PR modifies our ingest process to load those Parquet files directly into BigQuery using free APIs instead of the costly Write API.
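
For readers unfamiliar with the difference, the sketch below shows one way a Parquet load could be issued as a batch load job through the `bq` command-line tool. It is an illustration only, not the code in this PR: the helper name `build_bq_load_command`, the table `my_dataset.vet_001`, and the bucket path are all hypothetical, and it assumes the Parquet files sit in a GCS location reachable by a wildcard URI.

```python
import shlex
from typing import List


def build_bq_load_command(table: str, parquet_uri: str) -> List[str]:
    """Build the argv for a `bq load` batch load job that reads Parquet files.

    `table` is a `dataset.table` reference and `parquet_uri` is a GCS URI,
    optionally with a wildcard. Both values are supplied by the caller;
    nothing here is taken from the actual ingest WDL.
    """
    return ["bq", "load", "--source_format=PARQUET", table, parquet_uri]


# Hypothetical table and bucket names, for illustration only.
cmd = build_bq_load_command(
    "my_dataset.vet_001",
    "gs://my-bucket/parquet/vet/*.parquet",
)

# Print a shell-safe rendering of the command instead of running it.
print(shlex.join(cmd))
```

Building the argument list separately from executing it keeps the sketch runnable without credentials; in a real task the list could be passed to `subprocess.run`.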

koncheto-broad and others added 30 commits September 19, 2025 09:48
…ting in no matches when directory path was searched
koncheto-broad and others added 14 commits November 18, 2025 14:31
* Update to latest ah_var_store
* Fix some WDL syntax errors
* Disable 'ConfigureParquetLifecycle' by default, for now
* Pin gcnvkernel dependency for Python 3.10, other build fixes [VS-1789] (#9316)

---------

Co-authored-by: Miguel Covarrubias <mcovarr@users.noreply.github.com>
* Include sample id in Parquet file names.
* Store sample id in Parquet tracking table.
* Added a None check when parsing sample_id out of the Parquet file name.

---------

Co-authored-by: Miguel Covarrubias <mcovarr@broadinstitute.org>
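
As a rough illustration of the sample id commits above, the following sketch parses a sample id out of a Parquet file name and returns `None` when no id can be found. The naming convention (`..._sample_<digits>.parquet`), the function name `parse_sample_id`, and the choice of an integer id are assumptions made for this example; the actual file name format used by the PR is not shown in this conversation.

```python
import re
from typing import Optional

# Hypothetical convention: the file name ends in "_sample_<digits>.parquet".
_SAMPLE_ID_PATTERN = re.compile(r"_sample_(\d+)\.parquet$")


def parse_sample_id(file_name: Optional[str]) -> Optional[int]:
    """Return the sample id embedded in a Parquet file name, or None.

    None is returned both when no file name is given and when the name
    does not follow the assumed convention, so callers must handle it.
    """
    if file_name is None:
        return None
    match = _SAMPLE_ID_PATTERN.search(file_name)
    if match is None:
        return None
    return int(match.group(1))
```

Returning `None` rather than raising lets the caller decide whether an unparseable file name is an error or simply a file to skip.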
* Fixed the tests.
* Updated the gatk docker.
This PR updates the lifecycle config strategy for parquet so that updates are possible.
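
One way to make a lifecycle configuration updatable is to replace an existing rule for a given prefix rather than append a duplicate. The sketch below shows that idea against the GCS lifecycle JSON layout (`rule`, `action`, `condition`). It is an assumption about what "updates are possible" could mean, not a description of this PR's implementation; the function name `upsert_delete_rule` and the `parquet/` prefix are hypothetical.

```python
import copy
from typing import Any, Dict


def upsert_delete_rule(config: Dict[str, Any], prefix: str, age_days: int) -> Dict[str, Any]:
    """Return a copy of `config` holding exactly one Delete rule for `prefix`.

    Any earlier rule whose condition matches exactly this prefix is dropped
    and the new rule is appended; rules for other prefixes are kept as-is.
    """
    new_rule = {
        "action": {"type": "Delete"},
        "condition": {"age": age_days, "matchesPrefix": [prefix]},
    }
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    kept = [
        rule
        for rule in updated.get("rule", [])
        if rule.get("condition", {}).get("matchesPrefix") != [prefix]
    ]
    kept.append(new_rule)
    updated["rule"] = kept
    return updated


# Hypothetical existing config with a 30-day rule for the same prefix.
existing = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 30, "matchesPrefix": ["parquet/"]},
        }
    ]
}
updated = upsert_delete_rule(existing, "parquet/", 7)
```

Because the function works on a copy, the same call can be repeated safely, and the result can be serialized with `json.dumps` for whichever tool applies the configuration to the bucket.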
This PR adds a task to delete the parquet files once they are no longer in use. Because there was disagreement about how best to delete large numbers of files, it allows for an alternate deletion strategy.
4 participants