@@ -14,7 +14,6 @@ Default settings:
   --worker_machine_type <default n1-standard-1> \
   --disk_size_gb <default 250> \
   --worker_disk_type <default PD> \
-  --num_bigquery_write_shards <default 1> \
   --partition_config_path <default None> \
 ```

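For orientation, these flags ride on the ordinary `vcf_to_bq` command line next to the required input/output flags. A minimal sketch of such an invocation, assuming the run-from-source entry point, with hypothetical project, bucket, and table names:

```bash
# Sketch only: project, bucket, and table names below are placeholders.
python -m gcp_variant_transforms.vcf_to_bq \
  --project my-project \
  --input_pattern "gs://my-bucket/vcfs/*.vcf" \
  --output_table my-project:my_dataset.my_table \
  --temp_location gs://my-bucket/temp \
  --job_name vcf-to-bq \
  --runner DataflowRunner \
  --worker_machine_type n1-standard-4 \
  --disk_size_gb 250
```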
@@ -98,8 +97,7 @@ transforms (e.g. the sample name is repeated in every record in the BigQuery
 output rather than just being specified once as in the VCF header), you
 typically need 3 to 4 times the total size of the raw VCF files.

-In addition, if [merging](variant_merging.md) or
-[--num_bigquery_write_shards](#--num_bigquery_write_shards) is enabled, you may
+In addition, if [merging](variant_merging.md) is enabled, you may
 need more disk per worker (e.g. 500GB) as the same variants need to be
 aggregated together on one machine.

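A back-of-the-envelope reading of that sizing rule, with made-up numbers:

```bash
# Hypothetical sizing: 100 GB of raw VCFs expand 3-4x through the transforms.
RAW_GB=100
WORKERS=4
echo "aggregate scratch space: $((RAW_GB * 3))-$((RAW_GB * 4)) GB"
echo "per worker (${WORKERS} workers): ~$((RAW_GB * 4 / WORKERS)) GB, well under the 250 GB default"
```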
@@ -110,32 +108,14 @@ more expensive. However, when choosing a large machine (e.g. `n1-standard-16`),
 they can reduce cost as they can avoid idle CPU cycles due to disk IOPS
 limitations.

-As a result, we recommend using SSDs if [merging](variant_merge.md) or
-[--num_bigquery_write_shards](#--num_bigquery_write_shards) is enabled: these
-operations require "shuffling" the data (i.e. redistributing the data among
-workers), which require significant disk I/O.
+As a result, we recommend using SSDs if [merging](variant_merge.md) is enabled:
+this operation requires "shuffling" the data (i.e. redistributing the data
+among workers), which requires significant disk I/O.

 Set
 `--worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd`
 to use SSDs.

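When merging is on, it may make sense to pair the SSD disk type with the larger per-worker disk suggested earlier. A sketch under the same assumptions as the invocation above (values hypothetical; other required flags elided):

```bash
# Hypothetical: SSD workers plus a larger disk for a merge-enabled run.
# Combine with the required flags from the earlier sketch.
python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/vcfs/*.vcf" \
  --output_table my-project:my_dataset.my_table \
  --worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd \
  --disk_size_gb 500
```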
-### `--num_bigquery_write_shards`
-
-Currently, the write operation to BigQuery in Dataflow is performed as a
-postprocessing step after the main transforms are done. As a workaround for
-BigQuery write limitations (more details
-[here](https://github.com/googlegenomics/gcp-variant-transforms/issues/199)),
-we have added "sharding" when writing to BigQuery. This makes the data load
-to BigQuery significantly faster as it parallelizes the process and enables
-loading large (>5TB) data to BigQuery at once.
-
-As a result, we recommend setting `--num_bigquery_write_shards 20` when loading
-any data that has more than 1 billion rows (after merging) or 1TB of final
-output. You may use a smaller number of write shards (e.g. 5) when using
-[partitioned output](#--partition_config_path) as each partition also acts as a
-"shard". Note that using a larger value (e.g. 50) can cause BigQuery write to
-fail as there is a maximum limit on the number of concurrent writes per table.
-
 ### `--partition_config_path`

 Partitioning the output can save significant query costs once the data is in
@@ -146,4 +126,3 @@ partition).
 As a result, we recommend setting the partition config for very large data
 where possible. Please see the [documentation](partitioning.md) for more
 details.
-
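To make `--partition_config_path` concrete: the config is a small YAML file. The sketch below is an assumption about the schema (key names recalled from the partitioning docs, not verified here); [partitioning.md](partitioning.md) is authoritative:

```bash
# Illustrative only: the schema is assumed; see partitioning.md for the real format.
cat > partition_config.yaml <<'EOF'
-  partition:
     partition_name: "chr1"
     regions:
       - "chr1"
-  partition:
     partition_name: "residual"
     regions:
       - "residual"
EOF
# Then pass it to the pipeline (path hypothetical):
#   --partition_config_path partition_config.yaml
```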