fix(docs): correct snowflake options for bulk ingest (#2004)
It was brought up in #1997 that the currently published options for
snowflake bulk ingestion are incorrect in the docs. This corrects them
to match the values used in the implementation.

This also adds new constants to the python `StatementOptions` enum for
the snowflake driver for users to reference.
zeroshade authored Jul 12, 2024
1 parent 6c7ad99 commit a5f8474
Showing 2 changed files with 22 additions and 4 deletions.
8 changes: 4 additions & 4 deletions docs/source/driver/snowflake.rst
@@ -285,23 +285,23 @@ The following informal benchmark demonstrates expected performance using default
The default settings for ingestion should be well balanced for many real-world configurations. If required, performance
and resource usage may be tuned with the following options on the :cpp:class:`AdbcStatement` object:

-``adbc.snowflake.rpc.ingest_writer_concurrency``
+``adbc.snowflake.statement.ingest_writer_concurrency``
Number of Parquet files to write in parallel. Default attempts to maximize workers based on logical cores detected,
but may need to be adjusted if running in a constrained environment. If set to 0, default value is used. Cannot be negative.

-``adbc.snowflake.rpc.ingest_upload_concurrency``
+``adbc.snowflake.statement.ingest_upload_concurrency``
Number of Parquet files to upload in parallel. Greater concurrency can smooth out TCP congestion and help make
use of available network bandwidth, but will increase memory utilization. Default is 8. If set to 0, default value is used.
Cannot be negative.

-``adbc.snowflake.rpc.ingest_copy_concurrency``
+``adbc.snowflake.statement.ingest_copy_concurrency``
Maximum number of COPY operations to run concurrently. Bulk ingestion performance is optimized by executing COPY
queries as files are still being uploaded. Snowflake COPY speed scales with warehouse size, so smaller warehouses
may benefit from setting this value higher to ensure long-running COPY queries do not block newly uploaded files
from being loaded. Default is 4. If set to 0, only a single COPY query will be executed as part of ingestion,
once all files have finished uploading. Cannot be negative.

-``adbc.snowflake.rpc.ingest_target_file_size``
+``adbc.snowflake.statement.ingest_target_file_size``
Approximate size of Parquet files written during ingestion. Actual size will be slightly larger, depending on
size of footer/metadata. Default is 10 MB. If set to 0, file size has no limit. Cannot be negative.
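
The options above are plain string key/value pairs set on the statement before ingestion. As an illustration only (not part of this commit), here is a minimal Python sketch via the DB-API layer; the URI, credentials, table name, and sample data are placeholders:

import pyarrow as pa

import adbc_driver_snowflake.dbapi

uri = "user:password@account/database/schema"  # placeholder, not a real account
data = pa.table({"ints": [1, 2, 3]})           # placeholder payload

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # Option values are strings; see the descriptions above for what
        # a value of 0 means for each option.
        cur.adbc_statement.set_options(
            **{
                "adbc.snowflake.statement.ingest_writer_concurrency": "4",
                "adbc.snowflake.statement.ingest_upload_concurrency": "8",
                "adbc.snowflake.statement.ingest_copy_concurrency": "4",
                "adbc.snowflake.statement.ingest_target_file_size": str(10 * 1024 * 1024),
            }
        )
        # Write the table to Snowflake, creating the target table.
        cur.adbc_ingest("my_table", data, mode="create")
    conn.commit()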

18 changes: 18 additions & 0 deletions python/adbc_driver_snowflake/adbc_driver_snowflake/__init__.py
@@ -112,6 +112,24 @@ class StatementOptions(enum.Enum):
    #: Number of concurrent streams being prefetched for a result set.
    #: Defaults to 10.
    PREFETCH_CONCURRENCY = "adbc.snowflake.rpc.prefetch_concurrency"
+    #: Number of Parquet files to write in parallel for bulk ingestion.
+    #: Defaults to NumCPU.
+    INGEST_WRITER_CONCURRENCY = "adbc.snowflake.statement.ingest_writer_concurrency"
+    #: Number of Parquet files to upload in parallel. Greater concurrency can
+    #: smooth out congestion and make use of available network bandwidth, but will
+    #: increase memory utilization. Cannot be negative. Defaults to 8.
+    INGEST_UPLOAD_CONCURRENCY = "adbc.snowflake.statement.ingest_upload_concurrency"
+    #: Maximum number of COPY operations to run concurrently for bulk ingestion.
+    #: Bulk ingestion performance is optimized by executing COPY queries as files
+    #: are still being uploaded. Snowflake COPY speed scales with warehouse size,
+    #: so smaller warehouses might benefit from a higher setting to prevent a
+    #: long-running COPY query from blocking newly uploaded files. Default is 4.
+    INGEST_COPY_CONCURRENCY = "adbc.snowflake.statement.ingest_copy_concurrency"
+    #: Approximate size of Parquet files written during ingestion. Actual size
+    #: will be slightly larger due to the size of the footer/metadata. Does not
+    #: account for batch size, so very large batches in the input stream will
+    #: produce similarly large Parquet files. Default is 10 MB.
+    INGEST_TARGET_FILE_SIZE = "adbc.snowflake.statement.ingest_target_file_size"


def connect(
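As a usage note (illustrative, not part of this commit): the new constants are importable from the package root, and each member's value is the option key string, so they can replace hard-coded keys:

from adbc_driver_snowflake import StatementOptions

options = {
    StatementOptions.INGEST_WRITER_CONCURRENCY.value: "4",
    StatementOptions.INGEST_TARGET_FILE_SIZE.value: str(100 * 1024 * 1024),
}
# Pass to a statement as in the sketch above:
# cur.adbc_statement.set_options(**options)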
