Replies: 1 comment
Hi, thanks for putting this discussion together. As the author of PR #2943, I wanted to weigh in with my findings, since my recent attempts to optimize upserts touch directly on the questions you raised. To answer your prompt about use cases and strategy, here is where my team's workload fits in and what we've learned about PyIceberg's current limits.

1. Our Use Case: A Mix of Use Case 1 and Use Case 3 (Large Backfill + Sparse Updates via Time Travel)
2. Trade-Offs & The Conceptual Model: CoW vs. Exponential Growth
3. Python vs. Rust Execution (Answering Section 5)
Next Steps for PR #2943

Looking forward to hearing your thoughts on this!
Upsert in PyIceberg: Use Cases, Trade Offs, and Strategy
There are multiple ongoing discussions around `upsert`. I've previously worked on `upsert` internals (e.g. #1878 on complex type comparison fallbacks and #1995 on transaction semantics and batch scanning). While working in this area, I started thinking that we may benefit from clarifying the intended positioning and strategy of `upsert`. I'd like to frame this discussion from workload impact first, then implementation.
1. Start from impact
In certain workloads, especially sparse updates into large tables, `upsert` can cause IO, data file rewrites, and small file creation that are disproportionate to the number of rows modified.
This suggests we may be implicitly optimizing for one class of workload while others pay a high cost.
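To make "disproportionate" concrete, here is a hypothetical back-of-envelope calculation (all numbers are made up for illustration; `write_amplification` is not a PyIceberg API):

```python
# Back-of-envelope write amplification for a sparse Copy-on-Write upsert.
# All numbers are hypothetical, for illustration only.

def write_amplification(rows_updated, row_size_bytes, files_touched, file_size_bytes):
    """Bytes physically rewritten divided by bytes logically changed."""
    logical_bytes = rows_updated * row_size_bytes
    physical_bytes = files_touched * file_size_bytes  # whole files are rewritten
    return physical_bytes / logical_bytes

# 10 updated rows of ~100 bytes, each landing in a different 128 MB data file:
amp = write_amplification(
    rows_updated=10,
    row_size_bytes=100,
    files_touched=10,
    file_size_bytes=128 * 1024 * 1024,
)
print(f"{amp:,.0f}x")  # roughly a million-fold amplification
```

Under these assumptions, changing ~1 KB of logical data rewrites over a gigabyte of physical data, which is the sparse-update pathology described above.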
2. Representative use cases
Some distinct scenarios:
1. Sparse micro updates
Upsert 10 random rows into a 10M row table, bucketed by `user_id`.

Expectation:
2. Incremental ingestion
10k to 200k rows per batch, mostly inserts, some updates.
Expectation:
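The insert-heavy shape of this workload can be sketched with a toy classifier that splits an incoming batch into updates (existing keys) and inserts (new keys). This is illustrative only; a real upsert operates on Arrow tables and Iceberg data files, not Python dicts:

```python
# Toy model of one incremental-ingestion batch: classify incoming rows as
# updates (key already in the table) or inserts (new key).

def classify_batch(existing_keys, batch):
    """Split a batch of (key, value) rows into (updates, inserts)."""
    updates = [(k, v) for k, v in batch if k in existing_keys]
    inserts = [(k, v) for k, v in batch if k not in existing_keys]
    return updates, inserts

table_keys = {1, 2, 3}
batch = [(2, "changed"), (4, "new"), (5, "new")]
updates, inserts = classify_batch(table_keys, batch)
# Mostly inserts, some updates: the workload shape described above.
```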
3. Large backfill / bulk merge
Millions of rows, high match rate.
Expectation:
4. Memory constrained environments
Small containers or serverless.
Expectation:
It would help to clarify which of these `upsert` is primarily optimized for.

3. Current conceptual model
Conceptually, `upsert` behaves like a delete of matching rows followed by an insert of the new rows.

For Copy on Write behavior, the delete step already rewrites the data files containing matching rows, even if only a single row matches.
The insert step then writes updated rows again as new data files.
In sparse scenarios, this can mean rewriting entire large data files to modify only a handful of rows, and then creating additional small files for the re-inserted rows.
This is logically correct, but not always optimal.
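A toy simulation of this delete + insert model, using in-memory lists of `(key, value)` rows as stand-ins for data files, makes the file-rewrite cost visible (purely illustrative; real Iceberg data files are immutable Parquet files on object storage):

```python
# Minimal simulation of Copy-on-Write upsert as "delete + insert".
# "Files" are lists of (key, value) rows; hypothetical, not PyIceberg code.

def cow_delete_insert_upsert(files, upsert_rows):
    """Return (new_table_state, number_of_files_written)."""
    upsert_keys = {k for k, _ in upsert_rows}
    new_files, files_written = [], 0
    # Delete step: any file containing a matching key is rewritten in
    # full without the matched rows, even if only one row matched.
    for f in files:
        if any(k in upsert_keys for k, _ in f):
            survivors = [(k, v) for k, v in f if k not in upsert_keys]
            if survivors:
                new_files.append(survivors)
                files_written += 1
        else:
            new_files.append(f)  # untouched file, no rewrite
    # Insert step: all upserted rows land in a new (often small) file.
    new_files.append(list(upsert_rows))
    files_written += 1
    return new_files, files_written

files = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
state, written = cow_delete_insert_upsert(files, [(2, "B")])
# A single matched row forced a full rewrite of its file, plus a new
# one-row file for the insert: 2 files written to change 1 row.
```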
4. Alternative framing: file level rewrite
An alternative way to conceptualize upsert (for Copy on Write style tables) would be:
Instead of modeling upsert as delete + insert, model it as:
Rewrite affected files with merged content.
This could reduce redundant IO and small file creation, at the cost of different memory and planning trade offs. This is similar to how AWS Athena handles `MERGE INTO`.

More broadly, there appear to be at least two conceptual strategies: delete-plus-insert, and file-level rewrite with merged content.
It may be worth documenting or formalizing this distinction, even if not immediately user configurable.
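Under the same toy in-memory model as above, the file-level rewrite strategy can be sketched as follows (hypothetical code, not a PyIceberg implementation): updated rows are folded into the files they already belong to, so no separate small insert file is produced for matched rows.

```python
# Sketch of "rewrite affected files with merged content" over in-memory
# "files" (lists of (key, value) rows). Illustrative only.

def file_level_rewrite_upsert(files, upsert_rows):
    updates = dict(upsert_rows)
    pending_inserts = dict(upsert_rows)  # keys not found in any file
    new_files = []
    for f in files:
        if any(k in updates for k, _ in f):
            # Rewrite this one file with the updated values merged in place.
            merged = [(k, updates.get(k, v)) for k, v in f]
            for k, _ in f:
                pending_inserts.pop(k, None)
            new_files.append(merged)
        else:
            new_files.append(f)  # untouched file, no rewrite
    if pending_inserts:  # genuinely new keys still need one new file
        new_files.append(list(pending_inserts.items()))
    return new_files

files = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
result = file_level_rewrite_upsert(files, [(2, "B")])
# Only the affected file is rewritten; no extra single-row file appears.
```

Compared with the delete + insert model, the sparse case here writes one merged file instead of one shrunken file plus one small insert file, which is the IO and small-file saving the section describes.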
5. Python vs Rust execution
In some discussions, there is an implicit suggestion that deeper optimization should wait for a Rust execution layer.
That is a valid long term direction. However, it would help to clarify: is the Python `upsert` intended to be production grade and performance competitive?

If Python is first class, then strategy clarity and optimization likely matter now.
If not, that should be documented so expectations are aligned.
6. Main questions
Which of these use cases is `upsert` primarily optimized for?

I'm happy to help structure documentation or experiments around these workload categories.