Replies: 1 comment
Hi, thanks for putting this discussion together. As the author of PR #2943, I wanted to weigh in with my findings, since my recent attempts to optimize upserts touch directly on the questions you raised. To answer your prompt about use cases and strategy, here is where my team's workload fits in and what we've learned about PyIceberg's current limits.

1. Our Use Case: A Mix of Use Case 1 and Use Case 3 (Large Backfill + Sparse Updates via Time Travel)
2. Trade-Offs & The Conceptual Model: CoW vs. Exponential Growth
3. Python vs. Rust Execution (Answering Section 5)
Next Steps for PR #2943

Looking forward to hearing your thoughts on this!
Upsert in PyIceberg: Use Cases, Trade Offs, and Strategy
There are multiple ongoing discussions around `upsert`. I've previously worked on `upsert` internals (e.g. #1878 on complex type comparison fallbacks and #1995 on transaction semantics and batch scanning). While working in this area, I started thinking that we may benefit from clarifying the intended positioning and strategy of `upsert`. I'd like to frame this discussion from workload impact first, then implementation.
1. Start from impact
In certain workloads, especially sparse updates into large tables, `upsert` can cause IO, data file rewrites, and small file creation that are disproportionate to the number of rows modified.
This suggests we may be implicitly optimizing for one class of workload while others pay a high cost.
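To make "disproportionate" concrete, here is a hypothetical back-of-envelope calculation (all numbers are made up for illustration; `write_amplification` is not a PyIceberg API):

```python
# Back-of-envelope write amplification for a sparse Copy-on-Write upsert.
# All numbers are hypothetical, for illustration only.

def write_amplification(rows_updated, row_size_bytes, files_touched, file_size_bytes):
    """Bytes physically rewritten divided by bytes logically changed."""
    logical_bytes = rows_updated * row_size_bytes
    physical_bytes = files_touched * file_size_bytes  # whole files are rewritten
    return physical_bytes / logical_bytes

# 10 updated rows of ~100 bytes, each landing in a different 128 MB data file:
amp = write_amplification(
    rows_updated=10,
    row_size_bytes=100,
    files_touched=10,
    file_size_bytes=128 * 1024 * 1024,
)
print(f"{amp:,.0f}x")  # roughly a million-fold amplification
```

Under these assumptions, changing ~1 KB of logical data rewrites over a gigabyte of physical data, which is the sparse-update pathology described above.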
2. Representative use cases
Some distinct scenarios:
1. Sparse micro updates
Upsert 10 random rows into a 10M row table, bucketed by `user_id`.

Expectation:
2. Incremental ingestion
10k to 200k rows per batch, mostly inserts, some updates.
Expectation:
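The insert-heavy shape of this workload can be sketched with a toy classifier that splits an incoming batch into updates (existing keys) and inserts (new keys). This is illustrative only; a real upsert operates on Arrow tables and Iceberg data files, not Python dicts:

```python
# Toy model of one incremental-ingestion batch: classify incoming rows as
# updates (key already in the table) or inserts (new key).

def classify_batch(existing_keys, batch):
    """Split a batch of (key, value) rows into (updates, inserts)."""
    updates = [(k, v) for k, v in batch if k in existing_keys]
    inserts = [(k, v) for k, v in batch if k not in existing_keys]
    return updates, inserts

table_keys = {1, 2, 3}
batch = [(2, "changed"), (4, "new"), (5, "new")]
updates, inserts = classify_batch(table_keys, batch)
# Mostly inserts, some updates: the workload shape described above.
```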
3. Large backfill / bulk merge
Millions of rows, high match rate.
Expectation:
4. Memory constrained environments
Small containers or serverless.
Expectation:
It would help to clarify which of these `upsert` is primarily optimized for.

3. Current conceptual model
Conceptually, `upsert` behaves like a delete of matching rows followed by an insert of the new rows.

For Copy on Write behavior, the delete step already rewrites the data files containing matching rows, even if only a single row matches.
The insert step then writes updated rows again as new data files.
In sparse scenarios, this can mean rewriting entire large data files to modify only a handful of rows, and then creating additional small files for the re-inserted rows.
This is logically correct, but not always optimal.
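A toy simulation of this delete + insert model, using in-memory lists of `(key, value)` rows as stand-ins for data files, makes the file-rewrite cost visible (purely illustrative; real Iceberg data files are immutable Parquet files on object storage):

```python
# Minimal simulation of Copy-on-Write upsert as "delete + insert".
# "Files" are lists of (key, value) rows; hypothetical, not PyIceberg code.

def cow_delete_insert_upsert(files, upsert_rows):
    """Return (new_table_state, number_of_files_written)."""
    upsert_keys = {k for k, _ in upsert_rows}
    new_files, files_written = [], 0
    # Delete step: any file containing a matching key is rewritten in
    # full without the matched rows, even if only one row matched.
    for f in files:
        if any(k in upsert_keys for k, _ in f):
            survivors = [(k, v) for k, v in f if k not in upsert_keys]
            if survivors:
                new_files.append(survivors)
                files_written += 1
        else:
            new_files.append(f)  # untouched file, no rewrite
    # Insert step: all upserted rows land in a new (often small) file.
    new_files.append(list(upsert_rows))
    files_written += 1
    return new_files, files_written

files = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
state, written = cow_delete_insert_upsert(files, [(2, "B")])
# A single matched row forced a full rewrite of its file, plus a new
# one-row file for the insert: 2 files written to change 1 row.
```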
4. Alternative framing: file level rewrite
An alternative way to conceptualize upsert (for Copy on Write style tables) would be:
Instead of modeling upsert as delete + insert, model it as:
Rewrite affected files with merged content.
This could reduce redundant IO and small file creation, at the cost of different memory and planning trade offs. This is similar to how AWS Athena handles `MERGE INTO`.

More broadly, there appear to be at least two conceptual strategies: delete-plus-insert, and file-level rewrite with merged content.
It may be worth documenting or formalizing this distinction, even if not immediately user configurable.
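Under the same toy in-memory model as above, the file-level rewrite strategy can be sketched as follows (hypothetical code, not a PyIceberg implementation): updated rows are folded into the files they already belong to, so no separate small insert file is produced for matched rows.

```python
# Sketch of "rewrite affected files with merged content" over in-memory
# "files" (lists of (key, value) rows). Illustrative only.

def file_level_rewrite_upsert(files, upsert_rows):
    updates = dict(upsert_rows)
    pending_inserts = dict(upsert_rows)  # keys not found in any file
    new_files = []
    for f in files:
        if any(k in updates for k, _ in f):
            # Rewrite this one file with the updated values merged in place.
            merged = [(k, updates.get(k, v)) for k, v in f]
            for k, _ in f:
                pending_inserts.pop(k, None)
            new_files.append(merged)
        else:
            new_files.append(f)  # untouched file, no rewrite
    if pending_inserts:  # genuinely new keys still need one new file
        new_files.append(list(pending_inserts.items()))
    return new_files

files = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
result = file_level_rewrite_upsert(files, [(2, "B")])
# Only the affected file is rewritten; no extra single-row file appears.
```

Compared with the delete + insert model, the sparse case here writes one merged file instead of one shrunken file plus one small insert file, which is the IO and small-file saving the section describes.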
5. Python vs Rust execution
In some discussions, there is an implicit suggestion that deeper optimization should wait for a Rust execution layer.
That is a valid long term direction. However, it would help to clarify: is the Python `upsert` intended to be production grade and performance competitive?

If Python is first class, then strategy clarity and optimization likely matter now.
If not, that should be documented so expectations are aligned.
6. Main questions
Which of these use cases is `upsert` primarily optimized for?

I'm happy to help structure documentation or experiments around these workload categories.