MERGE INTO sends Trino Coordinator into constant GC state due to highly fragmented Iceberg tables #25192
It comes as a bit of a surprise: we see an excessive number of small files, 1.5-3 KB in size, after MERGE INTO operations, on the order of hundreds per partition. This was not the case when we were running Trino 419. Of course, we try to run OPTIMIZE regularly to counter that. The SQL query that produces the MERGE INTO input also ensures that data is sorted by partition, which in Spark's case reduced small-file fragmentation but seems to have zero effect in Trino.
Apparently those small files are DELETE files that are never deleted when I run "ALTER TABLE table EXECUTE optimize".
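For context, a sketch of the table-maintenance commands the Trino Iceberg connector exposes for this kind of cleanup; the table name is a placeholder and the retention values are illustrative, not taken from this report:

```sql
-- Rewrite small data files into larger ones (the step reported above)
ALTER TABLE my_table EXECUTE optimize;

-- Drop old snapshots so unreferenced files become eligible for removal
-- (retention value is illustrative)
ALTER TABLE my_table EXECUTE expire_snapshots(retention_threshold => '7d');

-- Remove files that no snapshot references anymore
ALTER TABLE my_table EXECUTE remove_orphan_files(retention_threshold => '7d');
```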
OPTIMIZE can only remove files if they are not referenced by any snapshots. How many snapshots do you have?
@hashhar Just one, the latest. It is not about the number of snapshots, it is about the snapshot itself: even with a single snapshot, a very large number of DELETE files kills performance. For some reason running OPTIMIZE doesn't get rid of dangling empty delete files. I see that there is some development in this regard in Iceberg starting from version 1.7.
There is certainly a difference between how MERGE INTO is handled by Trino vs. Spark. Even with Iceberg 1.5 this issue never comes up in a Spark environment.
Here is a log for reference from Spark while I am running DELETE file compaction:
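A minimal sketch of the Spark-side calls used for this kind of delete-file compaction, assuming Iceberg's built-in procedures are available; the catalog and table names are hypothetical:

```sql
-- Compact position delete files and drop dangling deletes
-- (catalog/table names are hypothetical)
CALL my_catalog.system.rewrite_position_delete_files(table => 'db.my_table');

-- Data-file compaction, for comparison
CALL my_catalog.system.rewrite_data_files(table => 'db.my_table');
```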
The Trino coordinator may go out of service if it opens enough highly fragmented partitioned Iceberg tables during MERGE INTO operations. I've been fighting an issue recently where a bunch of clusters were very unstable, OOMing on the coordinator side. I increased the allocation to 240 GB of JVM heap and it quickly ran out of RAM too. I traced it down to poorly maintained Iceberg tables that had become highly fragmented (tons of files around 1-3 KB in size). One table doesn't cause the issue; several such tables together cause the problem. According to Envoy logs, the last thing the coordinator does is read a lot of ./.../metadata/*.avro files, and then it is done.
In my setup, reproducing this would require ~10 tables of 50-100 GB each, with millions of Parquet files of 1-5 KB in size. Simultaneous MERGE INTO operations that update a good number of partitions in those tables exhaust the coordinator's RAM very quickly, and it goes into a sad state of constant GC.
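A sketch of the kind of statement involved, assuming a partitioned Iceberg target; the table and column names are placeholders, not the actual queries from this setup:

```sql
-- Placeholder names; each run touches many partitions of the fragmented target
MERGE INTO iceberg.warehouse.target t
USING staging_updates s
  ON t.id = s.id AND t.partition_day = s.partition_day
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (id, partition_day, value)
  VALUES (s.id, s.partition_day, s.value);
```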
I am not sure there is any good resolution possible on the Trino end, but auto-detection of highly fragmented tables that may cause trouble would be a good start.
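One way such detection could be approximated today is a query against the Iceberg connector's "$files" metadata table; a sketch, with a placeholder table name and an arbitrary 1 MB threshold:

```sql
-- Count small files per content type
-- (content: 0 = data, 1 = position deletes, 2 = equality deletes)
SELECT content,
       count(*)                AS small_files,
       sum(file_size_in_bytes) AS total_bytes
FROM iceberg.warehouse."target$files"
WHERE file_size_in_bytes < 1048576
GROUP BY content;
```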