
MERGE INTO sends Trino Coordinator into constant GC state due to highly fragmented Iceberg tables #25192

Open
vburenin opened this issue Feb 28, 2025 · 6 comments
Labels
iceberg Iceberg connector

Comments

The Trino coordinator may go out of service if it opens enough highly fragmented partitioned Iceberg tables during MERGE INTO operations. I've been fighting an issue recently where a bunch of clusters were very unstable, OOMing on the coordinator side. I increased the allocation to 240GB of JVM heap and it quickly ran out of RAM too. I traced it down to poorly maintained Iceberg tables that had become highly fragmented (tons of files around 1-3KB in size). One such table doesn't cause the issue; several of them together do. According to Envoy logs, the last thing the coordinator does is read a lot of ./.../metadata/*.avro files, and then it is done.

In my setup, reproducing this would require ~10 tables of 50-100GB, each with millions of Parquet files of 1-5KB in size. Simultaneous MERGE INTO operations that update a good number of partitions in those tables exhaust coordinator RAM very quickly, and it goes into a sad state of constant GC.

I am not sure there is any good resolution possible on the Trino end, but auto-detection of highly fragmented tables that may cause trouble would be a good start.
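
A quick way to gauge how fragmented a table is before a MERGE INTO is to look at the Iceberg "$files" metadata table in Trino. This is only a sketch; example_table is a placeholder and the 5KB threshold is arbitrary:

    SELECT
      count(*) AS total_files,
      count_if(file_size_in_bytes < 5 * 1024) AS tiny_files, -- files under ~5KB
      sum(file_size_in_bytes) AS total_bytes
    FROM "example_table$files";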

ebyhr added the iceberg Iceberg connector label Mar 3, 2025

vburenin commented Mar 3, 2025

It comes a bit as a surprise: we see an excessive number of small files of 1.5-3KB in size after MERGE INTO operations, like hundreds per partition. When we were running Trino 419 this was not the case. Of course, we try to run OPTIMIZE regularly to counter that. The SQL query that produces the MERGE INTO input also ensures that data is sorted by partition, which in the case of Spark reduced small-file fragmentation but seems to have zero effect in Trino.

vburenin commented Mar 3, 2025

Apparently those small files are DELETE files that are never removed when I run "ALTER TABLE table EXECUTE optimize".
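
For reference, the command I run is roughly the following; the table name is a placeholder, and the optional file_size_threshold variant (to compact only files below a given size) is taken from the Trino docs:

    ALTER TABLE example_table EXECUTE optimize;
    -- or, limiting compaction to small files only:
    ALTER TABLE example_table EXECUTE optimize(file_size_threshold => '128MB');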

hashhar commented Mar 5, 2025

OPTIMIZE can only remove files if they are not referenced by any snapshots. How many snapshots do you have?
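
If older snapshots are still being retained, they have to be expired before the files rewritten by OPTIMIZE can actually go away. Roughly (table name and retention values are placeholders):

    ALTER TABLE example_table EXECUTE expire_snapshots(retention_threshold => '7d');
    ALTER TABLE example_table EXECUTE remove_orphan_files(retention_threshold => '7d');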

vburenin commented Mar 5, 2025

@hashhar Just one, the last one. It is not about the number of snapshots, it is about the snapshot itself. Even with a single snapshot, a very large number of DELETE files kills performance. For some reason, running OPTIMIZE doesn't get rid of dangling empty delete files. I see that there is some development in this regard in Iceberg starting from version 1.7.
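
The split between data and delete files is visible in the "$files" metadata table; roughly (table name is a placeholder; content 0 is data, 1 is position deletes, 2 is equality deletes):

    SELECT content, count(*) AS files, sum(file_size_in_bytes) AS bytes
    FROM "example_table$files"
    GROUP BY content;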

vburenin commented Mar 5, 2025

There is certainly a difference between how MERGE INTO is handled by Trino vs. Spark. Even with Iceberg 1.5, this issue never comes up in a Spark environment.

vburenin commented Mar 5, 2025

Here is a log line from Spark for reference, captured while I am running DELETE file compaction:

25/03/05 16:26:21 INFO RewritePositionDeleteFilesSparkAction: Rewrite position deletes ready to be committed - Rewriting 6918 position delete files (BIN-PACK, file group 489/1711, org.apache.iceberg.util.StructProjection@6a9c3853 (1/1)) in iceberg.etl_schema.etl_1690235017985880351_some_random_table_name
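
For completeness, the compaction itself is kicked off with something along the lines of the Iceberg Spark procedure below (catalog name is a placeholder, options omitted):

    CALL iceberg.system.rewrite_position_delete_files(table => 'etl_schema.etl_1690235017985880351_some_random_table_name');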

dain marked this as a duplicate of #25102 Mar 9, 2025