MERGE INTO sends Trino Coordinator into constant GC state due to highly fragmented Iceberg tables #25192
It comes as a bit of a surprise: we see an excessive number of small files, 1.5-3 KB in size, after MERGE INTO operations, on the order of hundreds per partition. This was not the case when we were running Trino 419. Of course, we try to run OPTIMIZE regularly to counter that. The SQL query that produces the MERGE INTO input also ensures that data is sorted by partition, which in Spark's case reduced small-file fragmentation but seems to have zero effect in Trino.
Apparently those small files are DELETE files that are never deleted when I run "ALTER TABLE table EXECUTE optimize".
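For context, a sketch of the table-maintenance commands the Trino Iceberg connector exposes for this kind of cleanup; the table name is a placeholder and the retention values are illustrative, not taken from this report:

```sql
-- Rewrite small data files into larger ones (the step reported above)
ALTER TABLE my_table EXECUTE optimize;

-- Drop old snapshots so unreferenced files become eligible for removal
-- (retention value is illustrative)
ALTER TABLE my_table EXECUTE expire_snapshots(retention_threshold => '7d');

-- Remove files that no snapshot references anymore
ALTER TABLE my_table EXECUTE remove_orphan_files(retention_threshold => '7d');
```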
OPTIMIZE can only remove files if they are not referenced by any snapshots. How many snapshots do you have?
@hashhar Just one, the latest. It is not about the number of snapshots, it is about the snapshot itself: even with a single snapshot, a very large number of DELETE files kills performance. For some reason running OPTIMIZE doesn't get rid of dangling empty delete files. I see that there is some development in this regard in Iceberg starting from version 1.7.
There is certainly a difference between how MERGE INTO is handled by Trino vs. Spark. Even with Iceberg 1.5 this issue never comes up in a Spark environment.
Here is a log for reference from Spark while I am running DELETE file compaction:
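A minimal sketch of the Spark-side calls used for this kind of delete-file compaction, assuming Iceberg's built-in procedures are available; the catalog and table names are hypothetical:

```sql
-- Compact position delete files and drop dangling deletes
-- (catalog/table names are hypothetical)
CALL my_catalog.system.rewrite_position_delete_files(table => 'db.my_table');

-- Data-file compaction, for comparison
CALL my_catalog.system.rewrite_data_files(table => 'db.my_table');
```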
The Trino coordinator may go out of service if it opens enough highly fragmented partitioned Iceberg tables during MERGE INTO operations. I've been fighting an issue recently where a bunch of clusters were very unstable, OOMing on the coordinator side. I increased the allocation to 240 GB of JVM heap and it quickly ran out of RAM too. I traced it down to poorly maintained Iceberg tables that had become highly fragmented (tons of files around 1-3 KB in size). One table doesn't cause the issue; several such tables together cause the problem. According to Envoy logs, the last thing the coordinator does is read a lot of ./.../metadata/*.avro files, and then it is done.
In my setup, reproducing this would require ~10 tables of 50-100 GB each, with millions of Parquet files of 1-5 KB in size. Simultaneous MERGE INTO operations that update a good number of partitions in those tables exhaust the coordinator's RAM very quickly, and it goes into a sad state of constant GC.
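A sketch of the kind of statement involved, assuming a partitioned Iceberg target; the table and column names are placeholders, not the actual queries from this setup:

```sql
-- Placeholder names; each run touches many partitions of the fragmented target
MERGE INTO iceberg.warehouse.target t
USING staging_updates s
  ON t.id = s.id AND t.partition_day = s.partition_day
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (id, partition_day, value)
  VALUES (s.id, s.partition_day, s.value);
```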
I am not sure there is any good resolution possible on the Trino end, but auto-detection of highly fragmented tables that may cause trouble would be a good start.
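One way such detection could be approximated today is a query against the Iceberg connector's "$files" metadata table; a sketch, with a placeholder table name and an arbitrary 1 MB threshold:

```sql
-- Count small files per content type
-- (content: 0 = data, 1 = position deletes, 2 = equality deletes)
SELECT content,
       count(*)                AS small_files,
       sum(file_size_in_bytes) AS total_bytes
FROM iceberg.warehouse."target$files"
WHERE file_size_in_bytes < 1048576
GROUP BY content;
```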