[Feature]: Support Iceberg optimization tool #9313

Open
nqvuong1998 opened this issue Aug 9, 2024 · 2 comments

Comments

@nqvuong1998

Description

Currently, the Nessie catalog has a GC tool to clean up orphaned files. Nessie could additionally provide an optimization tool that compacts and sorts data files and expires snapshots for Iceberg tables.

Expected Use Cases

GC and optimization tools are useful for keeping Iceberg tables optimized.

Requested Changes in public API

No response

@nqvuong1998
Author

cc @snazy @ajantha-bhat

@ajantha-bhat
Contributor

ajantha-bhat commented Aug 9, 2024

@nqvuong1998: Nessie GC handles snapshot expiry and orphan-file cleanup together, so we don't need a separate expire-snapshots implementation. However, because Nessie supports catalog-level tags, we cannot update the table metadata referenced by those tags after expiring snapshots. Hence, Nessie GC does not update the table metadata files.

Other operations, such as compaction (including sort), can use the engine's existing implementation.
By default, Spark procedures (compaction, for example) run on the table at the default branch, but the branch can also be specified explicitly.
Testcase:
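
As a rough illustration of the branch-specific case (a sketch, not the test case referenced above), the snippet below builds the SQL for Iceberg's `rewrite_data_files` Spark procedure. The helper `build_compaction_call`, the catalog name `nessie`, and the table `db.events` are hypothetical. It also assumes the procedure's `table` argument accepts Nessie's backtick-quoted `table@branch` form; that should be verified against your Nessie and Iceberg versions.

```python
from typing import Optional


def build_compaction_call(
    catalog: str,
    namespace: str,
    table: str,
    branch: Optional[str] = None,
    sort_order: Optional[str] = None,
) -> str:
    """Build a Spark SQL CALL statement for Iceberg's rewrite_data_files procedure.

    Hypothetical helper for illustration only.
    """
    # Assumption: a Nessie branch is addressed by appending '@<branch>' to the
    # table name and quoting the result in backticks.
    table_ref = f"`{table}@{branch}`" if branch else table
    args = [f"table => '{namespace}.{table_ref}'"]
    if sort_order:
        # Sort-based compaction: rewrite data files using the given sort order.
        args.append("strategy => 'sort'")
        args.append(f"sort_order => '{sort_order}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


if __name__ == "__main__":
    # Without a branch, the catalog's configured reference is used.
    print(build_compaction_call("nessie", "db", "events"))
    # Targeting a specific branch with sort-based compaction.
    print(build_compaction_call("nessie", "db", "events",
                                branch="dev",
                                sort_order="id DESC NULLS LAST"))
```

With a Spark session configured for the Nessie catalog and the Iceberg SQL extensions, the resulting string could be passed to `spark.sql(...)`.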
