Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-35493][snapshot] Add historical cleanup for FlinkStateSnapshot CRs #860

Merged
merged 7 commits into from
Aug 20, 2024

Conversation

mateczagany
Copy link
Contributor

Brief change log

  • Added InformerEventSource for Flink resources to be able to list linked snapshots during reconciliation
  • If FlinkStateSnapshot resources are enabled, clean them up by age/count in SnapshotObserver
  • Add unit tests

Verifying this change

  • Added unit tests
  • Manual testing with checkpoint and snapshot resources, with age and count based cleanup policies

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

@mateczagany
Copy link
Contributor Author

mateczagany commented Aug 9, 2024

@anupamaggarwal @gyfora @ferenc-csaky

We need this PR to be able to automatically clean up FlinkStateSnapshot resources created by the operator.

In the current state only periodic snapshots will be cleaned up, I was not sure if we want to also include upgrade snapshots. I would love to hear your inputs on this.

I also created new configs for checkpoints, because I felt like users might want to use different cleanup policies for checkpoints than savepoints.

@gyfora
Copy link
Contributor

gyfora commented Aug 9, 2024

Just for clarification, if the snapshot resource is not installed / enabled, will the old cleanup mechanism still work in the legacy mode?

@mateczagany
Copy link
Contributor Author

I forgot to add a check when setting up informer event sources, the latest commit should take care of that. I have manually checked and the cleanup is still working the same way it has been if CRD is not installed.

@mateczagany mateczagany changed the title [FLINK-35493] Add historical cleanup for FlinkStateSnapshot CRs [FLINK-35493][snapshot] Add historical cleanup for FlinkStateSnapshot CRs Aug 10, 2024
@mateczagany
Copy link
Contributor Author

After talking with @gyfora I have made it so legacy savepoints will be cleaned up even if FlinkStateSnapshot is enabled for the job. This will use the number of completed FlinkStateSnapshot CRs and subtract that from the maxCount. This should help integrating jobs from older operator versions.

I have also changed the logic a bit so we extract metadata.creationTimestamp to determine the age of the snapshot as triggerTimestamp might not always be set (e.g. FlinkStateSnapshot controller did not start).

When this PR gets merged, I will also update #862 to test cleaning up legacy savepoints with FlinkStateSnapshot enabled.

Copy link
Contributor

@gyfora gyfora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I missed something but seems to me that in the new cleanup logic we can easily end up In situations where all snapshots are deleted which would be good to avoid

@gyfora gyfora merged commit c8a6ca8 into apache:main Aug 20, 2024
169 checks passed
@mateczagany mateczagany deleted the FLINK-35493 branch August 20, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants