Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-35414] Rework last-state upgrade mode to support job cancellation as suspend mechanism #871

Merged
merged 8 commits into from
Oct 4, 2024

Conversation

gyfora
Copy link
Contributor

@gyfora gyfora commented Aug 26, 2024

What is the purpose of the change

Rework the last-state upgrade mode to not be solely reliant on HA metadata but to be flexible and use the job cancel mechanism in other cases. This change also allows the session jobs to use last-state upgrade mode where HA metadata is not accessible the same way as for Application clusters.

Last state upgrades using cancel

Currently last-state upgrade mode relies purely on HA metadata that is available for application deployments to simulate a failover during upgrade and make the JM pick up the correct last state automatically. This has a couple limitations, first and foremost is that it is not applicable to session jobs.

With this PR we introduce a new mechanism for last-state upgrades of non-terminal jobs (the terminal case is already covered by existing mechanisms):

  1. Cancel the job through rest API (async operation)
  2. Wait until the job cancellation completes and the job becomes CANCELLED (terminal state)
  3. Observe last state information through REST API and use that for upgrade (upgrade flow already there for terminal jobs)

This new mechanism is similar to what a human operator would do for these jobs and does not rely on HA metadata and works for both application and session jobs and also in cases where HA metadata is not usable otherwise such as during version upgrades, or if HA is disabled etc.

Changes to the reconciliation flow for correct cancellation during upgrades

Currently the async nature of cancellation is not handled correctly in the reconciler even though session jobs use this to cancel jobs which can lead to in extreme cases 2 parallel jobs running on the same cluster.

To handle this, the reconciler now explicitly checks for cancelling state and does not perform other upgrade actions until that completes. Also after initiating an async cancel action through the REST API we immediately exit and re-schedule the observation to wait until the cancellation completes and we can observe the last state of the cluster.

The observer now recognises the CANCELLING state also as special user initiated action and when the job becomes CANCELLED (or not found in case of session jobs) it marks it explicitly SUSPENDED. This means that the reconciler will always resumes it subsequently, eliminating a risk of ending up with a cancelled job if the spec change was rolled back in the meantime.

Refactored and improved FlinkService cancel methods

To eliminate duplicate logic and overall reduce complexity the cancel application / session jobs methods have been refactored to re-use the common parts. Also a significant portion of the logic has been removed by separating the suspend and restore (upgrade) mechanism.

The JobUpgrade utility class now encapsulates the necessary suspend and restore mechanism for the stateful upgrade depending on the current observed state and also. This allows us to better handle cases of async cancellation (SuspendMode.CANCEL) or if the job is already cancelled (or in terminal state) do nothing (SuspendMode.NOOP) and simply perform the restore.

Misc session job changes / fixes

In addition to making last-state upgrade mode generally available for session jobs this PR includes several critical fixes to the core upgrade cleanup logic as a result of this work such as:

  • Improved cleanup method that correctly waits until the job is fully cancelled instead of deleting the CR too early (risk of leaving the job there)
  • Call observe during cancel for session jobs for correct behaviour
  • Use correct job config generation for session jobs similar to applications, such as retaining checkpoints during cancellation by default which is needed for the above cancel mechanism

Other changes / improvements as an outcome

  • Remove last-state upgrade limitations for apps and use cancel in these cases (flink version upgrade for non-running jobs, jobs without HA enabled)

Verifying this change

  • Existing unit and E2Es guard the current behaviour
  • New unit tests have been added to cover the session job last-state upgrades and the improved observe, reconcile, cleanup flow
  • Extensive manual testing on local kubernetes
  • Session job e2e extended with last-state test

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs updated

Copy link
Contributor

@mateczagany mateczagany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this change extensively locally and did not find any issues. Added some minor comments/questions, but overall I am happy with these changes.

@gyfora
Copy link
Contributor Author

gyfora commented Oct 3, 2024

Thanks for the review @mateczagany , addressed your comments

Copy link
Contributor

@mxm mxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gyfora! The core changes are a bit tricky to review due to the refactorings and related changes, but LGTM!

@gyfora gyfora merged commit d1827a4 into apache:main Oct 4, 2024
229 checks passed
@sap1ens
Copy link
Contributor

sap1ens commented Nov 7, 2024

@gyfora FYI, I think I found a regression: https://issues.apache.org/jira/browse/FLINK-36673

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants