-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-35414] Rework last-state upgrade mode to support job cancellation as suspend mechanism #871
Conversation
1dfb263
to
24f0742
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested this change extensively locally and did not find any issues. Added some minor comments/questions, but overall I am happy with these changes.
...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java
Outdated
Show resolved
Hide resolved
...rnetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java
Outdated
Show resolved
Hide resolved
...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java
Show resolved
Hide resolved
...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java
Show resolved
Hide resolved
...c/main/java/org/apache/flink/kubernetes/operator/config/KubernetesOperatorConfigOptions.java
Outdated
Show resolved
Hide resolved
...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java
Show resolved
Hide resolved
...rator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/ReconciliationUtils.java
Outdated
Show resolved
Hide resolved
...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java
Show resolved
Hide resolved
...operator/src/main/java/org/apache/flink/kubernetes/operator/validation/DefaultValidator.java
Show resolved
Hide resolved
…ion as suspend mechanism
Thanks for the review @mateczagany , addressed your comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gyfora! The core changes are a bit tricky to review due to the refactorings and related changes, but LGTM!
@gyfora FYI, I think I found a regression: https://issues.apache.org/jira/browse/FLINK-36673 |
What is the purpose of the change
Rework the last-state upgrade mode to not be solely reliant on HA metadata but to be flexible and use the job cancel mechanism in other cases. This change also allows the session jobs to use last-state upgrade mode where HA metadata is not accessible the same way as for Application clusters.
Last state upgrades using cancel
Currently last-state upgrade mode relies purely on HA metadata that is available for application deployments to simulate a failover during upgrade and make the JM pick up the correct last state automatically. This has a couple limitations, first and foremost is that it is not applicable to session jobs.
With this PR we introduce a new mechanism for last-state upgrades of non-terminal jobs (the terminal case is already covered by existing mechanisms):
This new mechanism is similar to what a human operator would do for these jobs and does not rely on HA metadata and works for both application and session jobs and also in cases where HA metadata is not usable otherwise such as during version upgrades, or if HA is disabled etc.
Changes to the reconciliation flow for correct cancellation during upgrades
Currently the async nature of cancellation is not handled correctly in the reconciler even though session jobs use this to cancel jobs which can lead to in extreme cases 2 parallel jobs running on the same cluster.
To handle this, the reconciler now explicitly checks for cancelling state and does not perform other upgrade actions until that completes. Also after initiating an async cancel action through the REST API we immediately exit and re-schedule the observation to wait until the cancellation completes and we can observe the last state of the cluster.
The observer now recognises the CANCELLING state also as special user initiated action and when the job becomes CANCELLED (or not found in case of session jobs) it marks it explicitly SUSPENDED. This means that the reconciler will always resumes it subsequently, eliminating a risk of ending up with a cancelled job if the spec change was rolled back in the meantime.
Refactored and improved FlinkService cancel methods
To eliminate duplicate logic and overall reduce complexity the cancel application / session jobs methods have been refactored to re-use the common parts. Also a significant portion of the logic has been removed by separating the suspend and restore (upgrade) mechanism.
The
JobUpgrade
utility class now encapsulates the necessary suspend and restore mechanism for the stateful upgrade depending on the current observed state and also. This allows us to better handle cases of async cancellation (SuspendMode.CANCEL) or if the job is already cancelled (or in terminal state) do nothing (SuspendMode.NOOP) and simply perform the restore.Misc session job changes / fixes
In addition to making last-state upgrade mode generally available for session jobs this PR includes several critical fixes to the core upgrade cleanup logic as a result of this work such as:
Other changes / improvements as an outcome
Verifying this change
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors
: noDocumentation