[FLINK-35414] Rework last-state upgrade mode to support job cancellation as suspend mechanism #871

gyfora · 2024-08-26T08:53:25Z

What is the purpose of the change

Rework the last-state upgrade mode so that it no longer relies solely on HA metadata and can use the job cancel mechanism in other cases. This change also allows session jobs to use last-state upgrade mode, where HA metadata is not accessible in the same way as for Application clusters.

Last state upgrades using cancel

Currently last-state upgrade mode relies purely on the HA metadata that is available for application deployments to simulate a failover during upgrade and make the JM pick up the correct last state automatically. This has a couple of limitations, first and foremost that it is not applicable to session jobs.

With this PR we introduce a new mechanism for last-state upgrades of non-terminal jobs (the terminal case is already covered by existing mechanisms):

Cancel the job through the REST API (async operation)
Wait until the job cancellation completes and the job becomes CANCELLED (terminal state)
Observe last state information through REST API and use that for upgrade (upgrade flow already there for terminal jobs)
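The three steps above can be sketched as a single non-blocking reconcile step that is called repeatedly. This is only a minimal illustration under stated assumptions: JobClient, FakeClient, suspendStep and the checkpoint path are hypothetical names invented for this sketch and are not the operator's real classes or the Flink REST client API. The enum uses the CANCELED spelling of Flink's JobStatus.

```java
import java.util.Optional;

/** Hypothetical sketch of the cancel-based suspend flow; not the operator's real code. */
class CancelBasedUpgradeSketch {

    /** Small subset of job states, enough for this illustration. */
    enum JobState { RUNNING, CANCELLING, CANCELED }

    /** Hypothetical stand-in for whatever talks to the Flink REST API. */
    interface JobClient {
        JobState state();
        void cancelAsync();
        Optional<String> latestCheckpointPath();
    }

    /**
     * One reconcile pass. Returns the checkpoint path to restore from once the job
     * is terminal, or empty if the caller should re-schedule and check again later.
     */
    static Optional<String> suspendStep(JobClient client) {
        switch (client.state()) {
            case RUNNING:
                // Step 1: trigger the async cancel and return immediately.
                client.cancelAsync();
                return Optional.empty();
            case CANCELLING:
                // Step 2: cancellation still in progress, do nothing else yet.
                return Optional.empty();
            default:
                // Step 3: job is terminal, read the last state for the restore.
                return client.latestCheckpointPath();
        }
    }

    /** In-memory fake: cancellation completes one poll after it was requested. */
    static final class FakeClient implements JobClient {
        private JobState state = JobState.RUNNING;

        @Override
        public JobState state() {
            JobState current = state;
            if (state == JobState.CANCELLING) {
                state = JobState.CANCELED;
            }
            return current;
        }

        @Override
        public void cancelAsync() {
            state = JobState.CANCELLING;
        }

        @Override
        public Optional<String> latestCheckpointPath() {
            // Made-up path, for illustration only.
            return Optional.of("file:///checkpoints/chk-42");
        }
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient();
        Optional<String> restorePath = Optional.empty();
        int passes = 0;
        while (!restorePath.isPresent()) {
            restorePath = suspendStep(client);
            passes++;
        }
        System.out.println("Restoring from " + restorePath.get() + " after " + passes + " passes");
    }
}
```

Returning an empty result instead of blocking mirrors the idea of exiting and re-scheduling the observation rather than waiting inside a single reconcile call.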

This new mechanism is similar to what a human operator would do for these jobs. It does not rely on HA metadata, works for both application and session jobs, and also covers cases where HA metadata is otherwise not usable, such as during version upgrades or when HA is disabled.

Changes to the reconciliation flow for correct cancellation during upgrades

Currently the async nature of cancellation is not handled correctly in the reconciler, even though session jobs use it to cancel jobs. In extreme cases this can lead to 2 parallel jobs running on the same cluster.

To handle this, the reconciler now explicitly checks for the cancelling state and does not perform other upgrade actions until the cancellation completes. Also, after initiating an async cancel action through the REST API, we immediately exit and re-schedule the observation to wait until the cancellation completes and we can observe the last state of the cluster.

The observer now also recognises the CANCELLING state as a special user-initiated action, and when the job becomes CANCELLED (or is not found, in the case of session jobs) it explicitly marks it SUSPENDED. This means that the reconciler will always resume it subsequently, eliminating the risk of ending up with a cancelled job if the spec change was rolled back in the meantime.
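A minimal sketch of the observer rule described above, assuming a simplified state model. The names CancellingObserverSketch, ObservedState and observe are hypothetical and do not reflect the actual JobStatusObserver code. Keeping the CANCELLING state while the cluster still reports another state is an assumption of this sketch.

```java
import java.util.Optional;

/** Hypothetical sketch of how an observer could resolve an in-flight cancellation. */
class CancellingObserverSketch {

    enum ObservedState { RUNNING, CANCELLING, CANCELED, SUSPENDED }

    /**
     * @param previous   state recorded by the previous observation
     * @param reported   state reported by the cluster, empty if the job was not found
     * @param sessionJob whether the resource is a session job
     */
    static ObservedState observe(
            ObservedState previous, Optional<ObservedState> reported, boolean sessionJob) {
        if (previous != ObservedState.CANCELLING) {
            // No cancellation in flight: take what the cluster reports, if anything.
            return reported.orElse(previous);
        }
        boolean cancelled = reported.isPresent() && reported.get() == ObservedState.CANCELED;
        boolean goneFromSession = !reported.isPresent() && sessionJob;
        if (cancelled || goneFromSession) {
            // Cancellation finished: record SUSPENDED so the job is resumed later.
            return ObservedState.SUSPENDED;
        }
        // Assumption of this sketch: otherwise keep waiting in CANCELLING.
        return ObservedState.CANCELLING;
    }

    public static void main(String[] args) {
        System.out.println(observe(
                ObservedState.CANCELLING, Optional.of(ObservedState.CANCELED), false));
        System.out.println(observe(ObservedState.CANCELLING, Optional.empty(), true));
        System.out.println(observe(
                ObservedState.CANCELLING, Optional.of(ObservedState.RUNNING), false));
    }
}
```

Because the terminal outcome is recorded as SUSPENDED rather than CANCELED, a later reconcile sees a job that should be resumed even if the spec change that triggered the cancel has been rolled back.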

Refactored and improved FlinkService cancel methods

To eliminate duplicate logic and overall reduce complexity the cancel application / session jobs methods have been refactored to re-use the common parts. Also a significant portion of the logic has been removed by separating the suspend and restore (upgrade) mechanism.

The JobUpgrade utility class now encapsulates the necessary suspend and restore mechanism for the stateful upgrade, depending on the current observed state. This allows us to better handle cases of async cancellation (SuspendMode.CANCEL) or, if the job is already cancelled (or in a terminal state), to do nothing (SuspendMode.NOOP) and simply perform the restore.
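As a rough illustration of the choice between the two modes named above, the following sketch picks a suspend mode from the observed job state. Only the names SuspendMode.CANCEL and SuspendMode.NOOP come from this description. The enum shape, the JobState subset and the chooseForLastState method are assumptions made for the example, and the real JobUpgrade class may carry more modes and more inputs.

```java
/** Hypothetical sketch of choosing a suspend mode from the observed job state. */
class SuspendModeSketch {

    /** Only the two modes mentioned in this description are modelled here. */
    enum SuspendMode { CANCEL, NOOP }

    /** Simplified subset of job states; CANCELED follows Flink's JobStatus spelling. */
    enum JobState { RUNNING, CANCELED, FAILED, FINISHED }

    static boolean isTerminal(JobState state) {
        return state == JobState.CANCELED
                || state == JobState.FAILED
                || state == JobState.FINISHED;
    }

    /** Terminal jobs need no suspend action; running jobs are cancelled first. */
    static SuspendMode chooseForLastState(JobState observed) {
        return isTerminal(observed) ? SuspendMode.NOOP : SuspendMode.CANCEL;
    }

    public static void main(String[] args) {
        for (JobState state : JobState.values()) {
            System.out.println(state + " -> " + chooseForLastState(state));
        }
    }
}
```

Separating the suspend decision from the restore means the restore path can be identical regardless of how the job reached its terminal state.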

Misc session job changes / fixes

In addition to making last-state upgrade mode generally available for session jobs, this PR includes several critical fixes to the core upgrade and cleanup logic that came out of this work, such as:

Improved cleanup method that correctly waits until the job is fully cancelled instead of deleting the CR too early (risk of leaving the job there)
Call observe during cancel for session jobs for correct behaviour
Use correct job config generation for session jobs similar to applications, such as retaining checkpoints during cancellation by default which is needed for the above cancel mechanism
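To illustrate the last item, the sketch below applies a checkpoint-retention default to a job configuration represented as a plain map. The key execution.checkpointing.externalized-checkpoint-retention and the value RETAIN_ON_CANCELLATION are standard Flink configuration. The class and method names are hypothetical, and letting an explicit user setting win over the default is an assumption of this sketch, not a statement about how the operator generates the config.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of defaulting checkpoint retention for a session job. */
class SessionJobConfigSketch {

    static final String RETENTION_KEY =
            "execution.checkpointing.externalized-checkpoint-retention";
    static final String RETAIN_ON_CANCELLATION = "RETAIN_ON_CANCELLATION";

    /**
     * Returns a copy of the user config with retention enabled unless the user
     * set the key explicitly (assumption: explicit settings take precedence).
     */
    static Map<String, String> withDefaults(Map<String, String> userConfig) {
        Map<String, String> effective = new HashMap<>(userConfig);
        effective.putIfAbsent(RETENTION_KEY, RETAIN_ON_CANCELLATION);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> userConfig = new HashMap<>();
        userConfig.put("parallelism.default", "2");
        System.out.println(withDefaults(userConfig));
    }
}
```

Without retained checkpoints, a cancelled job would leave no last state behind for the upgrade to restore from, which is why the cancel mechanism depends on this default.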

Other changes / improvements as an outcome

Remove last-state upgrade limitations for apps and use cancel in these cases (flink version upgrade for non-running jobs, jobs without HA enabled)

Verifying this change

Existing unit and E2E tests guard the current behaviour
New unit tests have been added to cover the session job last-state upgrades and the improved observe, reconcile, cleanup flow
Extensive manual testing on local kubernetes
Session job e2e extended with last-state test

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., are there any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: yes

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? docs updated

mateczagany

I have tested this change extensively locally and did not find any issues. Added some minor comments/questions, but overall I am happy with these changes.

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

...rnetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java

...c/main/java/org/apache/flink/kubernetes/operator/config/KubernetesOperatorConfigOptions.java

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java

...rator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/ReconciliationUtils.java

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

...operator/src/main/java/org/apache/flink/kubernetes/operator/validation/DefaultValidator.java

gyfora · 2024-10-03T08:54:57Z

Thanks for the review @mateczagany , addressed your comments

mxm

Thanks @gyfora! The core changes are a bit tricky to review due to the refactorings and related changes, but LGTM!

sap1ens · 2024-11-07T23:51:27Z

@gyfora FYI, I think I found a regression: https://issues.apache.org/jira/browse/FLINK-36673

gyfora force-pushed the FLINK-35414 branch 4 times, most recently from 1dfb263 to 24f0742 · August 29, 2024 06:21

gyfora force-pushed the FLINK-35414 branch from d325322 to c3c476c · September 6, 2024 09:59

mateczagany reviewed Oct 1, 2024

gyfora added 8 commits October 3, 2024 09:44

[FLINK-35414] Rework last-state upgrade mode to support job cancellation as suspend mechanism

9e2fe19

update e2e

4f97ce9

doc update

c228b8c

Add extra job upgrade tests

a293a38

Fix job upgrade mode switching corner cases

41f8e2d

test fixes

346f01b

Skip last-state test for Flink 1.16

324fc9b

Review comments

ff8bbbc

gyfora force-pushed the FLINK-35414 branch from c3c476c to ff8bbbc · October 3, 2024 07:44

mateczagany approved these changes Oct 3, 2024

mxm approved these changes Oct 4, 2024

gyfora merged commit d1827a4 into apache:main Oct 4, 2024
229 checks passed
