Fix: Don't fail suspended Kubeflow jobs #6295
base: fg91/dep/upgrade-kubeflow-training-operator
Code Review Agent Run #ae6c12
Actionable Suggestions - 0
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@                              Coverage Diff                               @@
##   fg91/dep/upgrade-kubeflow-training-operator    #6295     +/-   ##
===============================================================================
+ Coverage                               58.48%   58.50%   +0.02%
===============================================================================
  Files                                     937      937
  Lines                                   71088    71091       +3
===============================================================================
+ Hits                                    41577    41594      +17
+ Misses                                  26359    26348      -11
+ Partials                                 3152     3149       -3
Signed-off-by: Fabio Graetz <[email protected]>
Code Review Agent Run #58db2d
Actionable Suggestions - 1
isSuspended := app.Spec.RunPolicy.Suspend != nil && *app.Spec.RunPolicy.Suspend
if !isSuspended && app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
Consider checking if app.Spec.RunPolicy is nil before accessing app.Spec.RunPolicy.Suspend. This would prevent a potential nil pointer dereference if RunPolicy is not initialized.
Code suggestion
Check the AI-generated fix before applying
- isSuspended := app.Spec.RunPolicy.Suspend != nil && *app.Spec.RunPolicy.Suspend
- if !isSuspended && app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
+ isSuspended := false
+ if app.Spec.RunPolicy != nil && app.Spec.RunPolicy.Suspend != nil && *app.Spec.RunPolicy.Suspend {
+     isSuspended = true
+ }
+ if !isSuspended && app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
Why are the changes needed?
The flyteplugins kubeflow plugins currently fail tasks if the underlying CRD object (PyTorchJob, TFJob, MPIJob) hasn't been updated by the training operator within a certain timeout period (see here):
However, the jobs provided by the kubeflow training operator can be in a so-called "suspended" state (e.g. when using them with an external queueing system like Kueue). Jobs in a suspended state are not expected to be updated by the training operator.
What changes were proposed in this pull request?
I propose to add a check for suspension to the error condition in the code snippet above so that the flyteplugin ignores unmodified kubeflow jobs if they are in a suspended state.
Please note that other flyteplugins like spark or ray don't have this "CRD has been updated" check at all so this change does not introduce inconsistency.
How was this patch tested?
Added unit tests. Ran flytepropeller image with this change in cluster.
Check all the applicable boxes
Related PRs
Summary by Bito
This PR updates Kubeflow plugins (MPI, PyTorch, TensorFlow) to properly handle suspended jobs by adding suspension checks to timeout failure conditions. It prevents premature task failures by verifying job suspension status before raising errors, ensuring suspended jobs are treated appropriately without affecting other plugins.
Unit tests added: False
Estimated effort to review (1-5, lower is better): 1