Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dep: Upgrade kubeflow training operator #6294

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Mar 2, 2025

Why are the changes needed?

Goal: Enable #6295 which prevents the flyteplugins kubeflow plugins from failing a task when a PytorchJob, TfJob, ... hasn't been updated by the kubeflow training operator because the job was suspended (e.g. by an external queueing system).

The run policy's "suspend" attribute is introduced only in a newer version of the kubeflow training operator.

What changes were proposed in this pull request?

Upgrade the kubeflow training operator dependency.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

#6295

Summary by Bito

This pull request upgrades the kubeflow training operator dependency and related k8s libraries across multiple modules to support new suspend functionality. The changes include standardizing API usage by replacing legacy commonOp types with kubeflowv1 types in PyTorch and TensorFlow operators, ensuring compatibility with newer operator policies.

Unit tests added: False

Estimated effort to review (1-5, lower is better): 5

@flyte-bot
Copy link
Collaborator

flyte-bot commented Mar 2, 2025

Code Review Agent Run #bfd79d

Actionable Suggestions - 0
Review Details
  • Files reviewed - 12 · Commit Range: a881934..a881934
    • flyteplugins/go.mod
    • flyteplugins/go.sum
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow_test.go
    • go.mod
    • go.sum
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • SNYK (Security Vulnerability) - ✔︎ Successful
    • GOVULNCHECK (Security Vulnerability) - ✖︎ Failed
    • OWASP (Security Vulnerability) - ✔︎ Successful

AI Code Review powered by Bito Logo

Copy link

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 60.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 58.48%. Comparing base (b779bed) to head (5effd97).

Files with missing lines Patch % Lines
.../plugins/k8s/kfoperators/common/common_operator.go 43.33% 17 Missing ⚠️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6294      +/-   ##
==========================================
- Coverage   58.49%   58.48%   -0.01%     
==========================================
  Files         937      937              
  Lines       71088    71088              
==========================================
- Hits        41583    41577       -6     
- Misses      26353    26359       +6     
  Partials     3152     3152              
Flag Coverage Δ
unittests-datacatalog 59.06% <ø> (ø)
unittests-flyteadmin 56.27% <ø> (-0.03%) ⬇️
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 64.70% <ø> (ø)
unittests-flyteidl 76.12% <ø> (ø)
unittests-flyteplugins 61.00% <60.00%> (ø)
unittests-flytepropeller 54.79% <ø> (ø)
unittests-flytestdlib 64.04% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@flyte-bot
Copy link
Collaborator

flyte-bot commented Mar 2, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Other Improvements - Dependency Upgrades for Compatibility

go.mod - Upgraded multiple dependencies including github.com/golang/protobuf and various k8s modules.

go.sum - Synchronized checksum updates to reflect upgraded dependency versions.

go.mod - Updated dependency versions in go.mod to align with newer modules for kubeflow and related libraries.

go.sum - Refreshed go.sum file with new checksums resulting from dependency upgrades.

go.mod - Upgraded kubeflow training operator dependency and updated related modules.

go.sum - Removed deprecated dependency entries and updated checksums to reflect dependency upgrades.

go.sum - Removed unused dependency entries, reflecting cleanup post-upgrade.

go.mod - Upgraded dependency versions for protobuf and k8s modules to newer versions.

go.sum - Upgraded dependency versions and updated checksums with removal of obsolete entries; kubeflow training operator upgraded to v1.8.0.

go.mod - Upgraded dependency versions including controller-runtime, fsnotify, and github.com/gorilla/websocket reflecting updated dependencies.

go.sum - Updated go.sum file with refreshed dependency checksums and removal of unused entries.

Feature Improvement - Kubeflow API Upgrade Implementation

pytorch.go - Updated PyTorch job specifications by replacing commonOp types with kubeflowv1, correcting container name casing and job kind references.

pytorch_test.go - Refactored test functions to utilize kubeflowv1 job condition types and updated replica specifications.

tensorflow.go - Modified TensorFlow resource construction to use kubeflowv1 API structures for replica specs and run policy.

tensorflow_test.go - Aligned TensorFlow test cases with kubeflowv1 job conditions and API changes.

common_operator.go - Previously refactored to use kubeflowv1 API calls for job condition handling.

common_operator_test.go - Previously updated tests to reflect kubeflowv1 API changes in operator logic.

mpi.go - Previously migrated MPI operator code to use kubeflowv1 API for run policy and replica specs.

mpi_test.go - Previously revised MPI tests to reflect updated kubeflowv1 job condition and replica specifications.

Testing - Test Error Message Refinements

matchable_cluster_resource_attribute_test.go - Revised error message assertion to compare against err.Error() instead of formatted output.

matchable_execution_cluster_label_test.go - Standardized error message comparisons by replacing fmt.Errorf with direct string comparisons.

matchable_execution_queue_attribute_test.go - Modified test assertions to directly compare error messages for JSON unmarshalling failures.

matchable_plugin_override_test.go - Updated test error output to consistently use err.Error() for JSON unmarshaling errors.

matchable_task_resource_attribute_test.go - Aligned error message checks by switching from fmt.Errorf formatted output to err.Error() string comparison.

matchable_workflow_execution_config_test.go - Refined test logic by updating error messages to utilize err.Error() for clarity.

@flyte-bot
Copy link
Collaborator

flyte-bot commented Mar 2, 2025

Code Review Agent Run #052221

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: a881934..da7af29
    • go.mod
    • go.sum
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • SNYK (Security Vulnerability) - ✔︎ Successful
    • GOVULNCHECK (Security Vulnerability) - ✖︎ Failed
    • OWASP (Security Vulnerability) - ✔︎ Successful

AI Code Review powered by Bito Logo

Copy link

github-actions bot commented Mar 2, 2025

Dependency Review

The following issues were found:

  • ❌ 1 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 5 package(s) with unknown licenses.

View full job summary

@fg91 fg91 marked this pull request as draft March 2, 2025 16:18
@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@fg91 fg91 force-pushed the fg91/dep/upgrade-kubeflow-training-operator branch from 72d0107 to 5effd97 Compare March 7, 2025 06:07
@fg91 fg91 added the dependencies Pull requests that update a dependency file label Mar 7, 2025
@fg91 fg91 self-assigned this Mar 7, 2025
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result of go mod tidy.

fmt.Errorf("error unmarshaling JSON: while decoding JSON: json: unknown field \"InvalidDomain\""),
err)
"error unmarshaling JSON: while decoding JSON: json: unknown field \"InvalidDomain\"",
err.Error())
s.DeleterExt.AssertNotCalled(t, "DeleteProjectDomainAttributes",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to a dependency upgrade the type of the error changed:

            	Error:      	Not equal: 
            	            	expected: *errors.errorString(&errors.errorString{s:"error unmarshaling JSON: while decoding JSON: json: unknown field \"InvalidDomain\""})
            	            	actual  : *fmt.wrapError(&fmt.wrapError{msg:"error unmarshaling JSON: while decoding JSON: json: unknown field \"InvalidDomain\"", err:(*errors.errorString)(0xc0001314f0)})
            	Test:       	TestDeleteTaskResourceAttributes/attribute_deletion_invalid_file

Let's only compare the error message, not the type.

github.com/hashicorp/golang-lru v0.5.4
github.com/imdario/mergo v0.3.13
github.com/kubeflow/common v0.4.3
github.com/kubeflow/training-operator v1.5.0-rc.0
github.com/kubeflow/training-operator v1.8.0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important change here is that github.com/kubeflow/common isn't needed anymore as it has been integrated into github.com/kubeflow/training-operator.

@@ -6,7 +6,7 @@ import (
"sort"
"time"

commonOp "github.com/kubeflow/common/pkg/apis/common/v1"
kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
v1 "k8s.io/api/core/v1"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github.com/kubeflow/common has been integrated into github.com/kubeflow/training-operator. This leads to a lot of commonOp -> kubeflowv1 changes below.

@@ -1204,7 +1203,7 @@ func TestBuildResourcePytorchV1WithElastic(t *testing.T) {
var hasContainerWithDefaultPytorchName = false

for _, container := range pytorchJob.Spec.PyTorchReplicaSpecs[kubeflowv1.PyTorchJobReplicaTypeWorker].Template.Spec.Containers {
if container.Name == kubeflowv1.PytorchJobDefaultContainerName {
if container.Name == kubeflowv1.PyTorchJobDefaultContainerName {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pytorch -> PyTorch

@fg91 fg91 marked this pull request as ready for review March 7, 2025 06:28
@flyte-bot
Copy link
Collaborator

flyte-bot commented Mar 7, 2025

Code Review Agent Run #869081

Actionable Suggestions - 0
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • flyteplugins/go.mod - 1
Review Details
  • Files reviewed - 25 · Commit Range: 7e6188b..5effd97
    • flyteadmin/go.mod
    • flyteadmin/go.sum
    • flytectl/cmd/delete/matchable_cluster_resource_attribute_test.go
    • flytectl/cmd/delete/matchable_execution_cluster_label_test.go
    • flytectl/cmd/delete/matchable_execution_queue_attribute_test.go
    • flytectl/cmd/delete/matchable_plugin_override_test.go
    • flytectl/cmd/delete/matchable_task_resource_attribute_test.go
    • flytectl/cmd/delete/matchable_workflow_execution_config_test.go
    • flytectl/go.mod
    • flytectl/go.sum
    • flyteidl/go.sum
    • flyteplugins/go.mod
    • flyteplugins/go.sum
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow_test.go
    • flytepropeller/go.mod
    • flytepropeller/go.sum
    • go.mod
    • go.sum
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • OWASP (Security Vulnerability) - ✔︎ Successful
    • GOVULNCHECK (Security Vulnerability) - ✖︎ Failed
    • SNYK (Security Vulnerability) - ✔︎ Successful

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants