KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382

Open
wants to merge 11 commits into
base: master
from

Doris-xm

@Doris-xm Doris-xm commented Jan 9, 2025

What this PR does / why we need it:

This PR adds the manifests overlay for Kubeflow Training V2, allowing it to be installed within the Kubeflow Platform.

Which issue(s) this PR fixes:
Fixes #2381

Checklist:

  • Docs included if any changes are user facing

@Electronic-Waste
Member

Will review it later.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
/ok-to-test
/rerun-all

@google-oss-prow google-oss-prow bot requested a review from a team January 9, 2025 15:25

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Member

@Electronic-Waste Electronic-Waste left a comment

Thanks for creating this @Doris-xm! I left some initial comments for you.

Btw, I recommend learning more about the concepts behind Training V2. This will help you update the manifests overlay in both training-operator and kubeflow/manifests :)

FYI: KubeCon 2024 NA Talk by Andrey and Yuki

@Electronic-Waste
Member

/rerun-all

@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Jan 10, 2025
Member

@Electronic-Waste Electronic-Waste left a comment

Basically LGTM. Thank you for your great contributions!

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads

Member

@andreyvelich andreyvelich left a comment

Thank you for this great contribution @Doris-xm!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team

Member

@Electronic-Waste Electronic-Waste left a comment

Basically LGTM!

Xinmin Du added 5 commits January 13, 2025 15:16
Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
@Doris-xm Doris-xm force-pushed the add-overlay-manifest-v2 branch from 9518f7b to 1f1b0c2 Compare January 13, 2025 07:17
Member

@Electronic-Waste Electronic-Waste left a comment

LGTM! Thanks for updating this:)

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402

@google-oss-prow google-oss-prow bot added the lgtm label Jan 13, 2025
@andreyvelich andreyvelich changed the title Add the manifests overlay for Kubeflow Training V2 KEP-2170: Add the manifests overlay for Kubeflow Training V2 Jan 14, 2025
@coveralls

Pull Request Test Coverage Report for Build 12742381626

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12718184264: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

# Conflicts:
#	manifests/v2/base/manager/kustomization.yaml
#	manifests/v2/base/rbac/kustomization.yaml
#	manifests/v2/base/webhook/kustomization.yaml
#	manifests/v2/overlays/only-manager/kustomization.yaml
#	manifests/v2/overlays/standalone/kustomization.yaml
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 25, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Electronic-Waste
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@Electronic-Waste Electronic-Waste left a comment

Basically LGTM. PTAL if you have time @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @astefanutti

Btw, could you mark the addressed comments as resolved, @Doris-xm? Some of them are outdated.

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Feb 26, 2025
@juliusvonkohout
Member

Basically LGTM. PTAL if you have time @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @astefanutti

Btw, can you mark some addressed comments as resolved @Doris-xm ? Some of them are outdated.

/lgtm

I think we can provide more input once you create the PR against kubeflow/manifests, including the integration tests, so we can see how it behaves with the platform components, authorization, security, etc.

@google-oss-prow google-oss-prow bot removed the lgtm label Feb 26, 2025
@Doris-xm
Author

Doris-xm commented Mar 1, 2025

Thanks for your detailed reviews. I appreciate your patience and guidance.
I've addressed the comments above. PTAL if you have time:) @kubeflow/wg-training-leads

Member

@Electronic-Waste Electronic-Waste left a comment

Sorry for the late review @Doris-xm. Basically LGTM. Thanks!

I think we can provide more input once you create the PR against kubeflow/manifests, including the integration tests, so we can see how it behaves with the platform components, authorization, security, etc.

It would be better if you could adopt the suggestions from @juliusvonkohout and create another PR against kubeflow/manifests: kubeflow/manifests#2948

/cc @kubeflow/wg-training-leads @astefanutti @kubeflow/wg-manifests-leads

Comment on lines 1 to 10
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
sortOptions:
  order: fifo
resources:
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/runtimes/pretraining
Member

@Electronic-Waste Electronic-Waste Mar 13, 2025

Does FIFO ordering work for installing all these components? Could you please verify it and provide a screenshot here? @Doris-xm

Author

@Doris-xm Doris-xm Mar 14, 2025

Thanks for the suggestion. I will create a PR in kubeflow/manifests to test it under CI/CD soon.

In my experiment, installing the runtimes before the webhooks works well:

[Screenshot: 2025-03-14 23 55 04]
The installation is shown above. The trainer-controller-manager pod started successfully shortly afterwards; I took this screenshot before it was ready.
Have I verified all the necessary resources? @Electronic-Waste @andreyvelich PTAL if you have time :)

Member

Amazing! That sounds good to me. I have no other comment. Do you have any other ideas? @kubeflow/wg-training-leads @astefanutti

Contributor

One thing: as a required dependency, JobSet should logically be deployed first, so maybe move it to the first position in the list, WDYT?

Member

One thing: as a required dependency, JobSet should logically be deployed first, so maybe move it to the first position in the list, WDYT?

SGTM
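
As a rough illustration of the ordering discussed in this thread, a kustomization along the following lines would list JobSet first and keep the runtimes ahead of the webhook. This is only a sketch, not the file from this PR: the `../../third-party/jobset` path is hypothetical, and the position of `../../base/webhook` is an assumption based on the experiment described earlier in the thread. With `sortOptions.order: fifo`, kustomize emits resources in the order they are listed instead of applying its legacy kind-based sort, so the list order is what determines the apply order.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
sortOptions:
  # fifo keeps the resources in the order listed below
  order: fifo
resources:
  # Hypothetical path: JobSet is a required dependency, so it is listed first
  - ../../third-party/jobset
  - ../../base/crds
  - ../../base/manager
  - ../../base/rbac
  - ../../base/runtimes/pretraining
  # Assumed position: the webhook comes after the runtimes
  - ../../base/webhook
```

Rendering the overlay with `kustomize build` and inspecting the output order is one way to check that the sequence matches expectations before applying it to a cluster.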

@google-oss-prow google-oss-prow bot requested review from a team and astefanutti March 13, 2025 12:44

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/wg-manifests-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Signed-off-by: Xinmin Du <[email protected]>
@@ -4,12 +4,12 @@ namespace: kubeflow
sortOptions:
  order: fifo
Member

Should we do the same for the standalone installation?
That way we don't need to install the manager and runtimes in separate overlays.

Member

SGTM. Could you please create a separate issue to address it? @Doris-xm

Author

Thanks for the suggestion. I'll start working on it.

@Electronic-Waste
Member

Thanks for this @Doris-xm! I believe we can move forward with this PR. In the future, we can raise another PR to integrate trainer into the Kubeflow Platform: kubeflow/manifests#2948

/lgtm
/assign @kubeflow/wg-training-leads @astefanutti

@google-oss-prow google-oss-prow bot added the lgtm label Mar 16, 2025
Development

Successfully merging this pull request may close these issues.

KEP-2170: Add Installation for Kubeflow Platform
7 participants