
Conversation

@justinsb
Member

No description provided.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 15, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 15, 2025
@justinsb justinsb force-pushed the discovery branch 2 times, most recently from 8396a06 to eead354 on December 15, 2025 18:52
MinVersion: tls.VersionTLS12,
}

server := &http.Server{
Member

http3? I had white wine before.

Member Author

Sorry, I don't get it ... would you like to use http3? Have I opted in to http3 automatically? Do you want us to be the "guinea pig" for http3 in kube (if so, I'm game!)

I don't think there's anything special we need from http3. We do need the client certificate information. I was thinking we would probably end up deploying this directly behind an L4 load balancer, or (failing that) using ingress or gateway with SNI routing.

In terms of backends, right now I have this with a simple in-memory implementation. Honestly that's probably good enough to get started, as we will not be offering any guarantee as to retention of these objects.

But ... if we wanted to do better, I think we should put them into etcd, because (1) we should be able to run etcd pretty cheaply, and we don't have to worry about racking up a huge GCS bill if someone figures out how to make us send queries to GCS, etc.; and (2) it means we can use etcd-operator, which would be good from the "all the wood behind one arrow" perspective.

Member

I was half-joking about using http3 for the discovery server, but it looks like the OIDC protocol is only compatible with HTTP 1.1.

Member Author

I think we can support http3, but let's start with whatever Go gives us out of the box (which I think is still http1 or http2).

I do think a controversial one would be to support DNS over HTTP, if you're feeling spicy :-)

@@ -0,0 +1,82 @@
# Using kubectl with Discovery Service

Since the Discovery Service now emulates the Kubernetes API for `DiscoveryEndpoint` resources, you can use `kubectl` to interact with it.
Member Author

I'm going to ask gemini to clean up these instructions / demo scripts.

Member Author

Cleaned up!

}

// DiscoveryEndpoint represents a registered client in the discovery service.
type DiscoveryEndpoint struct {
Member Author

I should move this to apis/discovery.kops.k8s.io/v1alpha1 for consistency (and probably change the version to v1alpha1)

Member Author

Done!

{
Name: "discoveryendpoints",
SingularName: "discoveryendpoint",
Namespaced: true,
Member Author

I think this should probably be cluster-scoped, although I guess we could use the namespace to indicate the cluster if we wanted to allow multiple clusters to share the same CA (which isn't a terrible idea if someone is doing multicluster).

Member

Do we consider RBAC in the equation?

Member Author

So this isn't technically kube-apiserver, and I haven't implemented RBAC.

Right now we have this: anyone that has any cert signed by a CA can read the objects for that CA's universe (defined by the hash of the CA public key). You can write an object that matches your own CN only.

I probably need to build out the client side here to better understand what we actually need, and whether it's acceptable to share the same CA certificate, etc. (e.g. maybe we should only let kubelet certificates register, or only let control plane nodes register, or create a dedicated CA just for discovery).

discovery/go.mod Outdated

require (
k8s.io/apimachinery v0.34.3
k8s.io/client-go v0.34.3
Member Author

Technically, client-go / apimachinery are only used by the clients / tests, so it might be nice to split them out. But doing that would require a separate go.mod, which is a bit of a pain.

@justinsb justinsb force-pushed the discovery branch 4 times, most recently from 2136f14 to 294dd35 on December 16, 2025 18:00
return s
}

func (s *Server) registerRoutes() {
Member

So we never delete endpoints?

Member Author

Not currently, no. You're right: I should add a TTL. (Maybe 2 hours, and then we can have nodes register every hour?) It won't be a "hard" TTL; we reserve the right to remove objects at any time (and I should add that to the README.md / GEMINI.md).

I should probably also add explicit deletion support.

@justinsb justinsb force-pushed the discovery branch 4 times, most recently from 1d369ae to 8ee0bca on December 30, 2025 23:31
@justinsb
Member Author

justinsb commented Jan 1, 2026

  • Behind a feature flag DiscoveryService
  • Server deployed (temporarily?) at https://discovery.kubedisco.com, using the manifest in the repo
  • Simple e2e test that verifies the behavior (but does not yet actually try using the issuer e.g. in the e2e test)
  • Data lives in-process, and is not garbage collected
  • nodeup on the control-plane registers JWKS data

Big TODOs, which I propose we do in follow-up PRs:

  • Deploy on k8s.io?
  • Implement etcd backend, including TTL
  • Implement re-registration (maybe by running nodeup as a periodic systemd job, or maybe something more lightweight) so that we do not lose the registration after the TTL
  • Add kube e2e test to make sure this actually works!
  • Probably many other things

@justinsb justinsb changed the title WIP: simple discovery server Simple discovery server Jan 1, 2026
@justinsb justinsb marked this pull request as ready for review January 1, 2026 02:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 1, 2026
@k8s-ci-robot k8s-ci-robot requested a review from olemarkus January 1, 2026 03:00
@justinsb
Member Author

justinsb commented Jan 1, 2026

Removing WIP; there's still work to be done, but it's safely behind a feature flag.

@justinsb
Member Author

justinsb commented Jan 1, 2026

/test pull-kops-kubernetes-e2e-ubuntu-gce-build

I think when we make this re-register periodically we also want to make failures non-blocking (currently nodeup will fail if discovery.kubedisco.com is offline, which is obviously not what we want), but I don't think that is the problem here.

Member

@hakman hakman left a comment

LGTM, let's ship it! 😁

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 7, 2026
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jan 7, 2026

@justinsb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kops-kubernetes-e2e-ubuntu-gce-build | 29cf465 | link | false | /test pull-kops-kubernetes-e2e-ubuntu-gce-build |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hakman
Member

hakman commented Jan 7, 2026

/test pull-kops-e2e-cni-cilium-etcd

@k8s-ci-robot k8s-ci-robot merged commit 82f4036 into kubernetes:master Jan 7, 2026
36 of 37 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Jan 7, 2026