Skip to content

Conversation

@fernando-villalba
Copy link
Collaborator

feat(webhook): add mutating and validating admission webhooks

This commit introduces the pkg/webhook module, implementing the Admission Control layer for the Multigres operator. This ensures that MultigresCluster resources are validated for referential integrity and populated with "visible defaults" before persistence.

Key Features:

  • Mutating Webhook (Defaulter): - Injects the System Catalog (postgres database) if missing.

    • Implements "Visible Defaults": Resolves implicit templates (Level 2) and hydrates the Parent Spec so users see the fully realized configuration in Etcd.
    • Respects API Mutual Exclusion: Explicit templateRef values are preserved (not expanded) to strictly adhere to validation rules forbidding both spec and ref on the same component.
    • Uses a request-scoped Resolver to respect Cluster-level TemplateDefaults, mirroring Controller behavior.
  • Validating Webhook:

    • Referential Integrity: Blocks the creation/update of Clusters referencing non-existent Core, Cell, or Shard templates.
    • Deletion Protection: Prevents the deletion of any Template (Core/Cell/Shard) that is currently referenced by an active MultigresCluster, preventing orphaned references.
    • Child Resource Protection: Blocks direct modification of managed child resources (Cells, Shards) by non-operator users.
  • Infrastructure & Testing:

    • Updated cmd/multigres-operator/main.go to wire up the Webhook Server, including CLI flags for cert management and default templates.
    • Updated Makefile to generate MutatingWebhookConfiguration and ValidatingWebhookConfiguration manifests.
    • Added a high-performance Integration Test suite (pkg/webhook/integration_test.go) using a shared envtest environment. This suite covers happy paths, edge cases (override precedence), and concurrent cache consistency (deletion races).

This commit introduces the `pkg/webhook` module, implementing the Admission Control layer for the Multigres operator. This ensures that `MultigresCluster` resources are validated for referential integrity and populated with "visible defaults" before persistence.

Key Features:
- **Mutating Webhook (Defaulter):** - Injects the System Catalog (`postgres` database) if missing.
  - Implements "Visible Defaults": Resolves implicit templates (Level 2) and hydrates the Parent `Spec` so users see the fully realized configuration in Etcd.
  - Respects API Mutual Exclusion: Explicit `templateRef` values are preserved (not expanded) to strictly adhere to validation rules forbidding both `spec` and `ref` on the same component.
  - Uses a request-scoped `Resolver` to respect Cluster-level `TemplateDefaults`, mirroring Controller behavior.

- **Validating Webhook:**
  - **Referential Integrity:** Blocks the creation/update of Clusters referencing non-existent Core, Cell, or Shard templates.
  - **Deletion Protection:** Prevents the deletion of any Template (Core/Cell/Shard) that is currently referenced by an active MultigresCluster, preventing orphaned references.
  - **Child Resource Protection:** Blocks direct modification of managed child resources (Cells, Shards) by non-operator users.

- **Infrastructure & Testing:**
  - Updated `cmd/multigres-operator/main.go` to wire up the Webhook Server, including CLI flags for cert management and default templates.
  - Updated `Makefile` to generate `MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration` manifests.
  - Added a high-performance Integration Test suite (`pkg/webhook/integration_test.go`) using a shared `envtest` environment. This suite covers happy paths, edge cases (override precedence), and concurrent cache consistency (deletion races).
@github-actions
Copy link

github-actions bot commented Jan 7, 2026

🔬 Go Test Coverage Report

Summary

Coverage Type Result
Threshold 0%
Previous Test Coverage Unknown%
New Test Coverage Unknown%

Status

⚠️ SKIPPED (Build Failed?)

Detail

Show New Coverage
No coverage report generated (Build likely failed).

…itable cert volume

This commit resolves multiple critical issues preventing the Operator's webhook server from starting up successfully in a secure, production-like environment (e.g., Kind).

1. Fix RBAC Permissions:
   The self-signed certificate strategy requires the operator to manage Secrets (for the CA), Services (for the endpoint), and to patch WebhookConfigurations (to inject the CA bundle).
   - Added `// +kubebuilder:rbac` markers to `pkg/webhook/cert/manager.go`.
   - Regenerated `config/rbac/role.yaml` to include permissions for `secrets`, `services`, `mutatingwebhookconfigurations`, and `validatingwebhookconfigurations`.

2. Fix Namespace Mismatch (CrashLoop):
   The operator was crashing because it defaulted to looking for resources in `multigres-system` while running in `multigres-operator`.
   - Updated `config/manager/manager.yaml` to inject the `POD_NAMESPACE` environment variable via the Kubernetes Downward API.
   - Updated `cmd/multigres-operator/main.go` to use `POD_NAMESPACE` as the default for the `--webhook-service-namespace` flag, ensuring the operator effectively "knows where it lives."

3. Fix Read-Only Filesystem Error:
   The deployment enables `readOnlyRootFilesystem: true`, causing the self-signed certificate generator to fail when writing `tls.crt` and `tls.key` to disk.
   - Added an `emptyDir` volume named `cert-dir` in `config/manager/manager.yaml`.
   - Mounted this volume to `/var/run/secrets/webhook` to provide a secure, writable scratch space for ephemeral certificates.

4. Build System:
   - Updated `Makefile` to include `./pkg/webhook/...` in the `controller-gen` object generation paths to ensuring deepcopy methods are generated for webhook-related types.
… resolver scoping

This commit finalizes the "Hybrid Admission Control" implementation by polishing the Webhook Certificate Manager, enforcing "Deep Defaults" (resources/limits) across all components, and fixing critical scoping bugs in the Template Resolver.

Webhook & Certificate Management:
* Auto-Discovery: Removed the manual `--webhook-service-name` flag. The Certificate Manager now automatically discovers the Service targeting the operator's pods (via `control-plane=controller-manager` label) to generate the correct SANs.
* Service Account: Injected `POD_SERVICE_ACCOUNT` env var to allow the webhook to identify the operator's own identity for child-resource validation exemptions.
* Setup: Refactored `webhook.Setup` to use a direct API client for certificate bootstrapping, avoiding cache start-up race conditions.

Resolver & Defaulting Logic:
* Namespace Scoping Bug: Fixed a critical bug in `handlers/defaulter.go` where the Resolver was using the Operator's namespace to look up templates instead of the Request's namespace.
* Deep Defaults: Implemented `DefaultResources*` functions in `resolver/defaults.go` and applied them in `cell.go` and `shard.go`. The Resolver now fully hydrates ResourceRequirements (Requests/Limits) and Storage if missing.
* Smart Defaulting: Updated `resolver/cluster.go` to inject a default `Pool` ("default") when automatically creating the system "0" shard.
* Template Fallbacks: Updated `handlers/validator.go` to skip "NotFound" errors for templates named "default" (Fallback templates), as the Resolver handles these via hardcoded values if they don't exist in the cluster.

Testing & Validation:
* Unit Tests: Updated `resolver/cell_test.go`, `shard_test.go`, and `cluster_test.go` to expect fully hydrated objects (with default resources and replicas) instead of empty structs.
* Integration Tests: Updated `integration_test.go`, `integration_lifecycle_test.go`, and `integration_resolution_enforcement_test.go` to match the new behavior where the controller enforces defaults immediately.
* Cert Manager Tests: Fixed `cert/manager_test.go` to mock the Service and WebhookConfigurations correctly for the new auto-discovery logic.

Dev Infrastructure:
* Samples: Added `config/samples/default-templates/` containing standard Core, Cell, and Shard templates.
* Makefile: Added `kind-deploy-no-webhook` target and a `config/no-webhook` Kustomize profile to allow deploying the operator in a simplified mode for debugging.
* Kustomize: Enabled the webhook component by default in `config/default`.
Refactors the webhook setup to use the controller-runtime builder pattern,
modernizes handlers to implement CustomValidator/Defaulter interfaces, and
overhauls internal certificate generation to be a managed background process.

Detailed Changes:

1. Webhook Registration & Handlers (Refactor)
   - Switched pkg/webhook/setup.go to use ctrl.NewWebhookManagedBy(mgr).
     This replaces manual server.Register calls and eliminates path drift
     between Go code and Kubebuilder markers.
   - Refactored handlers/defaulter.go and handlers/validator.go to
     implement webhook.CustomDefaulter and webhook.CustomValidator interfaces,
     removing manual JSON decoding logic.

2. Internal Certificate Management (Overhaul)
   - Converted pkg/webhook/cert from a static manager to a CertRotator that
     implements manager.Runnable.
   - Added a Bootstrap() phase in main.go to ensure certificates exist
     on disk before the webhook server binds to the port.
   - Implemented auto-detection: checks if certificates exist (e.g., from
     Cert Manager) and skips internal rotation if they do.
   - Added logic to automatically patch Mutating/ValidatingWebhookConfigurations
     with the generated CA bundle on every rotation.
   - Secrets now attempt to set the Operator's Deployment as an owner for
     garbage collection.

3. Configuration & RBAC
   - Updated api/v1alpha1/multigrescluster_types.go with RBAC permissions
     to manage Secrets and patch WebhookConfigurations.
   - Updated main.go to disable caching for Secret and Deployment objects
     to allow bootstrapping before the cache is fully started.
   - Deprecated explicit cert-strategy flags in favor of auto-detection.

This change was smoke tested, the tests need to be fixed, will be done on next commit.
Refactors webhook handlers to use typed interfaces, hardens integration
tests against race conditions, and tunes certificate rotation settings.

Detailed changes:
- tests(tablegroup): wrap updates in RetryOnConflict to fix flaky
  TestTableGroup_Lifecycle caused by controller/test race conditions.
- refactor(webhook): migrate Defaulter and Validator handlers to
  implement controller-runtime's CustomDefaulter and CustomValidator
  interfaces, simplifying logic and fixing build errors.
- fix(webhook/cert): increase internal certificate rotation threshold
  to 30 days (from 24h) to ensure ample recovery time during failures.
- test(webhook/cert): update manager tests to validate DNS names and
  correctly assert rotation behavior with the new threshold.
- style: fix staticcheck ST1005 error string formatting in validator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants