diff --git a/plans/.gitkeep b/plans/.gitkeep
deleted file mode 100644
index e69de29b..00000000
diff --git a/plans/phase-1/api-design/additional-context/data-relationships.md b/plans/phase-1/api-design/additional-context/data-relationships.md
new file mode 100644
index 00000000..7d76bd0f
--- /dev/null
+++ b/plans/phase-1/api-design/additional-context/data-relationships.md
@@ -0,0 +1,102 @@
+---
+title: MultigresCluster Canonical Data Sources and Relationships
+state: ready
+---
+
+## Summary
+> This document was provided by Sugu to explain how database and shard schema is stored in a Postgres database, and the problem of having two sources of truth for this database's definition: the Kubernetes cluster and Multigres.
+
+
+Multigres is going to have metadata in multiple places, and in some cases it has to be duplicated. Where duplication exists, we have to define which copy is the canonical source of truth. We will also define the relationships between the different data sources.
+
+## List of Data Sources
+
+In a Kubernetes cluster the following data sources are used:
+
+- Kubernetes CRD templates: These contain all the CRDs for the Multigres Operator. They are static. Example: the `MultigresCluster` type.
+- Kubernetes CRs: These are Kubernetes meta-objects that are instances of the CRDs. Kubernetes lets the operator manage their lifecycle. Example: `MultigresCluster` objects.
+- Kubernetes objects: These are real objects like pods and services.
+
+For simplicity, we'll unify all of the above data sources as one storage type: Kubernetes Store.
+
+In Multigres, we have the following data sources:
+
+- Global Toposerver: This stores the name of each MultigresCluster. Under each, it stores the per-cell metadata:
+  - Cell Name
+  - Local Toposerver Address
+
+- Local Toposerver: Components running within a cell register themselves in this server. Components can discover each other using this server.
+
+- Default TableGroup metadata: Simply put, this is a hidden area within the default database of the first Postgres server created for the MultigresCluster. In Multigres parlance, this first database is the only shard of the "default" tablegroup, which is inside the Multigres default database. Just like every Postgres server creates a default database, Multigres creates a logical default database that contains this unsharded tablegroup with a single shard. Essentially, `Multigres->Default Database->Default TableGroup->Only Shard` is the same as `Postgres->Default Database`. This metadata area contains:
+  - A list of databases.
+  - For each database, a list of tablegroups, of which the default is always present.
+  - For each tablegroup, a list of shards. For the default tablegroup, it's always a single shard, which is the current database.
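+
+As a sketch, the hidden metadata area can be pictured as the following YAML document. The field names are illustrative only; the actual on-disk format is not defined in this document:
+
+```yaml
+# Hypothetical rendering of the hidden metadata area right after bootstrap.
+databases:
+  - name: default          # the Multigres logical default database
+    tablegroups:
+      - name: default      # always present
+        shards:
+          - name: "0"      # the only shard: the current Postgres database
+```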
+
+| Postgres | Physical Multigres |
+|----------|---------------------|
+| Server | Multigres Cluster |
+| Database | Multigres Database->TableGroups->Shards (physical PG servers) |
+| Default Database | Multigres Default Database->Default TableGroup->Only Shard |
+| create table t | t is created in: Multigres Default Database->Default TableGroup->Only Shard |
+| create tablegroup tg1 for db1 (new syntax) | Tablegroup tg1 is created in db1 |
+| create table t1 in tg1 (new syntax) | Table t1 in all shards of tg1 |
+
+In other words, a Multigres cluster would have the following metadata:
+- default database
+  - default tablegroup
+    - only shard
+- db1
+  - default tablegroup
+    - only shard
+  - tg1
+    - shard "0-8"
+    - shard "8-inf"
+
+In this structure, we have a chicken-and-egg problem because this information also exists in the Kubernetes Store under the `MultigresCluster` custom resource.
+
+## Life of a cluster
+
+### Cluster initialization
+
+In the beginning, we have the Multigres Operator (MGO) running. In Kubernetes, CRD templates are already registered.
+
+MGO receives a command to create a new MultigresCluster. Let us assume:
+
+- Cluster name cluster1
+- Single cell cell1
+- One instance of each component in cell1: MultiPooler, MultiGateway, and MultiOrch
+
+MGO Actions:
+
+- MGO creates (or reuses) a Global Toposerver. Root path: `/multigres`.
+- Creates a new entry: `/multigres/cluster1/global`
+- Under the above entry, it creates the following subentries:
+  - `/multigres/cluster1/global/cell1`
+  - `/multigres/cluster1/global/cell1/local-toposerver`: This contains the address of the local toposerver.
+- MGO creates (or reuses) a Local Toposerver. Root path: `/multigres/cluster1/cell1`. No subentries at this point.
+- MGO launches the pods and services for MultiPooler, MultiGateway, and MultiOrch.
+
+Multigres Actions:
+
+- MultiPooler comes up and registers itself with the local toposerver under `/multigres/cluster1/cell1/`.
+- MultiPooler initializes an empty Postgres database. In this database, it creates a hidden area for the Multigres metadata. In it, it adds a default tablegroup and a single shard, which represents the current database.
+- MultiOrch discovers the MultiPooler, notices that it's a single-instance situation, appoints the MultiPooler as primary, and requests it to accept read-write traffic.
+- MultiGateway discovers the MultiPooler, which serves the default tablegroup. It fetches the tablegroup metadata from the MultiPooler, discovers only the default tablegroup, and sets itself up to serve traffic for it.
+
+Users:
+
+- Connect to MultiGateway (with no database name). This is implicitly resolved to the default tablegroup. They can create tables and use this database just like the default database of Postgres.
+
+#### What is the problem?
+
+The default tablegroup info is present in two places:
+- It's (implicitly) present in the Multigres Cluster definition.
+- It's present in the Postgres database under the hidden area created by MultiPooler.
+
+One may argue that the Multigres Cluster definition is the canonical source of truth. However, this would put the burden of maintaining the database metadata on the MultiPooler, which is not ideal. Also, this increases the surface area between MGO and Multigres.
+
+The alternative is to have the database be the source of truth and have MGO drive off of it. We'll analyze the options arising out of this approach below.
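+
+Before analyzing those options, here is a concrete sketch of the duplication. The structure below is illustrative only, not an actual storage format:
+
+```yaml
+# The same default-tablegroup fact, recorded twice.
+kubernetes-store:        # implicit in the MultigresCluster custom resource
+  default-database:
+    default-tablegroup:
+      shards: ["0"]
+postgres-hidden-area:    # created and written by MultiPooler
+  default-database:
+    default-tablegroup:
+      shards: ["0"]
+```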
+
+### Create Database
+
+User issues `CREATE DATABASE db1;`
\ No newline at end of file
diff --git a/plans/phase-1/api-design/considered-not-used/multigres-operator-api-v1alpha1-design-ddl-personas.md b/plans/phase-1/api-design/considered-not-used/multigres-operator-api-v1alpha1-design-ddl-personas.md
new file mode 100644
index 00000000..19433bd3
--- /dev/null
+++ b/plans/phase-1/api-design/considered-not-used/multigres-operator-api-v1alpha1-design-ddl-personas.md
@@ -0,0 +1,1240 @@
+# MultigresCluster API v1alpha1
+
+>This design was considered and discussed with the Multigres team but eventually rejected for being overly complex for the initial stages of the Multigres operator.
+
+## Summary
+
+This proposal defines the `v1alpha1` API for the Multigres Operator. The design is centered on a root `MultigresCluster` resource, which defines the core infrastructure, and a `MultigresDatabase` resource, which defines the logical database topology.
+
+This new design separates the **logical** definition of a database (its tablegroups and sharding) from the **physical** definition (the compute, memory, and storage resources for its shards). This provides a clear separation of concerns between Platform/SREs (managing the cluster) and DBAs/Developers (managing the databases). It also enables us to define an agnostic database and sharding spec that could potentially be used on platforms other than Kubernetes.
+
+1. **`MultigresCluster`**: The root resource defining the desired state of the entire cluster's core infrastructure (control plane, cells, and the implicit system catalog database).
+2. **`CoreTemplate`**: A reusable, namespaced resource for defining standard configurations for core components.
+3. **`CellTemplate`**: A reusable, namespaced resource for defining standard configurations for Cell-level components.
+4. **`MultigresDatabase`**: A namespaced, user-managed resource that defines the **logical** topology of a single database. It acts as a "claim" on a `MultigresCluster`, enabling namespace-scoped RBAC.
+5. **`MultigresDatabaseResources`**: A namespaced, user-managed resource that defines the **physical** resource allocation (compute, storage, templates) for the shards of a specific `MultigresDatabase`.
+6. **`ShardTemplate`**: A reusable, namespaced resource for defining standard configurations for Shard-level components (`MultiOrch` and `Pools`).
+
+The `MultigresCluster` owns core components (`TopoServer`, `Cell`). The `MultigresDatabase` CR becomes the owner of `TableGroup` CRs, which in turn own the `Shard` CRs.
+
+## Motivation
+
+Managing a distributed, sharded database system is complex. This revised parent/child "claim" model enhances the separation of concerns, providing distinct APIs for different personas and enabling new workflows if desired.
+
+ * **Separation of Concerns (by Persona):**
+    * **Platform (SRE):** Manages the `MultigresCluster` (cells, templates, core components) in a central namespace.
+    * **Application (DBA):** Manages the `MultigresDatabase` (logical sharding) in their own application namespace.
+    * **Infrastructure (SRE/Platform):** Manages the `MultigresDatabaseResources` (assigning physical resources to logical databases), also in the application namespace.
+ * **DDL-driven Workflow:** This model enables a new DB creation flow where a `CREATE DATABASE` DDL command can be intercepted to automatically generate the `MultigresDatabase` CR, with a default `MultigresDatabaseResources` CR being created automatically by the operator.
+ * **GitOps vs. DDL:** By generating and applying the CRs based on the imperative DDL, we can safely allow DDL-driven creation while still allowing GitOps to manage the underlying resource templates and cluster configuration, solving potential state conflicts.
+ * **Scoped Reusability:** By splitting templates into `CoreTemplate`, `CellTemplate`, and `ShardTemplate`, we provide clear, reusable configurations.
+
+## Proposal: API Architecture and Resource Topology
+
+### 1\. Cluster Infrastructure Topology
+
+This diagram shows what the `MultigresCluster` **directly owns and manages** as its core infrastructure.
+
+```ascii
+[MultigresCluster] 🚀 (Root CR - Manages Core Infrastructure)
+  │
+  ├── 🌍 [GlobalTopoServer] (Child CR)
+  │
+  ├── 🤖 MultiAdmin Resources (Child)
+  │
+  ├── 📇 SystemCatalog (Child - "Default" DB's Shard)
+  │
+  └── 💠 [Cell] (Child CR)
+        │
+        ├── 🚪 MultiGateway Resources
+        └── 📡 [LocalTopoServer] (Child CR, optional)
+```
+
+### 2\. Database "Claim" Topology
+
+This diagram shows the **decoupled "claim" model** for a user-created database. It shows how the database objects live in their own namespace and relate to each other and the cluster.
+
+```ascii
+Kubernetes Namespace: "platform-system"
++---------------------------------------+
+| [MultigresCluster] 🚀                 |
+|   (Watches for claims)                |
++---------------------------------------+
+                 ^
+                 │
+ (Operator resolves this "clusterRef")
+                 │
+Kubernetes Namespace: "app-team-1"
++---------------------------------------+
+| [MultigresDatabase] 🗃️ (Logical "Claim") |
+|   spec:                               |
+|     clusterRef: "example-cluster"     |
++---------------------------------------+
+    │          │
+    │ (Owns)   │ (Referenced by)
+    │          │
+    │          +-> [MultigresDatabaseResources] 🖥️
+    │                spec:
+    │                  databaseRef: "production-db"
+    │
+    │ (Controller merges logical + physical)
+    │
+    └-> 💠 [TableGroup] (Child CR)
+          │
+          └── 📦 [Shard] (Child CR)
+                │
+                ├── 🧠 MultiOrch Resources
+                └── 🏊 Pools (Postgres+MultiPooler)
+```
+
+-----
+
+## DDL-driven vs. GitOps-driven Workflow
+
+This API structure is designed to support two primary methods of database creation, which can coexist.
+
+### The DDL DB Creation Flow (Platform Agnostic)
+
+By default, MultigresCluster accepts DDL queries to create databases and the required infrastructure. This enables other users (e.g., DBAs or application developers) to create a database using SQL commands without defining infrastructure. It works like this:
+
+1. A user defines the **logical** topology via a DDL command:
+   ```sql
+   CREATE DATABASE "events" WITH SHARDING (TableGroup 'tg1' (Shard '0' (keyRange '0-80'), Shard '1' (keyRange '80-inf')));
+   ```
+2. `MultiGateway` intercepts this command, converts it into a platform-agnostic YAML spec, and hands it to the operator, which converts and applies it as a `MultigresDatabase` in the user's default namespace (or whichever namespace matches their role).
+3. The `multigres-database-controller` sees the new `MultigresDatabase` CR.
+4. The controller **automatically creates** a corresponding `MultigresDatabaseResources` CR (e.g., `events-resources`) in the same namespace (both generated CRs are sketched after this list).
+5. Crucially, the controller reads `MultigresCluster.spec.templateDefaults.shardTemplate` (e.g., `"cluster-wide-shard"`) and populates `MultigresDatabaseResources.spec.defaultShardTemplate` with this value.
+6. A separate `multigres-resource-controller` sees the `MultigresDatabaseResources` CR, merges it with the `MultigresDatabase` CR, and reconciles the `TableGroup` and `Shard` child resources, provisioning pods and storage.
+7. Once the physical resources are ready, the operator signals `MultiAdmin` (or whichever component we end up deciding on in the end) to run the final DDL to make the logical database available. This must also be state-driven: we must not fire and forget, or the logical DB may never be created.
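+
+For the DDL above, the resulting pair of CRs might look like the following sketch. The namespace and resource names are illustrative (chosen per the namespace policy and the `<db>-resources` naming convention), not fixed by this design:
+
+```yaml
+# Logical claim generated from the intercepted DDL (step 2).
+apiVersion: multigres.com/v1alpha1
+kind: MultigresDatabase
+metadata:
+  name: "events"
+  namespace: "default-db-claims"   # assumed: the policy's fallback namespace
+  annotations:
+    # Applied from spec.policy.defaultDatabaseAnnotations
+    "argocd.argoproj.io/ignore-resource": "true"
+spec:
+  clusterRef:
+    name: "example-cluster"
+    namespace: "platform-system"
+  databaseName: "events"
+  tablegroups:
+    - name: "tg1"
+      shards:
+        - name: "0"
+          keyRange:
+            start: "0"
+            end: "80"
+        - name: "1"
+          keyRange:
+            start: "80"
+            end: "inf"
+---
+# Physical spec auto-created by the controller (steps 4-5).
+apiVersion: multigres.com/v1alpha1
+kind: MultigresDatabaseResources
+metadata:
+  name: "events-resources"
+  namespace: "default-db-claims"
+spec:
+  databaseRef: "events"
+  # Copied from MultigresCluster.spec.templateDefaults.shardTemplate (step 5).
+  defaultShardTemplate: "cluster-wide-shard"
+```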
+### Resource Upgrades (Separation of Duties)
+
+Removing resource definitions from the `MultigresDatabase` CR creates a clear separation of duties. A user (e.g., an SRE) with permission to edit `MultigresDatabaseResources` can apply a change to upgrade a database's resources without touching the `MultigresDatabase` CR that a DBA might own.
+
+### Solving GitOps vs. DDL Chaos
+
+This separation allows DDL-created and Git-created databases to live in the same cluster. SREs can still create databases declaratively by applying `MultigresDatabase` and `MultigresDatabaseResources` CRs via Git.
+
+For DDL-driven databases, we can advise users to add an annotation to the `MultigresDatabase` CR. This tells a GitOps tool like Argo CD to ignore the resource, preventing it from being pruned or reverted because it is not in Git. We could allow defaulting these annotations in MultigresCluster too.
+
+> ```yaml
+> apiVersion: multigres.com/v1alpha1
+> kind: MultigresDatabase
+> metadata:
+>   name: "prod-analytics-db"
+>   annotations:
+>     # Tells Argo CD to "hands off"
+>     "argocd.argoproj.io/ignore-resource": "true"
+> spec:
+>   ...
+> ```
+
+-----
+
+## Design Details: API Specification
+
+### User Managed CR: MultigresCluster
+
+ * This CR and the three scoped templates (`CoreTemplate`, `CellTemplate`, `ShardTemplate`) are the primary editable entries for the **Platform Team**.
+ * This CR no longer contains `spec.databases`. Instead, it defines `spec.systemCatalog`, which is the implicit "default database". This default database does not create a `MultigresDatabase` CR inside the cluster and cannot be deleted.
+
+```yaml
+apiVersion: multigres.com/v1alpha1
+kind: MultigresCluster
+metadata:
+  name: example-cluster
+  # Typically lives in a central 'platform-system' namespace
+  namespace: platform-system
+spec:
+  # ----------------------------------------------------------------
+  # Global Images Cluster Configuration
+  # ----------------------------------------------------------------
+  images:
+    imagePullPolicy: "IfNotPresent"
+    imagePullSecrets:
+      - name: "my-registry-secret"
+    multigateway: "multigres/gateway:latest"
+    multiorch: "multigres/orch:latest"
+    multipooler: "multigres/pooler:latest"
+    multiadmin: "multigres/admin:latest"
+    postgres: "postgres:15.3"
+    # NOTE: etcd image is NOT defined here. It is defined
+    # in the 'managedSpec' of the toposerver components.
+
+  # ----------------------------------------------------------------
+  # Cluster-Wide Policies
+  # ----------------------------------------------------------------
+  policy:
+    # Set to true to allow 'CREATE DATABASE' DDL
+    # to automatically create MultigresDatabase CRs.
+    # If false, DDL commands will be rejected.
+    # This would need to trigger a configuration within MultiGateway to explicitly disable these DDLs.
+    # Default is true.
+    enableDDLDatabaseCreation: true
+
+    # Default annotations to add to DDL-created
+    # MultigresDatabase CRs to protect them from GitOps tools.
+    defaultDatabaseAnnotations:
+      "argocd.argoproj.io/ignore-resource": "true"
+      # "another-tool.com/hands-off": "true"
+
+    # Policy specifying what namespaces should be used to deploy and watch for DBs.
+    # ONLY namespaces specified here are watched by the controller.
+    databaseNamespaces:
+      # A list of mappings from a Postgres ROLE (group)
+      # to a target Kubernetes namespace.
+      # This tells the operator in which namespace a database should be
+      # created, depending on the role that creates it.
+      # The first match for the user's role is used.
+      roleToNamespace:
+        - role: "analytics_team"
+          namespace: "analytics-prod"
+        - role: "payments_team"
+          namespace: "payments-prod"
+        - role: "engineering_team"
+          namespace: "engineering-dev"
+
+      # The default namespace to use if a user's role
+      # does not match any of the mappings above.
+      fallbackNamespace: "default-db-claims"
+
+  # ----------------------------------------------------------------
+  # Cluster-Level Template Defaults
+  # ----------------------------------------------------------------
+  # These are the default templates to use for any component that
+  # does not have an inline 'spec' or an explicit 'templateRef'.
+  templateDefaults:
+    coreTemplate: "cluster-wide-core"
+    cellTemplate: "cluster-wide-cell"
+    # This is the default ShardTemplate used when a
+    # MultigresDatabase is created via DDL. The operator
+    # will copy this name into the auto-created
+    # MultigresDatabaseResources.spec.defaultShardTemplate field.
+    shardTemplate: "cluster-wide-shard"
+
+  # ----------------------------------------------------------------
+  # Global Components
+  # ----------------------------------------------------------------
+  globalTopoServer:
+    # --- OPTION 1: Inline Managed Spec ---
+    managedSpec:
+      image: "quay.io/coreos/etcd:v3.5"
+      replicas: 3
+      storage:
+        size: "10Gi"
+        class: "standard-gp3"
+      resources:
+        requests:
+          cpu: "500m"
+          memory: "1Gi"
+        limits:
+          cpu: "1"
+          memory: "2Gi"
+
+    # --- OPTION 2: Inline External Spec ---
+    # external:
+    #   endpoints:
+    #     - "https://etcd-1.infra.local:2379"
+    #     - "https://etcd-2.infra.local:2379"
+    #   caSecret: "etcd-ca-secret"
+    #   clientCertSecret: "etcd-client-cert-secret"
+
+    # --- OPTION 3: Explicit Template Reference ---
+    # templateRef: "my-explicit-core-template"
+
+  multiadmin:
+    # --- OPTION 1: Inline Managed Spec ---
+    managedSpec:
+      replicas: 2
+      resources:
+        requests:
+          cpu: "200m"
+          memory: "256Mi"
+        limits:
+          cpu: "500m"
+          memory: "512Mi"
+
+    # --- OPTION 2: Explicit Template Reference ---
+    # templateRef: "my-explicit-core-template"
+
+  # ----------------------------------------------------------------
+  # System Catalog (Implicit "Default Database")
+  # ----------------------------------------------------------------
+  # This is the bootstrapped, hidden "default" database that stores
+  # the list of all other databases and tablegroups.
+  # It is NOT a user database and does NOT get a MultigresDatabase CR.
+  systemCatalog:
+    # This spec defines the physical resources for the
+    # "default" database's single shard. It can use a template
+    # or an inline spec.
+ shardTemplate: "system-catalog-shard" + overrides: + pools: + primary: + cell: "us-east-1a" # Must be pinned to a cell + + # ---------------------------------------------------------------- + # Cells Configuration + # ---------------------------------------------------------------- + cells: + - name: "us-east-1a" + zone: "us-east-1a" + cellTemplate: "standard-cell-ha" + + - name: "us-east-1b" + zone: "us-east-1b" + spec: + multiGateway: + replicas: 2 + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + + # ---------------------------------------------------------------- + # Database Topology (REMOVED) + # ---------------------------------------------------------------- + # spec.databases has been REMOVED from this resource. + # This is now defined in separate `MultigresDatabase` + # and `MultigresDatabaseResources` CRs in user namespaces. + +status: + observedGeneration: 1 + conditions: + - type: Available + status: "True" + message: "All core components are available." + cells: + us-east-1a: + ready: True + gatewayReplicas: 3 + us-east-1b: + ready: True + gatewayReplicas: 2 +``` + +### User Managed CR: CoreTemplate + + * The `etcd` image is now defined within the `globalTopoServer.managedSpec`. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: CoreTemplate +metadata: + name: "standard-core-ha" + namespace: platform-system +spec: + # Defines the Global TopoServer component + globalTopoServer: + # --- OPTION 1: Managed by Operator --- + managedSpec: + image: "quay.io/coreos/etcd:v3.5" + replicas: 3 + storage: + size: "10Gi" + class: "standard-gp3" + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + # --- ALTERNATIVE OPTION 2: External Etcd --- + # external: + # endpoints: + # - "https://etcd-1.infra.local:2379" + # caSecret: "etcd-ca-secret" + # clientCertSecret: "etcd-client-cert-secret" + + # Defines the MultiAdmin component + multiadmin: + managedSpec: + replicas: 2 + resources: + requests: + cpu: "200m" + memory: "256Mi" + limits: + cpu: "500m" + memory: "512Mi" +``` + +### User Managed CR: CellTemplate + + * The `etcd` image is now defined within the `localTopoServer.managedSpec`. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: CellTemplate +metadata: + name: "standard-cell-ha" + namespace: platform-system +spec: + # Template strictly defines only Cell-scoped components. + multiGateway: + replicas: 2 + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" + affinity: + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app.kubernetes.io/component: multigateway + topologyKey: "kubernetes.io/hostname" + + # --- OPTIONAL: Local TopoServer --- + # Define if cells using this template should have their own dedicated ETCD. + localTopoServer: + managedSpec: + image: "quay.io/coreos/etcd:v3.5" + replicas: 3 + storage: + class: "standard-gp3" + size: "5Gi" + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" +``` + +### User Managed CR: MultigresDatabase + + * This new CR defines the **logical** structure of a database. + * `keyRange` is defined at the `shards` level. 
+ +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabase +metadata: + name: "production-db" + namespace: app-team-1 + annotations: + # Allows DDL-driven DBs to coexist with GitOps + "argocd.argoproj.io/ignore-resource": "true" +spec: + # Reference to the MultigresCluster (can be in another namespace) + clusterRef: + name: "example-cluster" + namespace: "platform-system" + + # The logical database name + databaseName: "production_db" + + # Defines the logical sharding layout + tablegroups: + - name: "orders_tg" + # The tablegroup itself is just a logical container + # Shards define the key ranges + shards: + - name: "0" + keyRange: + start: "0" + end: "80" + - name: "1" + keyRange: + start: "80" + end: "inf" + + - name: "users_tg" + # This tablegroup has a single shard, covering the whole range + shards: + - name: "0" + keyRange: + start: "0" + end: "inf" +``` + +### User Managed CR: MultigresDatabaseResources + + * This new CR provides the **physical** specification for a `MultigresDatabase`. + * Comments are updated to clarify the `defaultShardTemplate` workflow. + * These live within the same namespace as `MultigresDatabase`. Platform teams that want a clean separation of responsibilities may want to deny editing this, or add resourcing limits at namespace level. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabaseResources +metadata: + # Often named after the database it supports + name: "production-db-resources" + namespace: app-team-1 +spec: + # Reference to the logical database in the same namespace + databaseRef: "production-db" + + # Default template for all shards in this DB that + # do not have an explicit 'shardTemplate' defined below. + # + # NOTE: When this CR is auto-created by the DDL workflow, + # this field will be populated from + # 'MultigresCluster.spec.templateDefaults.shardTemplate'. + # + # An SRE/Infra user can override that default by + # setting this field explicitly. + defaultShardTemplate: "standard-shard-ha" + + # The physical specs, mirroring the logical structure. + tablegroups: + - name: "orders_tg" # MUST match name in MultigresDatabase + shards: + # --- SHARD 0: Using an Explicit Template --- + - name: "0" # MUST match name in MultigresDatabase + shardTemplate: "high-cpu-shard" + overrides: + pools: + primary: + cell: "us-east-1a" + + # --- SHARD 1: Using Inline Spec --- + - name: "1" + spec: + multiOrch: + replicas: 1 + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + primary: + type: "readWrite" + cell: "us-east-1b" + replicas: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + + - name: "users_tg" + shards: + # --- SHARD 0: Using Database Default Template --- + - name: "0" + # 'spec' and 'shardTemplate' are omitted. + # This will use 'spec.defaultShardTemplate' ("standard-shard-ha") + overrides: + pools: + primary: + cell: "us-east-1a" +``` + +### User Managed CR: ShardTemplate + + * This CR is **unchanged**. It defines the "shape" of a shard: its orchestration and its data pools. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: ShardTemplate +metadata: + name: "standard-shard-ha" + namespace: platform-system +spec: + # Template strictly defines only Shard-scoped components. + + # MultiOrch is a shard-level component (one per Raft group). 
+ multiOrch: + replicas: 1 + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + + # MAP STRUCTURE: Keyed by pool name for safe targeting. + pools: + primary: + type: "readWrite" + # 'cell' MUST be overridden in MultigresDatabaseResources + # if left empty here. + cell: "" + replicas: 2 + storage: + class: "standard-gp3" + size: "100Gi" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" + + dr-replica: + type: "readOnly" + cell: "us-west-2a" # Hardcoded cell example in template + replicas: 1 + storage: + class: "standard-gp3" + size: "100Gi" + postgres: + resources: + requests: + cpu: "1" + memory: "2Gi" + limits: + cpu: "2" + memory: "4Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" +``` + +----- + +### Child Resources (Read-Only) + + +### Child CR: TopoServer + + * Applies to both Global (owned by `MultigresCluster`) and Local (owned by `Cell`) topology servers. + * This CR does *not* exist if the user configures an `external` etcd connection in the parent. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: TopoServer +metadata: + name: "example-cluster-global" + namespace: platform-system + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresCluster + name: "example-cluster" + controller: true +spec: + # Fully resolved 'managedSpec' from MultigresCluster inline definition + image: "quay.io/coreos/etcd:v3.5" + replicas: 3 + storage: + size: "10Gi" + class: "standard-gp3" + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" +status: + conditions: + - type: Available + status: "True" + lastTransitionTime: "2025-11-07T12:01:00Z" + clientService: "example-cluster-global-client" + peerService: "example-cluster-global-peer" +``` + +----- + +### Child CR: Cell + + * Owned by `MultigresCluster`. + * Strictly contains `MultiGateway` and optional `LocalTopoServer`. + * The `allCells` field is used for discovery by gateways. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: Cell +metadata: + name: "example-cluster-us-east-1a" + namespace: platform-system + labels: + multigres.com/cluster: "example-cluster" + multigres.com/cell: "us-east-1a" + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresCluster + name: "example-cluster" + uid: "a1b2c3d4-1234-5678-90ab-f0e1d2c3b4a5" + controller: true +spec: + name: "us-east-1a" + zone: "us-east-1a" + + # Images passed down from global configuration + images: + multigateway: "multigres/gateway:latest" + + # Resolved from CellTemplate + Overrides + multiGateway: + replicas: 3 + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" + + # A reference to the GLOBAL TopoServer. + # Always populated by the parent controller if no local server is used. 
+ globalTopoServer: + rootPath: "/multigres/global" + clientServiceName: "example-cluster-global-client" + + # Option 1: Using the Global TopoServer (Default) + # + topoServer: {} + + # Option 2: Inline Definition (External) + # + # topoServer: + # external: + # address: "etcd-us-east-1a.my-domain.com:2379" + # rootPath: "/multigres/us-east-1a" + + # Option 3: Managed Local + # + # topoServer: + # managedSpec: + # rootPath: "/multigres/us-east-1a" + # image: "quay.io/coreos/etcd:v3.5" + # replicas: 3 + # storage: + # size: "5Gi" + # class: "standard-gp3" + + # List of all cells in the cluster for discovery. + allCells: + - "us-east-1a" + - "us-east-1b" + + # Topology flags for the Cell controller to act on. + topologyReconciliation: + registerCell: true + pruneTablets: true + +status: + conditions: + - type: Available + status: "True" + lastTransitionTime: "2025-11-07T12:01:00Z" + gatewayReplicas: 3 + gatewayReadyReplicas: 3 + gatewayServiceName: "example-cluster-us-east-1a-gateway" +``` + +#### Child CR: TableGroup + + * Owned by `MultigresDatabase` (no longer `MultigresCluster`). + * Its `spec` is populated by a controller that **merges** `MultigresDatabase` (logical) and `MultigresDatabaseResources` (physical). + * `keyRange` is no longer here; it's now part of the logical spec in the `shards` list. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: TableGroup +metadata: + name: "production-db-orders-tg" + namespace: app-team-1 + labels: + multigres.com/database: "production_db" + multigres.com/tablegroup: "orders_tg" + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresDatabase + name: "production-db" + controller: true +spec: + databaseName: "production_db" + tableGroupName: "orders_tg" + + # --- Logical & Physical Spec (Merged by Controller) --- + # This is the list of FULLY RESOLVED shard specifications. + shards: + - name: "0" + # --- Logical Spec (from MultigresDatabase) --- + keyRange: + start: "0" + end: "80" + # --- Physical Spec (from MultigresDatabaseResources) --- + multiOrch: + replicas: 1 + image: "multigres/orch:latest" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + primary: + cell: "us-east-1a" + type: "readWrite" + replicas: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" + + - name: "1" + # --- Logical Spec (from MultigresDatabase) --- + keyRange: + start: "80" + end: "inf" + # --- Physical Spec (from MultigresDatabaseResources) --- + multiOrch: + replicas: 1 + image: "multigres/orch:latest" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + primary: + cell: "us-east-1b" + type: "readWrite" + replicas: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + +status: + readyShards: 2 + totalShards: 2 +``` + +----- + +### Child CR: Shard + + * Owned by `TableGroup`. + * Now contains `MultiOrch` (Raft leader helper) AND `Pools` (actual data nodes). + * Represents one entry from the `TableGroup` shards list. 
+ +```yaml +apiVersion: multigres.com/v1alpha1 +kind: Shard +metadata: + name: "production-db-orders-tg-0" + namespace: app-team-1 + labels: + multigres.com/shard: "0" + multigres.com/database: "production_db" + multigres.com/tablegroup: "orders_tg" + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: TableGroup + name: "production-db-orders-tg" + controller: true +spec: + # 'shardName' and 'keyRange' are passed down + # from the parent TableGroup spec. + shardName: "0" + keyRange: + start: "0" + end: "80" + + # Fully resolved from parent TableGroup spec + multiOrch: + replicas: 1 + image: "multigres/orch:latest" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + + pools: + primary: + cell: "us-east-1a" + type: "readWrite" + replicas: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" +status: + primaryCell: "us-east-1a" + orchReady: True + poolsReady: True +``` +## Defaults via Webhooks + +### Cluster Component Override Chain (4-Level) + +This logic is applied consistently by the webhook for `GlobalTopoServer`, `MultiAdmin`, and `Cells`: + +1. **Component-Level Definition (Highest):** An inline `spec` or explicit `templateRef` on the component in the `MultigresCluster` CR. +2. **Cluster-Level Default Template:** The corresponding template in `MultigresCluster.spec.templateDefaults` (e.g., `templateDefaults.coreTemplate`). +3. **Namespace-Level Default Template:** A template of the correct kind (e.g., `CoreTemplate`) named `default` in the cluster's namespace. +4. **Operator Hardcoded Defaults (Lowest):** A final fallback applied by the operator. + +### Shard Override Chain (4-Level) + +This logic is applied by the controller when resolving the `TableGroup` spec: + +1. **Shard-Level Definition (Highest):** An inline `spec` or explicit `shardTemplate` on the entry in `MultigresDatabaseResources.spec.tablegroups.shards`. +2. **Database-Level Default Template:** The `spec.defaultShardTemplate` field in the `MultigresDatabaseResources` CR. +3. **Namespace-Level Default Template:** A `ShardTemplate` named `default` in the same namespace as the `MultigresDatabaseResources` CR. +4. **Operator Hardcoded Defaults (Lowest):** A final fallback applied by the operator. + + +## Webhook Validations and Controller Logic + +This section outlines the separation of duties between synchronous webhooks and asynchronous controller logic (finalizers and status) to ensure a robust, non-blocking, and declarative API. + +### 1. Synchronous Validating Webhooks + +Webhooks are used *only* for fast, synchronous, and semantic validation. + +#### `MultigresCluster` +* **On `CREATE` and `UPDATE`:** + * **Internal Consistency:** Validates that `spec.systemCatalog.overrides.pools.primary.cell` matches a `name` in the `spec.cells` list. + * **Template Existence:** Validates that templates named in `spec.templateDefaults` (e.g., `coreTemplate`) exist *at the time of creation/update*. This is an acceptable synchronous check as platform components are often bootstrapped together. + * **Component Spec:** Enforces that `managedSpec`, `external`, and `templateRef` are mutually exclusive. + * **Uniqueness:** Validates that all `name` entries in `spec.cells` are unique. + * **Note:** Namespace existence checks are **not** performed by the webhook. 
This is handled by the controller's reconcile loop to support declarative bootstrapping (e.g., `helm install` or `kubectl apply -f .`).
+
+#### `MultigresDatabase`
+* **On `CREATE` and `UPDATE`:**
+  * **Cluster Reference:** Verifies that the `MultigresCluster` specified in `spec.clusterRef` (e.g., `platform-system/example-cluster`) exists.
+  * **Namespace Policy:** Verifies that this `MultigresDatabase` is being created in a namespace that is "allowed" by the target `MultigresCluster`'s `spec.policy.databaseNamespaces` list.
+  * **Topology Validation:** Verifies that `tablegroups` and `shards` names are unique within the spec and that `keyRange` values are valid.
+
+#### `MultigresDatabaseResources`
+* **On `CREATE` and `UPDATE`:**
+  * **Database Reference:** Verifies that the `MultigresDatabase` named in `spec.databaseRef` exists **in the same namespace**.
+  * **Topology Consistency:** Performs a critical check by fetching the referenced `MultigresDatabase` and ensuring the `tablegroups` and `shards` defined here are a perfect subset of the logical topology. It will reject this CR if it defines a physical resource for a logical shard that doesn't exist.
+  * **Template/Cell Existence:** Verifies that the `ShardTemplate` in `spec.defaultShardTemplate` (and any per-shard `shardTemplate`) exists. It also verifies any `cell` name in a `pools` override exists on the target `MultigresCluster`.
+
+---
+
+### 2. Asynchronous Controller and Finalizer Logic
+
+Asynchronous logic (deletion protection, external dependencies) is handled by controllers and finalizers.
+
+#### `MultigresCluster`
+* **Namespace Existence (Controller Status):** The `MultigresCluster` controller (not the webhook) reconciles `spec.policy.databaseNamespaces`. If a listed namespace is not found, the controller will update its *own status* with a Condition (e.g., `Type: "PolicyReady", Status: "False", Reason: "MissingNamespace"`) and requeue. It will not block.
+* **Deletion Protection (Finalizer):**
+  1. The `MultigresCluster` controller adds a **finalizer** (e.g., `multigres.com/database-claims-check`) to its own CR.
+  2. When a user deletes the `MultigresCluster`, the `DeletionTimestamp` is set.
+  3. The controller's reconcile loop sees this and LISTs all `MultigresDatabase` CRs in the cluster.
+  4. If claims exist, the controller updates the `MultigresCluster.status` with a "Terminating" Condition and requeues. It **does not** remove the finalizer.
+  5. Only when all claims are gone does the controller remove its finalizer, allowing Kubernetes to complete the deletion.
+
+#### `MultigresDatabaseResources`
+* **Deletion Protection (Owner Reference):** This CR's deletion is handled by standard Kubernetes garbage collection. The controller that creates the `MultigresDatabaseResources` **must** set an `ownerReference` pointing to its corresponding `MultigresDatabase`. When the user deletes the `MultigresDatabase`, Kubernetes will automatically and correctly cascade-delete this `MultigresDatabaseResources` CR and its children (`TableGroup`, `Shard`).
+
+#### `CoreTemplate`, `CellTemplate`, `ShardTemplate`
+* **Deletion Protection (Finalizer):**
+  1. These templates are protected by a "fan-out" finalizer pattern.
+  2. When a `MultigresCluster` or `MultigresDatabaseResources` controller *uses* a template (e.g., "prod-cell"), its reconcile loop **adds a finalizer** (e.g., `multigres.com/in-use-by-example-cluster`) *to the `CellTemplate` object*.
+  3. 
When the `MultigresCluster` is deleted (or updated to no longer use "prod-cell"), its cleanup logic **removes its own finalizer** *from the `CellTemplate` object*. + 4. Kubernetes will only be able to delete the `CellTemplate` after all controllers that were using it have removed their finalizers. This prevents deleting a template that is actively in use. + +---- + +## End-User Examples + +### 1\. The Ultra-Minimalist (Relying on Namespace/Webhook Defaults) + +This user creates the smallest possible cluster and database. All values are defaulted by the operator's webhook or by templates named `default` in the namespace. + +**SRE applies the cluster:** + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresCluster +metadata: + name: minimal-cluster + namespace: platform-system +spec: + # All core components (globalTopoServer, multiadmin) + # will use the 'CoreTemplate' named 'default' in this + # namespace, or be defaulted by the webhook. + + # The system catalog will use the 'ShardTemplate' + # named 'default' or be defaulted by the webhook. + # It MUST be pinned to a cell. + systemCatalog: + overrides: + pools: + primary: + cell: "us-east-1a" + + # At least one cell is required. + cells: + - name: "us-east-1a" + zone: "us-east-1a" + # This cell will use the 'CellTemplate' named + # 'default' or be defaulted by the webhook. +``` + +**DBA applies the database (e.g., in `app-prod` namespace):** +This could also be auto-generated from a `CREATE DATABASE` DDL command. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabase +metadata: + name: "analytics-db" + namespace: app-prod +spec: + clusterRef: + name: "minimal-cluster" + namespace: "platform-system" + databaseName: "analytics" + tablegroups: + - name: "events" + shards: + - name: "0" # A single logical shard + keyRange: + start: "0" + end: "inf" +``` + +**What happens:** + +1. The operator sees `analytics-db`. +2. It auto-creates `MultigresDatabaseResources` named `analytics-db-resources`. +3. It populates `defaultShardTemplate` by first looking for `minimal-cluster.templateDefaults.shardTemplate`. Since that's missing, it looks for a `ShardTemplate` named `default` in the `app-prod` namespace. If that's missing, it uses the operator's hardcoded default. +4. The DBA must edit `analytics-db-resources` to add the required `overrides` (like `cell`), or the operator will report an error. + +### 2\. The Minimalist (Relying on Cluster Defaults) + +This user relies on the `spec.templateDefaults` field to set cluster-wide defaults. 
+ +**SRE applies the cluster:** + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresCluster +metadata: + name: minimal-cluster-with-defaults + namespace: platform-system +spec: + images: + imagePullPolicy: "IfNotPresent" + multigateway: "multigres/gateway:latest" + multiorch: "multigres/orch:latest" + multipooler: "multigres/pooler:latest" + multiadmin: "multigres/admin:latest" + postgres: "postgres:15.3" + + templateDefaults: + coreTemplate: "dev-defaults-core" + cellTemplate: "dev-defaults-cell" + # This is the cluster-wide default for DDL + shardTemplate: "dev-defaults-shard" + + globalTopoServer: + # Omitted, will use 'dev-defaults-core' template + multiadmin: + # Omitted, will use 'dev-defaults-core' template + + systemCatalog: + shardTemplate: "dev-defaults-shard" + overrides: + pools: + primary: + cell: "us-east-1a" + + cells: + - name: "us-east-1a" + zone: "us-east-1a" + # Omitted, will use 'dev-defaults-cell' template +``` + +**DBA applies the database (e.g., in `app-prod` namespace):** + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabase +metadata: + name: "analytics-db" + namespace: app-prod +spec: + clusterRef: + name: "minimal-cluster-with-defaults" + namespace: "platform-system" + databaseName: "analytics" + tablegroups: + - name: "events" + shards: + - name: "0" + keyRange: + start: "0" + end: "inf" +``` + +**What happens:** + +1. Operator auto-creates `analytics-db-resources`. +2. It reads `minimal-cluster-with-defaults.templateDefaults.shardTemplate` and sets `analytics-db-resources.spec.defaultShardTemplate` to `"dev-defaults-shard"`. +3. The DBA still needs to edit `analytics-db-resources` to add the `overrides` for the `cell`. + +### 3\. The Power User (Explicit Overrides) + +The SRE provides the cluster. The DBA defines their logical DB, and the SRE (or a platform-savvy DBA) defines the physical resources explicitly. + +**DBA applies the logical database:** + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabase +metadata: + name: "users-db" + namespace: app-prod +spec: + clusterRef: + name: "prod-cluster" + namespace: "platform-system" + databaseName: "users" + tablegroups: + - name: "users_tg" + shards: + - name: "0" + keyRange: + start: "0" + end: "100" + - name: "1" + keyRange: + start: "100" + end: "inf" +``` + +**SRE/DBA applies the physical resources:** + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: MultigresDatabaseResources +metadata: + name: "users-db-resources" + namespace: app-prod +spec: + databaseRef: "users-db" + # This SRE explicitly sets the default for this DB, + # overriding the cluster-wide default. + defaultShardTemplate: "standard-shard" + + tablegroups: + - name: "users_tg" + shards: + # Shard 0: Use an explicit template with overrides + - name: "0" + shardTemplate: "high-mem-shard" + overrides: + pools: + primary: + cell: "us-east-1a" + storage: + size: "500Gi" # Atomic override for storage + class: "io1" + + # Shard 1: Use the 'defaultShardTemplate' from this CR + - name: "1" + # No 'shardTemplate', uses 'defaultShardTemplate' ("standard-shard") + overrides: + pools: + primary: + cell: "us-west-2a" +``` + +## Implementation History + +* **2025-10-08:** Initial proposal to create individual, user-managed CRDs for each component (`MultiGateway`, `MultiOrch`, etc.). +* **2025-10-14:** A second proposal introduced a top-level `MultigresCluster` CR as the primary user-facing API. 
+* **2025-10-28:** The "parent/child" model was formalized, designating `MultigresCluster` as the single source of truth with read-only children.
+* **2025-11-05:** Explored a simplified V1 API limited to a single shard. Rejected to ensure the API is ready for multi-shard from day one.
+* **2025-11-06:** Explored a single "all-in-one" `DeploymentTemplate`. Rejected due to N:1 conflicts when trying to apply one template to both singular Cell components and multiplied Shard components.
+* **2025-11-07:** Finalized the "Scoped Template" model (`CellTemplate` & `ShardTemplate`) and restored the full explicit `database` -> `tablegroup` -> `shard` hierarchy.
+* **2025-11-10:** Refactored `pools` to use Maps instead of Lists and introduced atomic grouping for `resources` and `storage` to ensure safer template overrides.
+* **2025-11-11:** Introduced a consistent 4-level override chain (inline/explicit-template -> cluster-default -> namespace-default -> webhook) for all components. Added `CoreTemplate` CRD and `spec.templateDefaults` block to support this. Reverted `spec.coreComponents` nesting to top-level `globalTopoServer` and `multiadmin` fields.
+* **2025-11-14:** Decoupled database definitions from the core cluster to support DDL-driven workflows and multi-tenancy. This involved removing `spec.databases` from `MultigresCluster` and introducing the new `MultigresDatabase` (logical "claim") and `MultigresDatabaseResources` (physical spec) CRDs. Added `spec.systemCatalog` to `MultigresCluster` to explicitly manage the implicit "default database" for metadata.
diff --git a/plans/phase-1/api-design/multigres-operator-api-v1alpha1-design.md b/plans/phase-1/api-design/multigres-operator-api-v1alpha1-design.md
new file mode 100644
index 00000000..3698ce18
--- /dev/null
+++ b/plans/phase-1/api-design/multigres-operator-api-v1alpha1-design.md
@@ -0,0 +1,1312 @@
+# MultigresCluster API v1alpha1
+
+This is the final design we will use to implement the operator, based on discussions with the Multigres team.
+
+## Summary
+
+This proposal defines the `v1alpha1` API for the Multigres Operator. The design is centered on a root `MultigresCluster` resource that acts as the single source of truth, supported by three specifically scoped template resources:
+
+1. **`MultigresCluster`**: The root resource defining the desired state (intent) of the entire cluster.
+2. **`CoreTemplate`**: A reusable, namespaced resource for defining standard configurations for core control plane components (`GlobalTopoServer` and `MultiAdmin`).
+3. **`CellTemplate`**: A reusable, namespaced resource for defining standard configurations for Cell-level components (`MultiGateway` and optionally `LocalTopoServer`).
+4. **`ShardTemplate`**: A reusable, namespaced resource for defining standard configurations for Shard-level components (`MultiOrch` and `Pools`).
+
+All other resources (`TopoServer`, `Cell`, `TableGroup`, `Shard`) should be considered read-only child CRs owned by the `MultigresCluster`. These child CRs reflect the *realized state* of the system and are managed by their own dedicated controllers. If the user edits them directly, they will be immediately reverted by the parent controller.
+
+## Motivation
+
+Managing a distributed, sharded database system across multiple failure domains is inherently complex. Previous iterations explored monolithic CRDs (too complex), purely composable CRDs (too manual), and "managed" flags (unstable state). The formalized parent/child model addresses these by ensuring:
+
+ * **Separation of Concerns:** Splitting logic into child CRs results in simple, specialized controllers.
+ * **Single Source of Truth:** The `MultigresCluster` is the only editable entry point for cluster topology, preventing conflicting states.
+ * **Scoped Reusability:** By splitting templates into `CoreTemplate`, `CellTemplate`, and `ShardTemplate`, we provide clear, reusable configurations.
+ * **Explicit Topology:** Removing "shard count" partitioning in favor of explicit shard list definitions provides deterministic control over where exactly data lives.
+ * **Consistent Override Chain:** All components follow a predictable 4-level override chain, providing maximum flexibility while maintaining a clear and consistent API pattern.
+
+## Proposal: API Architecture and Resource Topology
+
+ * **Core Components:** `MultiAdmin` and `GlobalTopoServer` are defined as top-level fields in the `MultigresCluster`. Each can be configured inline or by referencing a `CoreTemplate`.
+ * **Cells:** Explicitly defined in the root CR; can be specified inline or via a `CellTemplate`.
+ * **Databases:** Follow a strict physical hierarchy: `Database` -> `TableGroup` -> `Shard`.
+ * **Shards:** Can be specified inline or via a `ShardTemplate`.
+ * **MultiOrch:** Located at the **Shard** level to provide dedicated orchestration.
+
+```ascii
+[MultigresCluster] 🚀 (Root CR - user-editable)
+  │
+  ├── 📝 Defines [TemplateDefaults] (Cluster-wide default templates)
+  │
+  ├── 🌍 [GlobalTopoServer] (Child CR) ← 📄 Uses [CoreTemplate] OR inline [spec]
+  │
+  ├── 🤖 MultiAdmin Resources ← 📄 Uses [CoreTemplate] OR inline [spec]
+  │
+  ├── 💠 [Cell] (Child CR) ← 📄 Uses [CellTemplate] OR inline [spec]
+  │     │
+  │     ├── 🚪 MultiGateway Resources
+  │     └── 📡 [LocalTopoServer] (Child CR, optional)
+  │
+  └── 🗃️ [TableGroup] (Child CR)
+        │
+        └── 📦 [Shard] (Child CR) ← 📄 Uses [ShardTemplate] OR inline [spec]
+              │
+              ├── 🧠 MultiOrch Resources (Deployment/Pod)
+              └── 🏊 Pools (StatefulSets for Postgres+MultiPooler)
+
+📄 [CoreTemplate] (User-editable, scoped config)
+  ├── globalTopoServer
+  └── multiadmin
+
+📄 [CellTemplate] (User-editable, scoped config)
+  ├── multigateway
+  └── localTopoServer (optional)
+
+📄 [ShardTemplate] (User-editable, scoped config)
+  ├── multiorch
+  └── pools (postgres + multipooler)
+```
+
+## Design Details: API Specification
+
+### User Managed CR: MultigresCluster
+
+ * This CR and the three scoped templates (`CoreTemplate`, `CellTemplate`, `ShardTemplate`) are the *only* editable entries for the end-user.
+ * All other child CRs will be owned by this top-level CR. Any manual changes to those child CRs will be immediately reverted by the `MultigresCluster` cluster controller.
+ * All component configurations (`globalTopoServer`, `multiadmin`, `cells`, `shards`) follow a consistent pattern: they can be defined via an inline `spec` or by referencing a template (`templateRef`). Providing both is a validation error.
+ * **4-level Override Chain:** All components use the following 4-level precedence chain for configuration:
+   1. **Component-Level Definition:** An inline `spec` or an explicit `templateRef` on the component itself.
+   2. **Defaults in MultigresCluster spec:** The corresponding template defined in `spec.templateDefaults` (e.g., `templateDefaults.coreTemplate` or `templateDefaults.cellTemplate`).
+   3. **Namespace-Level Default:** A template of the correct kind (e.g., `CoreTemplate`) named `default` in the same namespace.
+   4. **Operator Hardcoded Defaults:** A final fallback applied by the operator's admission webhook.
+ * **Atomic Overrides:** To ensure safety, highly interdependent fields are grouped (e.g., `resources`, `storage`). When using `overrides`, you must replace the *entire* group, not just individual sub-fields (e.g., you cannot override just the `cpu` limit without also providing the `cpu` request).
+ * Images are defined globally to avoid the danger of running multiple mismatched versions at once. This implies the operator handles upgrades.
+ * The `MultigresCluster` does not create its grandchildren directly; for example, shard configuration is passed to the `TableGroup` CR, which then creates its own child `Shard` CRs.
+
+> Note: This document prioritizes API specification accuracy. The child CR and template examples are for illustrative purposes and may not perfectly reflect the exact output generated by the provided `MultigresCluster` configuration.
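+
+As a worked sketch of the override chain (all template names below are hypothetical):
+
+```yaml
+# How two cells would resolve their configuration under the 4-level chain.
+spec:
+  templateDefaults:
+    cellTemplate: "cluster-wide-cell"
+  cells:
+    - name: "a"                  # no inline spec, no cellTemplate:
+                                 #   level 2 wins -> "cluster-wide-cell"
+    - name: "b"
+      cellTemplate: "edge-cell"  # explicit reference: level 1 wins
+# If 'templateDefaults.cellTemplate' were removed, cell "a" would fall back to
+# a CellTemplate named "default" in this namespace (level 3) and, failing that,
+# to the webhook's hardcoded defaults (level 4).
+```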
+
+```yaml
+apiVersion: multigres.com/v1alpha1
+kind: MultigresCluster
+metadata:
+  name: example-cluster
+  namespace: example
+spec:
+  # ----------------------------------------------------------------
+  # Global Images Cluster Configuration
+  # ----------------------------------------------------------------
+  # Images are defined globally to ensure version consistency across
+  # all cells and shards.
+  # NOTE: Perhaps one day we can template these as well.
+  images:
+    imagePullPolicy: "IfNotPresent"
+    imagePullSecrets:
+      - name: "my-registry-secret"
+    multigateway: "multigres/multigres:latest"
+    multiorch: "multigres/multigres:latest"
+    multipooler: "multigres/multigres:latest"
+    multiadmin: "multigres/multigres:latest"
+    postgres: "postgres:15.3"
+
+  # ----------------------------------------------------------------
+  # Cluster-Level Template Defaults
+  # ----------------------------------------------------------------
+  # These are the default templates to use for any component that
+  # does not have an inline 'spec' or an explicit 'templateRef'.
+  # These are optional.
+  # If these defaults are not specified, the controller will pick whichever is
+  # named 'default' in the namespace, or use the controller defaults.
+  templateDefaults:
+    coreTemplate: "cluster-wide-core"
+    cellTemplate: "cluster-wide-cell"
+    shardTemplate: "cluster-wide-shard"
+
+  # ----------------------------------------------------------------
+  # Global Components
+  # ----------------------------------------------------------------
+
+  # Global TopoServer is a singleton. It follows the 4-level override chain.
+  # It supports exactly one of 'etcd', 'external', or 'templateRef'.
+  globalTopoServer:
+    # --- OPTION 1: Inline Etcd Spec ---
+    etcd:
+      image: "quay.io/coreos/etcd:v3.5"
+      replicas: 3
+      storage:
+        size: "10Gi"
+        class: "standard-gp3"
+      resources:
+        requests:
+          cpu: "500m"
+          memory: "1Gi"
+        limits:
+          cpu: "1"
+          memory: "2Gi"
+
+    # --- OPTION 2: Inline External Spec ---
+    # external:
+    #   endpoints:
+    #     - "https://etcd-1.infra.local:2379"
+    #     - "https://etcd-2.infra.local:2379"
+    #   caSecret: "etcd-ca-secret"
+    #   clientCertSecret: "etcd-client-cert-secret"
+
+    # --- OPTION 3: Explicit Template Reference ---
+    # templateRef: "my-explicit-core-template"
+
+  # MultiAdmin is a singleton. It follows the 4-level override chain.
+  # It supports exactly one of 'spec' or 'templateRef'.
+ multiadmin: + # --- OPTION 1: Inline Spec --- + spec: + replicas: 2 + resources: + requests: + cpu: "200m" + memory: "256Mi" + limits: + cpu: "500m" + memory: "512Mi" + # Affinity can be configured too by user if desired + # affinity: + # podAntiAffinity: + # preferredDuringSchedulingIgnoredDuringExecution: + # - weight: 100 + # podAffinityTerm: + # labelSelector: + # matchLabels: + # app.kubernetes.io/component: multiadmin + # topologyKey: "kubernetes.io/hostname" + + # --- OPTION 2: Explicit Template Reference --- + # templateRef: "my-explicit-core-template" + + # ---------------------------------------------------------------- + # Cells Configuration + # ---------------------------------------------------------------- + cells: + # --- CELL 1: Using an Explicit Template --- + - name: "us-east-1a" + # Location must be strictly one of: 'zone' OR 'region' + zone: "us-east-1a" + # region: "us-east-1" # use instead of zone + + cellTemplate: "standard-cell-ha" + + # Optional overrides applied ON TOP of the template. + overrides: + multigateway: + replicas: 3 + + # --- CELL 2: Using Inline Spec (No Template) --- + - name: "us-east-1b" + zone: "us-east-1b" + spec: + multigateway: + replicas: 2 + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + # --- Optional Local TopoServer --- + # If omitted, this cell uses the GlobalTopoServer. + # localTopoServer: + # etcd: + # image: "quay.io/coreos/etcd:v3.5" + # replicas: 3 + # storage: + # size: "5Gi" + # class: "standard-gp3" + + # --- CELL 3: Using Cluster Default Template --- + - name: "us-east-1c" + zone: "us-east-1c" + # 'spec' and 'cellTemplate' are omitted. + # This will use 'spec.templateDefaults.cellTemplate' ("cluster-wide-cell") + # (if that is not set, it will look for 'CellTemplate' named 'default', + # and if that is not found, it will use the webhook default). + + # ---------------------------------------------------------------- + # Database Topology (Database -> TableGroup -> Shard) + # ---------------------------------------------------------------- + databases: + # --- EXAMPLE 1: Configuring the System Default Database --- + # This entry targets the system-level database created during bootstrap. + # We mark it as 'default: true' to apply this configuration to the + # bootstrap resources (instead of creating a new user database). + - name: "postgres" + default: true + tablegroups: + - name: "default" + default: true + shards: + - name: "0" + # define resources for the system default shard + shardTemplate: "standard-shard-ha" + + # --- EXAMPLE 2: A User Database --- + - name: "production_db" + # default: false (Implicit) - This creates a new logical database + tablegroups: + # The default TableGroup for this specific database. + # Note that the default TableGroup can only have one shard (i.e. unsharded). + # It handles all tables not explicitly moved to 'orders_tg' in this example. 
+ - name: "main_unsharded" + default: true + shards: + - name: "0" + spec: + multiorch: + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + main-app: + type: "readWrite" + cells: + - "us-east-1b" + replicasPerCell: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + + # A custom sharded group for high-volume data + - name: "orders_tg" + # default: false (Implicit) + shards: + # --- SHARD 0: Using Inline Spec (No Template) --- + - name: "0" + spec: + multiorch: + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + main-app: + type: "readWrite" + cells: + - "us-east-1b" + replicasPerCell: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" + + # --- SHARD 1: Using an Explicit Template --- + - name: "1" + shardTemplate: "standard-shard-ha" + # Overrides are crucial here to pin pools to specific cells + # if the template uses generic cell names. + overrides: + # MAP STRUCTURE: Keyed by pool name for safe targeting. + pools: + # Overriding the pool named 'main-app' from the template + # to ensure it lives in a specific cell for this shard. + main-app: + cells: + - "us-east-1a" + + # --- SHARD 2: Using Cluster Default Template --- + - name: "2" + # 'spec' and 'shardTemplate' are omitted. + # This will use 'spec.templateDefaults.shardTemplate' ("cluster-wide-shard") + # (or 'default' template, or webhook default). + # We still must provide overrides for required fields. + overrides: + pools: + main-app: + cells: + - "us-east-1c" + +status: + observedGeneration: 1 + conditions: + - type: Available + status: "True" + lastTransitionTime: "2025-11-07T12:00:00Z" + message: "All components are available." + # Aggregated status for high-level visibility + cells: + us-east-1a: + ready: True + gatewayReplicas: 3 + us-east-1b: + ready: True + gatewayReplicas: 2 + us-east-1c: + ready: True + gatewayReplicas: 2 # (Assuming default is 2) + databases: + production_db: + readyShards: 3 + totalShards: 3 +``` + +### User Managed CR: CoreTemplate + + * This CR is NOT a child resource. It is purely a configuration object. + * It is namespaced to support RBAC scoping (e.g., platform team owns templates, dev team owns clusters). + * It defines the shape of the cluster's core control plane, which comprises Global Topo Server and MultiAdmin. A `CoreTemplate` can contain definitions for *both* components. When a component (e.g., `globalTopoServer`) references this template, the controller will extract the relevant section. 
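For example, a cluster that points its global topo server at this template (a minimal sketch; the template name matches the example below) resolves only the template's `globalTopoServer` section and ignores its `multiadmin` section:

```yaml
apiVersion: multigres.com/v1alpha1
kind: MultigresCluster
metadata:
  name: example-cluster
  namespace: example
spec:
  globalTopoServer:
    # Only the 'globalTopoServer' section of the referenced
    # CoreTemplate is extracted and applied here.
    templateRef: "standard-core-ha"
```

A full `CoreTemplate`, defining both sections, looks like this: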
```yaml
apiVersion: multigres.com/v1alpha1
kind: CoreTemplate
metadata:
  name: "standard-core-ha"
  namespace: example
spec:
  # Defines the Global TopoServer component
  globalTopoServer:
    # --- OPTION 1: Managed by Operator ---
    etcd:
      image: "quay.io/coreos/etcd:v3.5"
      replicas: 3
      storage:
        size: "10Gi"
        class: "standard-gp3"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
    # --- ALTERNATIVE OPTION 2: External Etcd ---
    # external:
    #   endpoints:
    #     - "https://etcd-1.infra.local:2379"
    #   caSecret: "etcd-ca-secret"
    #   clientCertSecret: "etcd-client-cert-secret"

  # Defines the MultiAdmin component
  multiadmin:
    spec:
      replicas: 2
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      # Affinity can be configured by the user here as well:
      # affinity:
      #   podAntiAffinity:
      #     preferredDuringSchedulingIgnoredDuringExecution:
      #       - weight: 100
      #         podAffinityTerm:
      #           labelSelector:
      #             matchLabels:
      #               app.kubernetes.io/component: multiadmin
      #           topologyKey: "kubernetes.io/hostname"
```

### User Managed CR: CellTemplate

* This CR is NOT a child resource. It is purely a configuration object.
* It is namespaced to support RBAC scoping (e.g., the platform team owns templates, the dev team owns clusters).
* Templates are not reconciled until referenced by a `MultigresCluster`.

```yaml
apiVersion: multigres.com/v1alpha1
kind: CellTemplate
metadata:
  name: "standard-cell-ha"
  namespace: example
spec:
  # A template strictly defines only Cell-scoped components.
  multigateway:
    replicas: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: multigateway
              topologyKey: "kubernetes.io/hostname"

  # --- OPTIONAL: Local TopoServer ---
  # Define this if cells using this template should have their own dedicated etcd.
  # If omitted, cells use the GlobalTopoServer by default.
  #
  # localTopoServer:
  #   etcd:
  #     image: "quay.io/coreos/etcd:v3.5"
  #     replicas: 3
  #     storage:
  #       class: "standard-gp3"
  #       size: "5Gi"
  #     resources:
  #       requests:
  #         cpu: "500m"
  #         memory: "1Gi"
  #       limits:
  #         cpu: "1"
  #         memory: "2Gi"
```

### User Managed CR: ShardTemplate

* This CR is NOT a child resource. It is purely a configuration object.
* It is namespaced to support RBAC scoping (e.g., the platform team owns templates, the dev team owns clusters).
* Templates are not reconciled until referenced by a `MultigresCluster`.
* Like `CellTemplate`, it defines the "shape" of a shard: its orchestration and its data pools.
* **`pools` is a MAP, keyed by pool name:** A map structure makes overrides resilient to changes in the underlying template. With a list, inserting or reordering items shifts indices, so an override targeting "index 1" could silently apply to the wrong pool if the template order changes. A keyed map guarantees that an override for a specific pool (e.g., `main-app`) always targets that exact logical resource, regardless of how other pools are added or organized in the template.
* **MultiOrch Placement:** `multiorch` is deployed to the cells listed in `multiorch.cells`.
If this list is empty or omitted, it defaults to all cells where pools are defined. +* **Pool Placement:** `pools` uses a `cells` list. For `readWrite` pools, this list typically contains only a few cells rather than using all available cells. For `readOnly` pools, this list can contain multiple cells to apply the same configuration across multiple zones and regions. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: ShardTemplate +metadata: + name: "standard-shard-ha" + namespace: example +spec: + # Template strictly defines only Shard-scoped components. + + # MultiOrch is a shard-level component. + # The Operator will deploy one instance of this Deployment into EVERY Cell + # listed in 'cells'. If 'cells' is empty, it defaults to all cells + # where pools are defined. + multiorch: + # replicas: 1 # replicas per cell and pool this multiorch is deployed + cells: [] # Defaults to all populated cells + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + + # MAP STRUCTURE: Keyed by pool name for safe targeting. + pools: + main-app: + type: "readWrite" + replicasPerCell: 2 + storage: + class: "gp3" + size: "100Gi" + postgres: + resources: + requests: + cpu: "1" + memory: "2Gi" + limits: + cpu: "2" + memory: "4Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" + # Affinity is optional. + # The Operator automatically applies zone-spreading based on the Cell definition. + # This field allows adding EXTRA constraints (e.g., specific node types). + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: disktype + operator: In + values: + - ssd + dr-replica: + type: "readOnly" + # This pool will be deployed to all cells listed here. + cells: + - "us-west-2a" + replicasPerCell: 1 + storage: + class: "standard-gp3" + size: "100Gi" + postgres: + resources: + requests: + cpu: "1" + memory: "2Gi" + limits: + cpu: "2" + memory: "4Gi" + multipooler: + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" +``` + +### The `default` Flag (System Resources) + +Both `Database` and `TableGroup` entries support a boolean `default` flag (defaulting to `false`). This flag maps the definition to the **System Default** infrastructure created during the cluster's bootstrap phase. + + * **On a Database (`default: true`):** Indicates this entry defines the configuration for the system-level database (typically named `postgres`) that contains the global catalog. There can be only one default database per cluster. + * **On a TableGroup (`default: true`):** Indicates this entry defines the configuration for the "Catch-All" or "Unsharded" group of that database. Every database has exactly one Default TableGroup where tables land by default. + +Defining these entries allows the user to explicitly configure the resources (replicas, storage, compute) allocated to these system components, rather than relying on hardcoded operator defaults. + +### Child Resources (Read-Only) + +These resources are created and reconciled by the `MultigresCluster` controller. + +> NOTE: At some point we may want to consider adding status fields for the children to say what template the config is coming from, for simplicity not defining that now. + +#### Child CR: TopoServer + + * Applies to both Global (owned by `MultigresCluster`) and Local (owned by `Cell`) topology servers. 
+ * This CR does *not* exist if a separate, `external` etcd definition is used in the parent (e.g. using etcd-operator to provision one). + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: TopoServer +metadata: + name: "example-cluster-global" + namespace: example + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresCluster + name: "example-cluster" + controller: true +spec: + # Resolved from MultigresCluster. + replicas: 3 + storage: + size: "10Gi" + class: "standard-gp3" + image: "quay.io/coreos/etcd:v3.5" + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1" + memory: "2Gi" +status: + conditions: + - type: Available + status: "True" + lastTransitionTime: "2025-11-07T12:01:00Z" + clientService: "example-cluster-global-client" + peerService: "example-cluster-global-peer" +``` + +#### Child CR: Cell + + * Owned by `MultigresCluster`. + * Strictly contains `MultiGateway` and optional `LocalTopoServer`. + * The `allCells` field is used for discovery by gateways. + +```yaml +apiVersion: multigres.com/v1alpha1 +kind: Cell +metadata: + name: "example-cluster-us-east-1a" + namespace: example + labels: + multigres.com/cluster: "example-cluster" + multigres.com/cell: "us-east-1a" + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresCluster + name: "example-cluster" + uid: "a1b2c3d4-1234-5678-90ab-f0e1d2c3b4a5" + controller: true +spec: + name: "us-east-1a" + zone: "us-east-1a" # this would be region if that was chosen instead. + + # Image passed down from global configuration + multigatewayImage: "multigres/multigres:latest" + + # Resolved from CellTemplate + Overrides + multigateway: + replicas: 3 + resources: + requests: + cpu: "500m" + memory: "512Mi" + limits: + cpu: "1" + memory: "1Gi" + affinity: + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app.kubernetes.io/component: multigateway + topologyKey: "kubernetes.io/hostname" + + # A reference to the GLOBAL TopoServer. + # Always populated by the parent controller if no local server is used. + globalTopoServer: + address: "example-cluster-global-client.example.svc.cluster.local:2379" + rootPath: "/multigres/global" + implementation: "etcd2" + + # Option 1: Using the Global TopoServer (Default) + # + topoServer: {} + + # Option 2: Inline Definition (External) + # + # topoServer: + # external: + # address: "my-etcd.some-namespace.svc.cluster.local:2379" + # rootPath: "/multigres/us-east-1a" + # implementation: "etcd2" + + # Option 3: Managed Local + # + # topoServer: + # etcd: + # rootPath: "/multigres/us-east-1a" + # image: "quay.io/coreos/etcd:v3.5" + # replicas: 3 + # storage: + # size: "5Gi" + # class: "standard-gp3" + + # List of all cells in the cluster for discovery. + allCells: + - "us-east-1a" + - "us-east-1b" + + # Topology flags for the Cell controller to act on. + topologyReconciliation: + registerCell: true + pruneTablets: true + +status: + conditions: + - type: Available + status: "True" + lastTransitionTime: "2025-11-07T12:01:00Z" + gatewayReplicas: 3 + gatewayReadyReplicas: 3 + gatewayServiceName: "example-cluster-us-east-1a-gateway" +``` + +#### Child CR: TableGroup + + * Owned by `MultigresCluster`. + * Acts as the middle-manager for Shards. It MUST contain the fully resolved specification for all shards it manages, enabling it to be the single source of truth for creating its child `Shard` CRs. 
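Child CR names are derived by concatenation, which is why the webhook enforces a combined name-length budget (see the validation rules later in this document). A sketch of the convention, as inferred from the names used in these examples (underscores appear to be normalized to hyphens):

```yaml
# ClusterName:    example-cluster
# DatabaseName:   production_db   (normalized to 'production-db')
# TableGroupName: orders_tg       (normalized to 'orders-tg')
#
# TableGroup CR:  example-cluster-production-db-orders-tg
# Shard CR:       example-cluster-production-db-orders-tg-0
# Pool workload:  example-cluster-production-db-orders-tg-0-main-app
```

The fully resolved `TableGroup` CR pushed down by the `MultigresCluster` controller is shown below.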
+ +```yaml +apiVersion: multigres.com/v1alpha1 +kind: TableGroup +metadata: + name: "example-cluster-production-db-orders-tg" + namespace: example + labels: + multigres.com/cluster: "example-cluster" + multigres.com/database: "production_db" + multigres.com/tablegroup: "orders_tg" + ownerReferences: + - apiVersion: multigres.com/v1alpha1 + kind: MultigresCluster + name: "example-cluster" + controller: true +spec: + databaseName: "production_db" + tableGroupName: "orders_tg" + default: false + + # Images passed down from global configuration + images: + multiorch: "multigres/multigres:latest" + multipooler: "multigres/multigres:latest" + postgres: "postgres:15.3" + + # A reference to the GLOBAL TopoServer. + globalTopoServer: + address: "example-cluster-global-client.example.svc.cluster.local:2379" + rootPath: "/multigres/global" + implementation: "etcd2" + + # The list of FULLY RESOLVED shard specifications. + # This is pushed down from the MultigresCluster controller. + shards: + - name: "0" + multiorch: + cells: + - "us-east-1a" + - "us-east-1b" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + main-app: + cells: + - "us-east-1a" + type: "readWrite" + replicasPerCell: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" + + - name: "1" + multiorch: + cells: + - "us-east-1b" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + main-app: + cells: + - "us-east-1b" + type: "readWrite" + replicasPerCell: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" + + # This shard's spec is resolved from a template + # (e.g., "cluster-wide-shard") + - name: "2" + multiorch: + cells: + - "us-east-1c" + resources: + requests: + cpu: "100m" + memory: "128Mi" + limits: + cpu: "200m" + memory: "256Mi" + pools: + main-app: + cells: + - "us-east-1c" # Resolved from override + type: "readWrite" + replicasPerCell: 2 + storage: + size: "100Gi" + class: "standard-gp3" + postgres: + resources: + requests: + cpu: "2" + memory: "4Gi" + limits: + cpu: "4" + memory: "8Gi" + multipooler: + resources: + requests: + cpu: "1" + memory: "512Mi" + limits: + cpu: "2" + memory: "1Gi" + +status: + readyShards: 3 + totalShards: 3 +``` + +#### Child CR: Shard + + * Owned by `TableGroup`. + * Contains `MultiOrch` (consensus management) and `Pools` (actual data nodes). + * Represents one entry from the `TableGroup` shards list. 
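To make the placement semantics concrete, here is a rough fan-out sketch for the shard shown below (an illustration of the intended semantics, not generated operator output):

```yaml
# Given:
#   multiorch.cells:                ["us-east-1a"]
#   pools.main-app.cells:           ["us-east-1a"]
#   pools.main-app.replicasPerCell: 2
#
# Expected footprint:
#   us-east-1a: 1 MultiOrch instance
#   us-east-1a: 2 'main-app' replicas (each with postgres + multipooler)
#
# Had the pool listed two cells with replicasPerCell: 2, it would run
# 2 replicas in EACH listed cell (4 in total).
```

The fully resolved `Shard` CR looks like this: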
```yaml
apiVersion: multigres.com/v1alpha1
kind: Shard
metadata:
  name: "example-cluster-production-db-orders-tg-0"
  namespace: example
  labels:
    multigres.com/cluster: "example-cluster"
    multigres.com/database: "production_db"
    multigres.com/tablegroup: "orders_tg"
    multigres.com/shard: "0"
  ownerReferences:
    - apiVersion: multigres.com/v1alpha1
      kind: TableGroup
      name: "example-cluster-production-db-orders-tg"
      controller: true
spec:
  databaseName: "production_db"
  tableGroupName: "orders_tg"
  shardName: "0"

  # Images passed down from global configuration
  images:
    multiorch: "multigres/multigres:latest"
    multipooler: "multigres/multigres:latest"
    postgres: "postgres:15.3"

  # A reference to the GLOBAL TopoServer.
  globalTopoServer:
    address: "example-cluster-global-client.example.svc.cluster.local:2379"
    rootPath: "/multigres/global"
    implementation: "etcd2"

  # Fully resolved from parent TableGroup spec
  multiorch:
    cells:
      - "us-east-1a"
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "200m"
        memory: "256Mi"

  pools:
    main-app:
      cells:
        - "us-east-1a"
      type: "readWrite"
      replicasPerCell: 2
      storage:
        size: "100Gi"
        class: "standard-gp3"
      postgres:
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      multipooler:
        resources:
          requests:
            cpu: "1"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "1Gi"
status:
  cells:
    - "us-east-1a"
  orchReady: True
  poolsReady: True
```

## Defaults & Webhooks

To simplify the user experience and ensure cluster stability, the operator uses a combination of **Mutating Webhooks** (for applying defaults) and **Validating Webhooks** (for synchronous checks), alongside **Controller Finalizers** (for asynchronous protection).

### 1. Configuration Defaults (Mutating Webhook)

A mutating admission webhook applies a strict **4-Level Override Chain** to resolve configurations. This logic is applied consistently for Cluster Components (`GlobalTopoServer`, `MultiAdmin`, `Cells`) and Database Shards.

#### Cluster Component Override Chain

*Applies to `GlobalTopoServer`, `MultiAdmin`, and `Cells` defined in `MultigresCluster`.*

1. **Component-Level Definition (Highest):** An inline `spec` or an explicit `templateRef` / `cellTemplate`, along with optional `overrides` on the component itself.
2. **Cluster-Level Default Template:** The corresponding template defined in `spec.templateDefaults` (e.g., `templateDefaults.coreTemplate` or `templateDefaults.cellTemplate`).
3. **Namespace-Level Default Template:** A template of the correct kind (e.g., `CoreTemplate`) named `default` in the same namespace as the cluster.
4. **Operator Hardcoded Defaults (Lowest):** A final fallback applied by the operator code (e.g., default resources, default replicas).

#### Shard Override Chain

*Applies to every Shard defined in `spec.databases[].tablegroups[].shards[]`.*

1. **Shard-Level Definition (Highest):** An inline `spec` or an explicit `shardTemplate` defined on the specific shard entry in the `MultigresCluster` YAML.
2. **Cluster-Level Default Template:** The `spec.templateDefaults.shardTemplate` field in the root `MultigresCluster` CR.
3. **Namespace-Level Default Template:** A `ShardTemplate` named `default` in the same namespace as the cluster.
4. **Operator Hardcoded Defaults (Lowest):** A final fallback applied by the operator.
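To illustrate how the chain and the atomic-override rule compose, consider a hypothetical shard entry (the shard name and override values are illustrative; the template values are taken from the `ShardTemplate` example above):

```yaml
# Hypothetical shard entry (level 1: explicit template + overrides):
- name: "3"                           # illustrative shard, not from the examples
  shardTemplate: "standard-shard-ha"
  overrides:
    pools:
      main-app:
        cells:                        # a list: the override replaces it wholesale
          - "us-east-1b"
        storage:                      # an atomic group: both fields must be given,
          size: "200Gi"               # even though only 'size' changes
          class: "gp3"

# Effective configuration for pool 'main-app' after resolution:
#   type: "readWrite"        <- from the template
#   replicasPerCell: 2       <- from the template
#   cells: ["us-east-1b"]    <- from the override (replaced, not merged)
#   storage:                 <- from the override (whole group)
#     size: "200Gi"
#     class: "gp3"
```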
**List Replacement Behavior:** When overriding the `cells` field (in pools or multiorch), the new list specified in the override *completely replaces* the list defined in the template. It is not merged or appended.

-----

### 2. Synchronous Validating Webhooks

Webhooks are used *only* for fast, synchronous, semantic validation that prevents invalid configurations from being accepted by the API server.

#### `MultigresCluster`

* **On `CREATE` and `UPDATE`:**
  * **Template Existence:** Validates that all templates referenced in `spec.templateDefaults` or explicitly in components (Core, Cell, or Shard templates) exist *at the time of application*.
  * **Component Spec Mutex:** Enforces that `etcd`, `external`, and `templateRef` are mutually exclusive for components like `GlobalTopoServer`.
  * **Uniqueness:** Validates that all names are unique within their respective scopes:
    * Cell names in `spec.cells`.
    * Database names in `spec.databases`.
    * TableGroup names within a Database.
    * Shard names within a TableGroup.
  * **Topology Integrity:** Verifies that if a Shard is pinned to a specific cell (via overrides), that cell exists in the `spec.cells` list.
  * **Name Length Safety:** Validates that the combined length of `ClusterName` + `DatabaseName` + `TableGroupName` does not exceed **50 characters**.
    * *Reasoning:* These names are concatenated to form the TableGroup and Shard names. Kubernetes labels and StatefulSet service names have a strict 63-character limit. Enforcing this limit early prevents downstream deployment failures (e.g., `example-cluster-production-db-orders-tg-0-main-app` must stay under 63 characters).

-----

### 3. Scheduling and Placement (Affinity)

The Operator enforces High Availability (HA) by default while allowing user customization.

1. **Operator-Enforced Placement (Mandatory):**
   * The Operator automatically injects `nodeSelector` or `nodeAffinity` rules to ensure Pods scheduled for a specific `Cell` land on nodes belonging to that Cell (e.g., `topology.kubernetes.io/zone: us-east-1a`).
   * The Operator automatically injects `podAntiAffinity` to spread replicas of the same Shard/Pool across different nodes within that Cell to survive node failures.

2. **User-Defined Constraints (Additive):**
   * Users can define `affinity` and `tolerations` in the `PoolSpec` (either in the Template or via Overrides).
   * **Merge Logic:** User constraints are **appended** to the Operator's mandatory constraints. This allows users to restrict scheduling further (e.g., "only run on nodes labeled `disktype=ssd`") but prevents them from violating the fundamental Cell topology (e.g., a "us-east-1a" pod cannot be scheduled onto a "us-west-2b" node).
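A sketch of the merged affinity the Operator might produce for a `main-app` pod in cell `us-east-1a` (the injected rules and the anti-affinity label are assumptions for illustration; only the `disktype` expression comes from the user's `PoolSpec`):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Injected by the Operator (mandatory): pin the pod to its Cell's zone.
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a
            # Appended from the user's PoolSpec (additive constraint).
            - key: disktype
              operator: In
              values:
                - ssd
  podAntiAffinity:
    # Injected by the Operator (mandatory): spread replicas across nodes.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              multigres.com/shard: "0"
          topologyKey: "kubernetes.io/hostname"
```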
### 4. Asynchronous Controller and Finalizer Logic

Asynchronous logic is used for operations that depend on external state or require blocking deletion, handled by controllers and finalizers.

#### `MultigresCluster`

* **Deletion Protection (Finalizer):**
  1. The `MultigresCluster` controller adds a finalizer (e.g., `multigres.com/finalizer`) to the CR upon creation.
  2. **Logic:** When the user deletes the cluster, the finalizer blocks deletion until all Child CRs (`Cell`, `Shard`, `TopoServer`) have been successfully deleted and confirmed gone by the API server. This ensures no orphaned resources (like expensive cloud load balancers or PVCs) are left behind.

#### `CoreTemplate`, `CellTemplate`, `ShardTemplate`

* **In-Use Protection (Validating Webhook):**
  1. **Mechanism:** A Validating Webhook intercepts all `DELETE` operations on Template resources.
  2. **Logic:**
     * The webhook queries for any `MultigresCluster` resources that reference the template being deleted.
     * To make this query efficient, the `MultigresCluster` controller must apply **Tracking Labels** to the Cluster CR whenever a template is referenced (e.g., `multigres.com/cell-template: prod-small`).
     * If the query returns any results (meaning the template is in use), the webhook **rejects** the deletion request with a `403 Forbidden` error.
  3. **Benefit:** This prevents the Template from ever entering a "Terminating" state, ensuring it remains fully editable and active even if a user accidentally tries to delete it.
  4. **Race Condition Handling (Fail Safe):** In the rare race where a template is deleted immediately after a cluster is created (before the controller applies tracking labels), the Cluster Controller handles the missing template gracefully by setting the Cluster Status Condition to `Ready=False` (Reason: `TemplateMissing`) and pausing reconciliation. This ensures no data loss or configuration drift occurs; the operator simply waits for the user to restore the template or update the reference.

## End-User Examples

### 1. The Ultra-Minimalist (Relying on Namespace/Webhook Defaults)

This creates a Multigres cluster with one cell, one database, one tablegroup, and one shard.

The defaults for this ultra-minimalist example can come from two places:

1. All components are defaulted by the operator's webhook.
2. If a `CoreTemplate`, `CellTemplate`, or `ShardTemplate` named `default` exists in the same namespace, it is used as the default for the corresponding components.

> Notice that the `cells` field is still required, but we are not naming the cell. We are not yet sure whether we can safely pick a default zone or region from the cluster; if we can do so safely, this field won't be needed either.

```yaml
apiVersion: multigres.com/v1alpha1
kind: MultigresCluster
metadata:
  name: minimal-cluster
spec:
  cells:
    - zone: "us-east-1a"
```

After applying this, `kubectl get multigrescluster minimal-cluster -o yaml` shows all of the values materialized, since the defaults are applied via webhook.
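For illustration, the materialized object might look roughly like the following (a sketch only; the concrete values are whatever the webhook or the `default` templates provide, and the generated cell name is an assumption):

```yaml
apiVersion: multigres.com/v1alpha1
kind: MultigresCluster
metadata:
  name: minimal-cluster
spec:
  images:
    multigateway: "multigres/multigres:latest"   # webhook default
    multiorch: "multigres/multigres:latest"
    multipooler: "multigres/multigres:latest"
    multiadmin: "multigres/multigres:latest"
    postgres: "postgres:15.3"
  globalTopoServer:
    etcd:
      replicas: 3                # webhook default
      storage:
        size: "10Gi"
  cells:
    - name: "zone-us-east-1a"    # hypothetical generated name
      zone: "us-east-1a"
      spec:
        multigateway:
          replicas: 2            # webhook default
```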
### 2. The Minimalist (Relying on Cluster Defaults)

This user relies on the `spec.templateDefaults` field to set cluster-wide defaults.

```yaml
apiVersion: multigres.com/v1alpha1
kind: MultigresCluster
metadata:
  name: minimal-cluster-with-defaults
spec:
  # This cluster will use the "dev-defaults" CoreTemplate for its
  # global components. All cells and shards will use their respective
  # "dev-defaults" templates.
  templateDefaults:
    coreTemplate: "dev-defaults"
    cellTemplate: "dev-defaults-cell"
    shardTemplate: "dev-defaults-shard"

  # Core components (globalTopoServer, multiadmin) are omitted,
  # so they will use "dev-defaults" (from the CoreTemplate).

  cells:
    - name: "us-east-1a"
      zone: "us-east-1a"
      # 'spec' and 'cellTemplate' are omitted, so "dev-defaults-cell" is used.
    - name: "us-west-2a"
      zone: "us-west-2a"
      # 'spec' and 'cellTemplate' are omitted, so "dev-defaults-cell" is used.

  databases:
    - name: "db1"
      tablegroups:
        - name: "tg1"
          shards:
            - name: "0"
              # 'spec' and 'shardTemplate' are omitted, so "dev-defaults-shard" is used.
              # We still need to override the 'cells' for the main-app pool.
              overrides:
                pools:
                  main-app:
                    cells:
                      - "us-east-1a"
            - name: "1"
              # 'spec' and 'shardTemplate' are omitted, so "dev-defaults-shard" is used.
              overrides:
                pools:
                  main-app:
                    cells:
                      - "us-west-2a"
```

### 3. The Power User (Explicit Overrides)

This user explicitly defines everything, mixing inline specs and templates and bypassing all defaults.

```yaml
apiVersion: multigres.com/v1alpha1
kind: MultigresCluster
metadata:
  name: power-cluster
spec:
  # This user sets cluster defaults, but overrides them everywhere.
  templateDefaults:
    coreTemplate: "cluster-default-core"
    cellTemplate: "cluster-default-cell"
    shardTemplate: "cluster-default-shard"

  globalTopoServer:
    external:
      endpoints:
        - "https://my-etcd-1.infra:2379"
        - "https://my-etcd-2.infra:2379"
        - "https://my-etcd-3.infra:2379"
      caSecret: "etcd-ca"
      clientCertSecret: "etcd-client-cert"

  multiadmin:
    spec:
      replicas: 1
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "200m"
          memory: "512Mi"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/component: multiadmin
                topologyKey: "kubernetes.io/hostname"

  cells:
    - name: "us-east-1a"
      zone: "us-east-1a"
      cellTemplate: "high-throughput-gateway"
    - name: "us-west-2a"
      zone: "us-west-2a"
      cellTemplate: "standard-gateway"
      overrides:
        multigateway:
          # Overriding the entire resources block
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "1"
              memory: "4Gi"
          # Affinity can be configured by the user here as well:
          # affinity:
          #   podAntiAffinity:
          #     preferredDuringSchedulingIgnoredDuringExecution:
          #       - weight: 100
          #         podAffinityTerm:
          #           labelSelector:
          #             matchLabels:
          #               app.kubernetes.io/component: multigateway
          #           topologyKey: "kubernetes.io/hostname"

  databases:
    - name: "users_db"
      tablegroups:
        - name: "auth"
          shards:
            - name: "0"
              shardTemplate: "geo-distributed-shard"
              overrides:
                # MAP-BASED OVERRIDE: Safely targeting the 'main-app' pool
                pools:
                  main-app:
                    # Partial override of a simple field
                    cells:
                      - "us-east-1a"
                    # Override of Postgres compute for this specific shard
                    postgres:
                      resources:
                        requests:
                          cpu: "8"
                          memory: "16Gi"
                        limits:
                          cpu: "8"
                          memory: "16Gi"
            - name: "1"
              shardTemplate: "geo-distributed-shard"
              overrides:
                pools:
                  main-app:
                    cells:
                      - "us-west-2a"
```

## Implementation History

* **2025-10-08:** Initial proposal to create individual, user-managed CRDs for each component (`MultiGateway`, `MultiOrch`, etc.).
* **2025-10-14:** A second proposal introduced a top-level `MultigresCluster` CR as the primary user-facing API.
* **2025-10-28:** The current "parent/child" model was formalized, designating `MultigresCluster` as the single source of truth with read-only children.
* **2025-11-05:** Explored a simplified V1 API limited to a single shard. Rejected to ensure the API is ready for multi-shard from day one.
* **2025-11-06:** Explored a single "all-in-one" `DeploymentTemplate`. Rejected due to N:1 conflicts when trying to apply one template to both singular Cell components and multiplied Shard components.
* **2025-11-07:** Finalized the "Scoped Template" model (`CellTemplate` & `ShardTemplate`) and restored the full explicit `database` -> `tablegroup` -> `shard` hierarchy.
* **2025-11-10:** Refactored `pools` to use Maps instead of Lists and introduced atomic grouping for `resources` and `storage` to ensure safer template overrides.
* **2025-11-11:** Introduced a consistent 4-level override chain (inline/explicit template -> cluster default -> namespace default -> webhook) for all components. Added the `CoreTemplate` CRD and the `spec.templateDefaults` block to support this. Reverted the `spec.coreComponents` nesting to top-level `globalTopoServer` and `multiadmin` fields.
* **2025-11-14:** Explored a multi-CRD "claim" model (`MultigresDatabase` + `MultigresDatabaseResources`) to support DDL-driven workflows and strong RBAC separation.
* **2025-11-18:** Reverted to the 2025-11-11 monolithic `MultigresCluster` design to align with client requirements for Postgres-native "System Catalog" management. The `MultigresDatabase` CRD was rejected. We will instead support the DDL workflow via a synchronization controller that patches the monolithic `MultigresCluster` CR based on the `SystemCatalog` state.
* **2025-11-25:** Moved `multiorch` and `pool` placement from single `cell` fields to explicit `cells` lists. This supports multi-cell pool definitions (e.g., for uniform read replicas) and decouples MultiOrch placement from pool presence, while maintaining safety via "defaults to all" logic.

## Drawbacks

The requirement to create resources via DDL (`CREATE DATABASE`), giving users a Postgres-native interface, was part of the design discussions. It has not been dropped, only postponed. What follows are caveats that arise from the current design once this Postgres compatibility requirement is incorporated:

* **Broken "Delete" UX (Zombie Resources):** To support the "DB is Truth" requirement, the Operator should not delete a database simply because it is removed from the `spec.databases` list, as the database might still exist in the default database schema (the "system catalog"). This breaks the standard Kubernetes expectation that "deleting config = deleting resource." Users must use imperative SQL (`DROP DATABASE`) to delete resources; removing them from Git will only orphan them, leaving them running and accruing costs ("Zombie Databases"). Adding a bidirectional update process here could complicate things further.
* **Perpetual GitOps Drift:** Since users can create databases via DDL at any time, the `MultigresCluster` CR in Git will rarely match the actual cluster state. `kubectl diff` will be noisy, and the `status` field will become the only reliable view of the system, degrading the value of the declarative spec. This could be mitigated by a component that continuously writes these changes back to Git and applies them declaratively, but that is not a common pattern.
* **Status Object Bloat (Scalability):** Because the Operator must track "Discovered" (DDL-created) databases in the `status` field to make them visible to SREs, a cluster with thousands of databases risks hitting the Kubernetes object size limit (imposed by etcd). This limits the scalability of the reporting mechanism compared to fanning out to separate CRs.
* **Resource "Adoption" Friction:** Databases created via DDL are assigned a default `ShardTemplate`. To "upgrade" or resize these databases, an SRE must manually "adopt" them by adding them to the `MultigresCluster` YAML with the correct name and new template. This introduces a manual step and a potential race condition where a new database might be under-provisioned before it can be adopted.
* **Default DB Schema (System Catalog) Availability Dependency:** The Operator's reconciliation loop now strictly depends on the read availability of the "Default Database" (System Catalog). This acts as a single point of failure for the control plane: if the System Catalog is down, the Operator cannot manage any other part of the cluster, even if those other parts are healthy.
* **API Hotspot and "Blast Radius" Risk:** By centralizing all database definitions into the monolithic `MultigresCluster` CR, this single object becomes a massive reconciliation hotspot. A single typo in this large resource (e.g., while "adopting" a database) could break reconciliation for the entire cluster's control plane.
* **Rename/Replace Ambiguity:** Postgres allows `ALTER DATABASE RENAME`, but Kubernetes relies on stable names. If a user renames a database in SQL, the Operator may perceive this as a "Delete" (of the old name) and a "Create" (of the new name), potentially attempting to re-provision the old name if it still exists in the YAML.
* **Lack of Namespace Isolation:** All databases must be defined in (or adopted into) the central `MultigresCluster` resource. This effectively forces all application teams to rely on platform admins for resource sizing, removing the ability to use Kubernetes RBAC for self-service resource management in separate namespaces. There is no clean DBA/platform-engineer persona separation model.

## Alternatives

Several alternative designs were considered and rejected in favor of the current parent/child model.

### Alternative 1: Component CRDs Only (No Parent)

This model would provide individual, user-managed CRDs for `MultiGateway`, `MultiOrch`, `MultiPooler`, and `Etcd`. Users would be responsible for "composing" a cluster by creating these resources themselves.

* **Pros:** Maximum flexibility and composability.
* **Cons:** Extremely verbose and complex for a standard deployment. Users must manually create all components and wire them together correctly.
* **Rejected Because:** It makes the common case (deploying a full cluster) unnecessarily complex and error-prone.

### Alternative 2: Hybrid Model with `managed: true/false` Flag

This model would feature a top-level `MultigresCluster` CR, but each component section would carry a `managed: true/false` flag. If `true`, the operator manages the child resource; if `false`, the operator ignores it.

* **Pros:** Offers a "best-of-both-worlds" approach.
* **Cons:** Introduces significant complexity around resource ownership and lifecycle (e.g., handling transitions from managed to unmanaged). Creates a high risk of cluster misconfiguration.
* **Rejected Because:** The lifecycle and ownership transitions were deemed too complex and risky for a production-grade operator.

### Alternative 3: The "Claim" Model (`MultigresDatabase`)

This design (explored on 2025-11-14) separated the cluster definition from database definitions. Users would create a `MultigresDatabase` CR in their own namespace, which would "claim" resources from the central cluster.

* **Pros:** Solved the "API Hotspot" problem by fanning out DB definitions. Enabled true Kubernetes-native multi-tenancy via RBAC, and provided a clean, declarative target for DDL translation.
* **Cons:** Introduced additional CRDs.
+ * **Rejected Because:** The client preferred a Postgres-native "System Catalog" approach where the database state is the primary source of truth, rejecting the separation of the "DBA persona" and the additional CRDs. + +### Alternative 4: Helm Charts Instead of CRDs + +This approach would use Helm charts to deploy Multigres components without an operator. + + * **Pros:** Familiar deployment model for many Kubernetes users. + * **Cons:** No automatic reconciliation, no custom status reporting, and no active lifecycle management (failover, scaling, etc.). + * **Rejected Because:** The operator pattern provides superior lifecycle management, observability, and automation, which are critical for a stateful database system. \ No newline at end of file