tabed23/doks-capacity-operator

🌊 DOKS Capacity Operator

A tiny Kubernetes operator that keeps your DigitalOcean node pool's autoscale ceiling one step ahead of demand: automatically, gradually, and with guardrails.



The problem in one breath

DigitalOcean's cluster autoscaler does a great job of adding and removing nodes, right up until it hits the node pool's max. That max is a hard ceiling you set once. The day your workload outgrows it, pods go Pending, the autoscaler shrugs, and someone gets paged to bump a number in the dashboard by hand.

The DOKS Capacity Operator removes that someone.

It watches your node pool and, when free headroom runs low, raises the pool's autoscale max by a small step: never past a hard ceiling you control, never faster than a cooldown you set. The DigitalOcean autoscaler then does what it already does best: add the actual nodes.

Think of it as an autoscaler for your autoscaler's ceiling.


Why not just set max high and forget it?

You could set max: 100 on day one. But then a runaway Deployment, a bad rollout, or a traffic spike can scale you straight to 100 nodes, and to a surprise invoice, in minutes.

This operator gives you the middle path:

| Approach | Behaviour |
|---|---|
| Fixed low max | Safe on cost, but you hit the wall and pods stay Pending. Manual bumps forever. |
| Fixed high max | Never hit the wall, but any runaway workload scales straight to the top. |
| DOKS Capacity Operator | The effective ceiling tracks demand: it creeps up expandBy nodes at a time, no more often than once per cooldownMinutes, and never past your hard maxNodes ceiling. Bounded, gradual, observable. |

You get a graduated capacity ramp with a circuit breaker instead of a binary choice between "too small" and "wide open."


Features

  • 🪜 Graduated expansion: raises the pool max in small expandBy steps as headroom shrinks.
  • 🧱 Hard ceiling: never raises the pool max above your maxNodes value (DO caps pools at 100).
  • ⏳ Cooldown: enforces a minimum gap between expansions so you don't flap.
  • 🔍 Auto-discovery: give it a poolID or a poolName and it finds the cluster for you.
  • 🛑 Autoscale-aware: if the pool's own autoscaling is off, it refuses to act (raising a max nobody reads is pointless) and tells you why.
  • 👀 Observable: rich status phases, Ready conditions, kubectl printer columns, and a Prometheus metrics endpoint.
  • 🪶 Featherweight: 10m CPU / 32Mi memory requested. It does almost nothing, almost all of the time.
  • 📦 GitOps-native: it's a CRD. Commit a YAML, let the controller reconcile.

How it works

flowchart TD
    A["DoksCapacityOperator CR<br/>poolID · triggerFreeNodes · expandBy<br/>maxNodes (ceiling) · cooldownMinutes"] -->|"watch + reconcile every 60s"| B[Capacity Operator]
    B --> C{"Pool autoscaling<br/>enabled?"}
    C -->|No| Z["Phase: Blocked<br/>(raising max would do nothing)"]
    C -->|Yes| D["headroom = poolMax − liveCount"]
    D --> E{"headroom ≤<br/>triggerFreeNodes?"}
    E -->|No| S["Phase: Stable"]
    E -->|Yes| F{"Within<br/>cooldown?"}
    F -->|Yes| W["Phase: Cooldown<br/>(requeue when it expires)"]
    F -->|No| G["newMax = poolMax + expandBy<br/>capped at maxNodes ceiling"]
    G --> H{"newMax ><br/>current poolMax?"}
    H -->|No| C2["Phase: AtCeiling"]
    H -->|Yes| I["DO API: UpdateNodePool(max=newMax)<br/>keeps autoscale on, preserves min"]
    I --> X["Phase: Expanded<br/>record lastExpansionTime"]
    X --> Y["DOKS autoscaler adds<br/>real nodes up to new max"]

In plain English, every 60 seconds the operator:

  1. Resolves the pool: directly if you gave it clusterID + poolID, otherwise by scanning the clusters your token can see and matching on poolID/poolName.
  2. Records observed state to status (resolved IDs, live node count, current pool max).
  3. Checks autoscaling: if it's disabled on the pool, it stops and reports Blocked.
  4. Measures headroom = poolMax − count. Plenty of room → Stable, done.
  5. Respects the cooldown: if the trigger fired but the last expansion was too recent, it waits (Cooldown) and requeues exactly when the window opens.
  6. Expands β€” computes poolMax + expandBy, clamps it to the maxNodes ceiling, and if that's actually higher than today's max, calls the DigitalOcean API to raise it. Phase becomes Expanded. If it's already at the ceiling, AtCeiling.

The operator only ever touches the max. It keeps autoscaling on and preserves your min. The real node add/remove decisions stay with DigitalOcean's autoscaler.
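
One way to read "only ever touches the max" is that the update payload is built from the pool's observed state, with just one field changed. The sketch below assumes hypothetical PoolState and PoolUpdate types; the real request type used against the DigitalOcean API is not shown in this README.

```go
package main

import "fmt"

// PoolState is a hypothetical snapshot of what the API reports for the pool.
type PoolState struct {
	AutoScale bool
	MinNodes  int
	MaxNodes  int
}

// PoolUpdate is a hypothetical update payload for this sketch.
type PoolUpdate struct {
	AutoScale bool
	MinNodes  int
	MaxNodes  int
}

// raiseMaxOnly copies autoscale and min from the observed state and changes
// only the max. It refuses to act on a non-autoscaling pool or to lower the max.
func raiseMaxOnly(cur PoolState, newMax int) (PoolUpdate, error) {
	if !cur.AutoScale {
		return PoolUpdate{}, fmt.Errorf("autoscaling is disabled on the pool")
	}
	if newMax <= cur.MaxNodes {
		return PoolUpdate{}, fmt.Errorf("new max %d is not above current max %d", newMax, cur.MaxNodes)
	}
	return PoolUpdate{AutoScale: true, MinNodes: cur.MinNodes, MaxNodes: newMax}, nil
}

func main() {
	u, err := raiseMaxOnly(PoolState{AutoScale: true, MinNodes: 3, MaxNodes: 10}, 15)
	fmt.Println(u, err) // {true 3 15} <nil>
}
```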


Quick start

Prerequisites

  • A running DOKS cluster with autoscaling enabled on the target node pool
  • A DigitalOcean API token with read/write Kubernetes scope
  • kubectl pointed at the cluster
  • Go 1.22+, Docker, and make for building
  • (optional) kubeseal if you want to commit the token as a SealedSecret

1. Build & push the image

make docker-build docker-push IMG=<your-registry>/doks-capacity-operator:0.1.0

2. Install the CRD

make install

3. Provide the DigitalOcean token

The controller reads it from a secret named doks-do-token (key token) in the doks-system namespace:

kubectl create namespace doks-system

kubectl create secret generic doks-do-token \
  --namespace doks-system \
  --from-literal=token="<YOUR_DO_API_TOKEN>"

💡 GitOps tip: the repo ships a SealedSecret stub in release/. Encrypt your token with kubeseal and commit the result instead of the plaintext secret above; see the comment block in release/release.yaml for the exact command.

4. Deploy the controller

make deploy IMG=<your-registry>/doks-capacity-operator:0.1.0

…or apply the bundled all-in-one manifest (remember to set your real image and imagePullSecrets first):

kubectl apply -f release/release.yaml

5. Tell it which pool to watch

# config/samples/dokscapacityoperator.yaml
apiVersion: platform.mahy.love/v1alpha1
kind: DoksCapacityOperator
metadata:
  name: dokscapacityoperator-sample
spec:
  # No clusterID: the operator discovers it from the pool.
  poolID: "your-pool-uuid"   # or use poolName: "chat-pool"
  triggerFreeNodes: 3        # expand when only 3 nodes of headroom remain
  expandBy: 5                # add 5 to the pool's autoscale max each time
  maxNodes: 20               # HARD CEILING: never raise the pool max above 20
  cooldownMinutes: 15        # wait 15 min between expansions

kubectl apply -f config/samples/dokscapacityoperator.yaml

6. Watch it work

kubectl get oco
NAME                          POOL                   COUNT   MAX   CEILING   PHASE
dokscapacityoperator-sample   a1b2c3d4-...-pool      6       10    20        Stable

Configuration reference

spec fields of a DoksCapacityOperator:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| clusterID | string | – | – | DO cluster UUID. Leave empty to auto-discover from the pool. |
| poolID | string | one of these² | – | DO node pool UUID. Preferred: globally unique. |
| poolName | string | one of these² | – | Node pool name. Used only when poolID is empty. |
| triggerFreeNodes | int | ✅¹ | 3 | Expand when (poolMax − count) ≤ this. |
| expandBy | int | ✅¹ | 5 | Nodes added to the pool's autoscale max per expansion. |
| maxNodes | int | ✅ | – | Hard ceiling. Pool max is raised up to but never above this. Range 1–100. |
| cooldownMinutes | int | ✅¹ | 15 | Minimum minutes between two expansions. |

¹ Required by the schema, but has a default, so you can omit it.
² At least one of poolID / poolName must be set (enforced by the controller).
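
The defaults and constraints in the table can be expressed as a small check. This is a sketch under assumptions: the Spec struct, withDefaults, and validate are hypothetical names, and treating a zero value as "unset" is a simplification (a real CRD would more likely get its defaults from the schema).

```go
package main

import (
	"errors"
	"fmt"
)

// Spec is a hypothetical Go shape for the CR's spec fields.
type Spec struct {
	ClusterID        string
	PoolID           string
	PoolName         string
	TriggerFreeNodes int
	ExpandBy         int
	MaxNodes         int
	CooldownMinutes  int
}

// withDefaults fills the documented defaults. Simplification: 0 means "unset".
func withDefaults(s Spec) Spec {
	if s.TriggerFreeNodes == 0 {
		s.TriggerFreeNodes = 3
	}
	if s.ExpandBy == 0 {
		s.ExpandBy = 5
	}
	if s.CooldownMinutes == 0 {
		s.CooldownMinutes = 15
	}
	return s
}

// validate checks only the rules stated in the table and its footnotes.
func validate(s Spec) error {
	if s.PoolID == "" && s.PoolName == "" {
		return errors.New("at least one of poolID or poolName must be set")
	}
	if s.MaxNodes < 1 || s.MaxNodes > 100 {
		return fmt.Errorf("maxNodes must be between 1 and 100, got %d", s.MaxNodes)
	}
	return nil
}

func main() {
	s := withDefaults(Spec{PoolName: "chat-pool", MaxNodes: 20})
	fmt.Println(s.TriggerFreeNodes, s.ExpandBy, s.CooldownMinutes, validate(s)) // 3 5 15 <nil>
}
```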

Status phases

| Phase | Meaning |
|---|---|
| Stable | Enough headroom; nothing to do. |
| Cooldown | Trigger hit, but waiting out the cooldown window. |
| Expanded | Just raised the pool's autoscale max. |
| AtCeiling | Pool max is at/above the hard ceiling, so it won't be raised further. |
| Blocked | Pool autoscaling is disabled, so raising the max would be a no-op. |
| Error | An API call or pool resolution failed (see .status.message). |

Every reconcile also sets a standard Ready condition you can wait on in scripts:

kubectl wait oco/dokscapacityoperator-sample \
  --for=condition=Ready --timeout=120s

Local development

Run the controller against your current kubeconfig without deploying anything:

export DIGITALOCEAN_TOKEN="<YOUR_DO_API_TOKEN>"
make install        # CRDs into the cluster
make run            # runs the manager locally

Other handy targets:

make manifests      # regenerate CRD + RBAC from kubebuilder markers
make generate       # regenerate deepcopy code
make test           # run unit tests (DO client is faked behind an interface)
make build          # compile the binary into ./bin

The DigitalOcean surface the controller needs is captured in a small DOClient interface, so the reconciler is fully unit-testable with a fake β€” no live API calls in tests.


Project layout

doks-capacity-operator/
├── api/v1alpha1/        # CRD types (DoksCapacityOperatorSpec/Status)
├── cmd/                 # manager entrypoint (main.go)
├── internal/
│   ├── controller/      # the reconcile loop
│   └── do/              # thin godo wrapper: ResolvePool, SetMaxNodes
├── config/              # kustomize bases (CRD, RBAC, manager, samples)
├── release/             # all-in-one release.yaml + SealedSecret stub
├── test/                # e2e + integration tests
├── Dockerfile
└── Makefile

Caveats & honest limitations

  • It manages the ceiling, not the nodes. Actual scaling is still DigitalOcean's autoscaler. If pod scheduling is blocked by something other than node count (taints, resource requests, quotas), raising the max won't help.
  • Gradual, not instant. With expandBy: 5 and cooldownMinutes: 15, a sustained spike still climbs over time. That's the point, but size expandBy/cooldown for your real burst profile.
  • The hard ceiling is your safety net, so set it deliberately. This operator only protects you from runaway speed, not from a high ceiling you chose.
  • One pool per CR. Create one DoksCapacityOperator resource per node pool you want managed.

Before you deploy release/release.yaml

The bundled manifest ships with two placeholders you must set for your environment:

  • Image: it references image: doks-capacity-operator:0.1.0 (no registry). Point it at your own registry, e.g. <your-registry>/doks-capacity-operator:0.1.0.
  • Pull secret: it references imagePullSecrets: [name: secret]. Either create a pull secret with that name in doks-system or update the field to match yours (drop it entirely if your image is public).

💡 Prefer to regenerate rather than hand-edit? make manifests rebuilds the CRD and RBAC straight from the +kubebuilder:rbac markers in the controller, so the generated manifests stay in sync with the code.


Contributing

Issues and PRs welcome. Run make test and make manifests before opening a PR so generated artifacts stay in sync.

License

Apache License 2.0. See the headers in each source file.

Built with ❀️ and kubebuilder · github.com/tabed23/doks-capacity-operator
