A tiny Kubernetes operator that keeps your DigitalOcean node pool's autoscale ceiling one step ahead of demand: automatically, gradually, and with guardrails.
DigitalOcean's cluster autoscaler does a great job of adding and removing nodes, right up until it hits the node pool's max. That max is a hard ceiling you set once. The day your workload outgrows it, pods go Pending, the autoscaler shrugs, and someone gets paged to bump a number in the dashboard by hand.
The DOKS Capacity Operator removes that someone.
It watches your node pool and, when free headroom runs low, it raises the pool's autoscale max by a small step: never past a hard ceiling you control, never faster than a cooldown you set. The DigitalOcean autoscaler then does what it already does best: add the actual nodes.
Think of it as an autoscaler for your autoscaler's ceiling.
You could set `max: 100` on day one. But then a runaway Deployment, a bad rollout, or a traffic spike can scale you straight to 100 nodes, and to a surprise invoice, in minutes.
This operator gives you the middle path:
| Approach | Behaviour |
|---|---|
| Fixed low max | Safe on cost, but you hit the wall and pods stay Pending. Manual bumps forever. |
| Fixed high max | Never hit the wall, but any runaway workload scales straight to the top. |
| DOKS Capacity Operator | The effective ceiling tracks demand: it creeps up `expandBy` nodes at a time, no faster than `cooldownMinutes`, and never past your hard `maxNodes` ceiling. Bounded, gradual, observable. |
You get a graduated capacity ramp with a circuit breaker instead of a binary choice between "too small" and "wide open."
- Graduated expansion: raises the pool max in small `expandBy` steps as headroom shrinks.
- Hard ceiling: never raises the pool max above your `maxNodes` value (DO caps pools at 100).
- Cooldown: enforces a minimum gap between expansions so you don't flap.
- Auto-discovery: give it a `poolID` or a `poolName` and it finds the cluster for you.
- Autoscale-aware: if the pool's own autoscaling is off, it refuses to act (raising a max nobody reads is pointless) and tells you why.
- Observable: rich status phases, `Ready` conditions, `kubectl` printer columns, and a Prometheus metrics endpoint.
- Featherweight: `10m` CPU / `32Mi` memory requested. It does almost nothing, almost all of the time.
- GitOps-native: it's a CRD. Commit a YAML, let the controller reconcile.
```mermaid
flowchart TD
    A["DoksCapacityOperator CR<br/>poolID · triggerFreeNodes · expandBy<br/>maxNodes (ceiling) · cooldownMinutes"] -->|"watch + reconcile every 60s"| B[Capacity Operator]
    B --> C{"Pool autoscaling<br/>enabled?"}
    C -->|No| Z["Phase: Blocked<br/>(raising max would do nothing)"]
    C -->|Yes| D["headroom = poolMax − liveCount"]
    D --> E{"headroom ≤<br/>triggerFreeNodes?"}
    E -->|No| S["Phase: Stable"]
    E -->|Yes| F{"Within<br/>cooldown?"}
    F -->|Yes| W["Phase: Cooldown<br/>(requeue when it expires)"]
    F -->|No| G["newMax = poolMax + expandBy<br/>capped at maxNodes ceiling"]
    G --> H{"newMax ><br/>current poolMax?"}
    H -->|No| C2["Phase: AtCeiling"]
    H -->|Yes| I["DO API: UpdateNodePool(max=newMax)<br/>keeps autoscale on, preserves min"]
    I --> X["Phase: Expanded<br/>record lastExpansionTime"]
    X --> Y["DOKS autoscaler adds<br/>real nodes up to new max"]
```
In plain English, every 60 seconds the operator:
- Resolves the pool: directly if you gave it `clusterID` + `poolID`, otherwise by scanning the clusters your token can see and matching on `poolID` / `poolName`.
- Records observed state to `status` (resolved IDs, live node count, current pool max).
- Checks autoscaling: if it's disabled on the pool, it stops and reports `Blocked`.
- Measures headroom = `poolMax − count`. Plenty of room → `Stable`, done.
- Respects the cooldown: if the trigger fired but the last expansion was too recent, it waits (`Cooldown`) and requeues exactly when the window opens.
- Expands: computes `poolMax + expandBy`, clamps it to the `maxNodes` ceiling, and if that's actually higher than today's max, calls the DigitalOcean API to raise it. Phase becomes `Expanded`. If it's already at the ceiling, `AtCeiling`.
The operator only ever touches the max. It keeps autoscaling on and preserves your min. The real node add/remove decisions stay with DigitalOcean's autoscaler.
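The decision steps above can be sketched as a single pure function. This is a minimal illustration, not the controller's actual code: the `Policy` and `Observed` structs, the `Decide` function, and the `minutes` helper are hypothetical names chosen for this example, and only the phase names and the order of checks come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// Phase names mirror the status phases described in this README.
type Phase string

const (
	Blocked   Phase = "Blocked"
	Stable    Phase = "Stable"
	Cooldown  Phase = "Cooldown"
	AtCeiling Phase = "AtCeiling"
	Expanded  Phase = "Expanded"
)

// Policy holds the spec knobs (hypothetical shape for this sketch).
type Policy struct {
	TriggerFreeNodes int
	ExpandBy         int
	MaxNodes         int
	Cooldown         time.Duration
}

// Observed holds what was read from the pool (hypothetical shape).
type Observed struct {
	AutoScale          bool
	Count              int
	PoolMax            int
	SinceLastExpansion time.Duration // negative means "never expanded"
}

// minutes converts a cooldownMinutes-style integer into a Duration.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// Decide returns the phase, the pool max that should be in effect afterwards,
// and how long to wait before retrying (non-zero only in Cooldown).
func Decide(p Policy, o Observed) (Phase, int, time.Duration) {
	if !o.AutoScale {
		return Blocked, o.PoolMax, 0 // raising the max would do nothing
	}
	if o.PoolMax-o.Count > p.TriggerFreeNodes {
		return Stable, o.PoolMax, 0 // enough headroom
	}
	if o.SinceLastExpansion >= 0 && o.SinceLastExpansion < p.Cooldown {
		return Cooldown, o.PoolMax, p.Cooldown - o.SinceLastExpansion
	}
	newMax := o.PoolMax + p.ExpandBy
	if newMax > p.MaxNodes {
		newMax = p.MaxNodes // clamp to the hard ceiling
	}
	if newMax <= o.PoolMax {
		return AtCeiling, o.PoolMax, 0
	}
	return Expanded, newMax, 0
}

func main() {
	p := Policy{TriggerFreeNodes: 3, ExpandBy: 5, MaxNodes: 20, Cooldown: minutes(15)}
	o := Observed{AutoScale: true, Count: 9, PoolMax: 10, SinceLastExpansion: -1}
	phase, newMax, _ := Decide(p, o)
	fmt.Println(phase, newMax)
}
```

Keeping the decision free of API calls is one way to make each branch easy to unit-test; the real reconciler may be structured differently.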
- A running DOKS cluster with autoscaling enabled on the target node pool
- A DigitalOcean API token with read/write Kubernetes scope
- `kubectl` pointed at the cluster
- Go 1.22+, Docker, and `make` for building
- (optional) `kubeseal` if you want to commit the token as a SealedSecret
Build and push the image, then install the CRDs:

```sh
make docker-build docker-push IMG=<your-registry>/doks-capacity-operator:0.1.0
make install
```

The controller reads the DigitalOcean API token from a secret named `doks-do-token` (key `token`) in the `doks-system` namespace:

```sh
kubectl create namespace doks-system
kubectl create secret generic doks-do-token \
  --namespace doks-system \
  --from-literal=token="<YOUR_DO_API_TOKEN>"
```

💡 GitOps tip: the repo ships a SealedSecret stub at `release/`. Encrypt your token with `kubeseal` and commit the result instead of the plaintext secret above; see the comment block in `release/release.yaml` for the exact command.

Deploy the controller:

```sh
make deploy IMG=<your-registry>/doks-capacity-operator:0.1.0
```

…or apply the bundled all-in-one manifest (remember to set your real image and `imagePullSecrets` first):
```sh
kubectl apply -f release/release.yaml
```

Create a `DoksCapacityOperator` resource, apply it, and check its status:

```yaml
# config/samples/dokscapacityoperator.yaml
apiVersion: platform.mahy.love/v1alpha1
kind: DoksCapacityOperator
metadata:
  name: dokscapacityoperator-sample
spec:
  # No clusterID: the operator discovers it from the pool.
  poolID: "your-pool-uuid"     # or use poolName: "chat-pool"
  triggerFreeNodes: 3          # expand when only 3 nodes of headroom remain
  expandBy: 5                  # add 5 to the pool's autoscale max each time
  maxNodes: 20                 # HARD CEILING: never raise the pool max above 20
  cooldownMinutes: 15          # wait 15 min between expansions
```

```sh
kubectl apply -f config/samples/dokscapacityoperator.yaml
kubectl get oco
```

```text
NAME                          POOL                COUNT   MAX   CEILING   PHASE
dokscapacityoperator-sample   a1b2c3d4-...-pool   8       10    20        Stable
```
spec fields of a DoksCapacityOperator:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `clusterID` | string | No | none | DO cluster UUID. Leave empty to auto-discover from the pool. |
| `poolID` | string | one of these ² | none | DO node pool UUID. Preferred: globally unique. |
| `poolName` | string | one of these ² | none | Node pool name. Used only when `poolID` is empty. |
| `triggerFreeNodes` | int | Yes ¹ | `3` | Expand when `(poolMax − count) ≤` this. |
| `expandBy` | int | Yes ¹ | `5` | Nodes added to the pool's autoscale max per expansion. |
| `maxNodes` | int | Yes | none | Hard ceiling. Pool max is raised up to but never above this. Range 1 to 100. |
| `cooldownMinutes` | int | Yes ¹ | `15` | Minimum minutes between two expansions. |

¹ Required by the schema, but has a default, so you can omit it.
² At least one of `poolID` / `poolName` must be set (enforced by the controller).
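The two constraints that the table calls out (at least one of `poolID` / `poolName`, and `maxNodes` between 1 and 100) could be checked roughly as follows. This is a sketch under assumptions: the `Spec` struct below is a hypothetical mirror of the CR fields, and `Validate` is an illustrative name. The real types live in `api/v1alpha1` and may enforce these rules differently, for example through schema validation.

```go
package main

import (
	"errors"
	"fmt"
)

// Spec is a hypothetical mirror of the CR's spec fields for this example.
type Spec struct {
	ClusterID        string
	PoolID           string
	PoolName         string
	TriggerFreeNodes int
	ExpandBy         int
	MaxNodes         int
	CooldownMinutes  int
}

// Validate checks the two rules stated in the field table.
func Validate(s Spec) error {
	if s.PoolID == "" && s.PoolName == "" {
		return errors.New("one of poolID or poolName must be set")
	}
	if s.MaxNodes < 1 || s.MaxNodes > 100 {
		return fmt.Errorf("maxNodes must be between 1 and 100, got %d", s.MaxNodes)
	}
	return nil
}

func main() {
	ok := Spec{PoolID: "your-pool-uuid", TriggerFreeNodes: 3, ExpandBy: 5, MaxNodes: 20, CooldownMinutes: 15}
	fmt.Println(Validate(ok))                   // valid spec
	fmt.Println(Validate(Spec{MaxNodes: 20}))   // no pool reference
	fmt.Println(Validate(Spec{PoolName: "p"}))  // maxNodes missing
}
```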
| Phase | Meaning |
|---|---|
| `Stable` | Enough headroom; nothing to do. |
| `Cooldown` | Trigger hit, but waiting out the cooldown window. |
| `Expanded` | Just raised the pool's autoscale max. |
| `AtCeiling` | Pool max is at/above the hard ceiling; won't raise further. |
| `Blocked` | Pool autoscaling is disabled, so raising the max would be a no-op. |
| `Error` | An API call or pool resolution failed (see `.status.message`). |
Every reconcile also sets a standard Ready condition you can wait on in scripts:
```sh
kubectl wait oco/dokscapacityoperator-sample \
  --for=condition=Ready --timeout=120s
```

Run the controller against your current kubeconfig without deploying anything:

```sh
export DIGITALOCEAN_TOKEN="<YOUR_DO_API_TOKEN>"
make install   # CRDs into the cluster
make run       # runs the manager locally
```

Other handy targets:

```sh
make manifests   # regenerate CRD + RBAC from kubebuilder markers
make generate    # regenerate deepcopy code
make test        # run unit tests (DO client is faked behind an interface)
make build       # compile the binary into ./bin
```

The DigitalOcean surface the controller needs is captured in a small `DOClient` interface, so the reconciler is fully unit-testable with a fake: no live API calls in tests.
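To illustrate the interface-plus-fake idea, here is a rough sketch. Only the names `DOClient`, `ResolvePool`, and `SetMaxNodes` appear in this README; the method signatures, the `Pool` struct, the `fakeDO` type, and the `bumpMax` and `demo` helpers are assumptions made for the example and may not match the repository.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Pool is a hypothetical summary of what the reconciler needs from a node pool.
type Pool struct {
	ClusterID string
	PoolID    string
	AutoScale bool
	Count     int
	MinNodes  int
	MaxNodes  int
}

// DOClient: method names come from the README, signatures are assumed.
type DOClient interface {
	ResolvePool(ctx context.Context, clusterID, poolID, poolName string) (Pool, error)
	SetMaxNodes(ctx context.Context, clusterID, poolID string, maxNodes int) error
}

// fakeDO is an in-memory stand-in that records every write.
type fakeDO struct {
	pool     Pool
	setCalls []int
}

func (f *fakeDO) ResolvePool(_ context.Context, _, poolID, _ string) (Pool, error) {
	if poolID != f.pool.PoolID {
		return Pool{}, errors.New("pool not found")
	}
	return f.pool, nil
}

func (f *fakeDO) SetMaxNodes(_ context.Context, _, _ string, maxNodes int) error {
	f.setCalls = append(f.setCalls, maxNodes)
	f.pool.MaxNodes = maxNodes
	return nil
}

// bumpMax raises the pool max by step, clamped to ceiling, through any DOClient.
func bumpMax(ctx context.Context, c DOClient, poolID string, step, ceiling int) (int, error) {
	p, err := c.ResolvePool(ctx, "", poolID, "")
	if err != nil {
		return 0, err
	}
	newMax := p.MaxNodes + step
	if newMax > ceiling {
		newMax = ceiling
	}
	if newMax <= p.MaxNodes {
		return p.MaxNodes, nil // already at the ceiling, no write
	}
	if err := c.SetMaxNodes(ctx, p.ClusterID, p.PoolID, newMax); err != nil {
		return 0, err
	}
	return newMax, nil
}

// demo wires bumpMax to the fake and reports the new max and the number of writes.
func demo() (int, int, error) {
	fake := &fakeDO{pool: Pool{ClusterID: "c-1", PoolID: "p-1", AutoScale: true, Count: 9, MinNodes: 2, MaxNodes: 10}}
	newMax, err := bumpMax(context.Background(), fake, "p-1", 5, 20)
	return newMax, len(fake.setCalls), err
}

func main() {
	newMax, calls, err := demo()
	fmt.Println(newMax, calls, err)
}
```

Because `bumpMax` depends only on the interface, a test can assert both the returned value and the exact writes the fake recorded.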
```text
doks-capacity-operator/
├── api/v1alpha1/          # CRD types (DoksCapacityOperatorSpec/Status)
├── cmd/                   # manager entrypoint (main.go)
├── internal/
│   ├── controller/        # the reconcile loop
│   └── do/                # thin godo wrapper: ResolvePool, SetMaxNodes
├── config/                # kustomize bases (CRD, RBAC, manager, samples)
├── release/               # all-in-one release.yaml + SealedSecret stub
├── test/                  # e2e + integration tests
├── Dockerfile
└── Makefile
```
- It manages the ceiling, not the nodes. Actual scaling is still DigitalOcean's autoscaler. If pod scheduling is blocked by something other than node count (taints, resource requests, quotas), raising the max won't help.
- Gradual, not instant. With `expandBy: 5` and `cooldownMinutes: 15`, a sustained spike still climbs over time. That's the point, but size `expandBy` and the cooldown for your real burst profile.
- The hard ceiling is your safety net: set it deliberately. This operator only protects you from runaway speed, not from a high ceiling you chose.
- One pool per CR. Create one `DoksCapacityOperator` resource per node pool you want managed.
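To size `expandBy` and `cooldownMinutes` against a burst, it can help to estimate how long the ramp takes. The sketch below computes a lower bound only: it assumes every expansion fires the moment the cooldown expires and ignores the 60-second reconcile interval and node provisioning time. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// expansionsToCeiling returns how many expansions it takes to go from
// startMax to ceiling in steps of expandBy (the last step may be partial).
func expansionsToCeiling(startMax, ceiling, expandBy int) int {
	if expandBy <= 0 || ceiling <= startMax {
		return 0
	}
	return (ceiling - startMax + expandBy - 1) / expandBy // ceiling division
}

// minRampTime is the shortest possible time between the first expansion and
// the one that reaches the ceiling: n expansions are separated by n-1 cooldowns.
func minRampTime(startMax, ceiling, expandBy, cooldownMinutes int) time.Duration {
	n := expansionsToCeiling(startMax, ceiling, expandBy)
	if n <= 1 {
		return 0
	}
	return time.Duration(n-1) * time.Duration(cooldownMinutes) * time.Minute
}

func main() {
	// Sample policy from this README: max 10, ceiling 20, expandBy 5, cooldown 15 min.
	fmt.Println(expansionsToCeiling(10, 20, 5), minRampTime(10, 20, 5, 15))
	// A wider ramp: max 10 up to the 100-node ceiling.
	fmt.Println(expansionsToCeiling(10, 100, 5), minRampTime(10, 100, 5, 15))
}
```

With the sample policy, the pool max goes 10 → 15 → 20 in two expansions at least 15 minutes apart; climbing from 10 to 100 in steps of 5 takes 18 expansions and at least 4 hours 15 minutes.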
The bundled manifest ships with two placeholders you must set for your environment:
- Image: it references `image: doks-capacity-operator:0.1.0` (no registry). Point it at your own registry, e.g. `<your-registry>/doks-capacity-operator:0.1.0`.
- Pull secret: it references `imagePullSecrets: [name: secret]`. Either create a pull secret with that name in `doks-system` or update the field to match yours (drop it entirely if your image is public).

💡 Prefer to regenerate rather than hand-edit? `make manifests` rebuilds the CRD and RBAC straight from the `+kubebuilder:rbac` markers in the controller, so the manifest can never drift from the code.
Issues and PRs welcome. Run `make test` and `make manifests` before opening a PR so generated artifacts stay in sync.
Apache License 2.0. See the headers in each source file.
Built with ❤️ and kubebuilder · github.com/tabed23/doks-capacity-operator