feat(scheduler): tolerance-weighted CPU/memory aware VM scheduling #747

emanuelebosetti wants to merge 6 commits into
Conversation
Force-pushed from 80253a5 to 35fea70.

    }

    requestedMemory := uint64(ptr.Deref(machine.Spec.MemoryMiB, 0)) * 1024 * 1024
    requestedCPUs := int(ptr.Deref(machine.Spec.NumSockets, 0)) * int(ptr.Deref(machine.Spec.NumCores, 0))

I wouldn't cast to int here, just use the dereferenced types of Num{Sockets,Cores}.
Done — requestedCPUs is now int32
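For context, a tiny standalone illustration of why the casts can go (assuming NumSockets/NumCores are *int32, which the int32 result implies; it uses k8s.io/utils/ptr, the same helper as the diff above):

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	// Illustration only: with *int32 inputs, ptr.Deref returns int32,
	// so the product stays int32 without an explicit cast.
	numSockets := ptr.To(int32(2))
	numCores := ptr.To(int32(4))
	requestedCPUs := ptr.Deref(numSockets, 0) * ptr.Deref(numCores, 0)
	fmt.Printf("%d (%T)\n", requestedCPUs, requestedCPUs) // 8 (int32)
}
```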

    // nodes closer to saturation score disproportionately less than empty nodes.
    saturationExponent = 2.0
    )

How did you arrive at these constants? They look pretty aggressive. I know that I said I wanted to punish overcommit when physical resources are available but maybe not that hard :)
These were picked against the heterogeneous cluster test (24 VMs across mixed 64/32-core nodes): legacy lands at 4.4x CPU spread, current constants bring it to 1.43x. What shape did you have in mind?
The way I read this algo, these consts produce an exponential decay curve (decreases aggressively early on and then tapers off). True, this does produce a very nice and evened-out allocation when the nodes are under-booked. However, as load increases, the contribution of any additional load has a weaker effect on the score.
I was thinking of something closer in shape to the first quarter of a cosine curve where the contribution of any additional load has a stronger effect on the score as load increases. This would mean a less even allocation when under-booked but any over-booked machines are very sharply scored down.
If anything, I'd flip these values.
I extended the math analysis with Claude to also evaluate a logarithmic family, because once you've moved off r² there's a practical issue worth weighing before committing to the cosine shape. Sharing the full picture below.
How raw is computed today
In resourceScore, before applying any curve, we compute a single "free capacity ratio" per resource:
`raw := (freePhys + overbookDiscount*freeOver) / phys`, with `freePhys = max(0, phys − used)` and `freeOver = max(0, alloc − max(used, phys))`. The split is what gives the physical zone its full point per unit and the overcommit zone only half a point per unit. A worked example for a 32-core node with `cpuAdjustment: 300` (so alloc = 96):
| used | freePhys | freeOver | raw |
|---|---|---|---|
| 0 (empty) | 32 | 64 | 2.0 |
| 16 | 16 | 64 | 1.5 |
| 32 (physical limit hit) | 0 | 64 | 1.0 |
| 64 (deep into overcommit) | 0 | 32 | 0.5 |
| 96 (full) | 0 | 0 | 0.0 |
raw can exceed 1. Without overcommit (adjustment = 100) we have alloc = phys, so freeOver = 0 and raw ∈ [0, 1]. The moment overcommit is enabled (adjustment > 100) the empty-node raw is 1 + 0.5·(adjustment/100 − 1), which is 1.5 at 2:1, 2.0 at 3:1, 3.0 at 5:1. The current r² curve doesn't care since it's monotonically increasing for any non-negative r, but it's a constraint to keep in mind when picking the replacement.
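To make the mechanics concrete, here is a small self-contained sketch of that computation (my own names, not the PR's exact code) that reproduces the table above:

```go
package main

import (
	"fmt"
	"math"
)

const overbookDiscount = 0.5

// rawFreeRatio sketches the free-capacity ratio described above: full credit
// for free physical capacity, half credit for free capacity in the overcommit zone.
func rawFreeRatio(used, phys, alloc float64) float64 {
	freePhys := math.Max(0, phys-used)
	freeOver := math.Max(0, alloc-math.Max(used, phys))
	return (freePhys + overbookDiscount*freeOver) / phys
}

func main() {
	// 32-core node with cpuAdjustment: 300, i.e. alloc = 96, matching the worked example above.
	for _, used := range []float64{0, 16, 32, 64, 96} {
		fmt.Printf("used=%2.0f  raw=%.2f\n", used, rawFreeRatio(used, 32, 96))
	}
}
```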
Curve candidates side by side
*Curve values from raw 2.0 (empty node) down to raw 0 (full node):*
| raw | r² (current) | sin(r·π/2) | log₂(1+r) | log(1+9r)/log(10) |
|---|---|---|---|---|
| 0.0 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.1 | 0.010 | 0.156 | 0.138 | 0.279 |
| 0.3 | 0.090 | 0.454 | 0.379 | 0.568 |
| 0.5 | 0.250 | 0.707 | 0.585 | 0.740 |
| 0.7 | 0.490 | 0.891 | 0.766 | 0.863 |
| 0.9 | 0.810 | 0.988 | 0.926 | 0.959 |
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1.5 | 2.250 | 0.707 | 1.322 | 1.161 |
| 2.0 | 4.000 | 0.000 | 1.585 | 1.279 |
The sin rows past r=1 are the issue. The sine peaks at r=1 and folds back, so a fully empty overcommit-enabled node ends up scoring worse than a half-loaded one. To use the sine we'd need to either clamp raw at 1 (which makes two empty nodes with different overcommit settings indistinguishable) or re-normalize the denominator so raw stays in [0,1] by construction. Both work, both are small changes, but they're extra cognitive load for anyone reading the algorithm.
The logarithm avoids the issue: log₂(1+r) and log(1+9r)/log(10) are monotonic for any non-negative raw — the curve just keeps climbing into the overcommit zone, no special casing needed. k = 9 controls how aggressive the curvature is.
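A quick self-contained way to reproduce the numbers above (the two log columns are the same log(1+k·r)/log(1+k) family, with k=1 giving log₂(1+r) and k=9 giving log(1+9r)/log(10)):

```go
package main

import (
	"fmt"
	"math"
)

// Candidate shaping curves from the table above. square is what the PR
// currently implements; the others are the alternatives under discussion.
func square(r float64) float64     { return r * r }
func quarterSin(r float64) float64 { return math.Sin(r * math.Pi / 2) }

// logCurve returns log(1+k*r)/log(1+k); k=1 is log2(1+r), k=9 is log(1+9r)/log(10).
func logCurve(k float64) func(float64) float64 {
	return func(r float64) float64 { return math.Log(1+k*r) / math.Log(1+k) }
}

func main() {
	log2, logK9 := logCurve(1), logCurve(9)
	for _, r := range []float64{0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.5, 2.0} {
		fmt.Printf("r=%.1f  r²=%.3f  sin=%.3f  log₂=%.3f  logk9=%.3f\n",
			r, square(r), quarterSin(r), log2(r), logK9(r))
	}
}
```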
Slope behaviour (the "discrimination" signal)
Slope at the two ends of the physical range tells you how aggressively the scheduler differentiates near-empty nodes vs. near-saturated nodes. Higher ratio = closer to the "sharply scored down when over-booked, less even when under-booked" property you described:
| Curve | slope at r=0 | slope at r=1 | ratio | shape |
|---|---|---|---|---|
| r² | 0.00 | 2.00 | inverted | concave up — wrong direction |
| sin(r·π/2) | 1.57 | 0.00 | ∞ | concave down, flat plateau at r=1 |
| log₂(1+r) | 1.44 | 0.72 | 2:1 | concave down, mild plateau |
| log(1+9r)/log(10) | 3.91 | 0.39 | 10:1 | concave down, sharper plateau |
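For transparency, the slope columns are just the derivatives of each candidate evaluated at r = 0 and r = 1:

$$
\frac{d}{dr}\,r^2 = 2r,\qquad
\frac{d}{dr}\,\sin\!\Big(\frac{\pi r}{2}\Big) = \frac{\pi}{2}\cos\!\Big(\frac{\pi r}{2}\Big),\qquad
\frac{d}{dr}\,\log_2(1+r) = \frac{1}{(1+r)\ln 2},\qquad
\frac{d}{dr}\,\frac{\ln(1+9r)}{\ln 10} = \frac{9}{(1+9r)\ln 10}
$$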
The cosine quarter is the most extreme implementation of the property you described (literal "first quarter of a cosine"). The logarithm with k=9 is close in spirit (10:1 ratio, plateau at the top) and removes the overcommit edge case entirely — at the cost of a less symmetric shape.
Two viable directions
- Cosine quarter, with the denominator re-normalized to keep `raw ∈ [0,1]`. Most faithful to your original description, requires a small touch-up in `resourceScore` to handle overcommit.
- Logarithm, `log(1 + k·r) / log(1 + k)` with `k=9` as default. Monotonic for any `raw`, no edge cases, tunable steepness via a single constant — but the shape diverges from "cosine".
Which way would you go?
I like where this is going but it needs some work. Please let's only have one scheduling algo.
Force-pushed from d39c411 to 70ad5fb.
With the defaults of …

Unified into a single algorithm — …

    byMemory := make(sortByAvailableMemory, len(allowedNodes))
    memoryAdjustment := schedulerHints.GetMemoryAdjustment()
    cpuAdjustment := schedulerHints.GetCPUAdjustment()
    cpuEnabled := cpuAdjustment > 0

I don't think cpu needs any special treatment. The algo should just do the right thing for both memory and cpu and work the same way for both resources.
In the ideal world, right after reading the counts and adjustments, the algo should be generic. This will make it easy to add more kinds of resources to SchedulerHints, e.g. storage.
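To make that concrete, here is one possible shape, purely illustrative and under the assumption that each resource can be reduced to the same used/phys/alloc/weight tuple before scoring (the names and the weighted-average combination are mine, not the PR's):

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative sketch only (not the PR's code): reduce every resource to the
// same tuple up front, then hard-fit check and score them in one generic loop,
// so adding another resource kind (e.g. storage) is just one more slice element.
type resource struct {
	name   string
	used   float64 // current usage on the node
	phys   float64 // physical capacity
	alloc  float64 // allocatable capacity after the overcommit adjustment
	weight float64 // (100 - tolerance) / 100
}

const (
	overbookDiscount   = 0.5
	saturationExponent = 2.0
)

// shapedScore mirrors the curve described in this PR: free physical capacity
// counts fully, overcommit headroom is discounted, then the ratio is shaped.
func shapedScore(used, phys, alloc float64) float64 {
	freePhys := math.Max(0, phys-used)
	freeOver := math.Max(0, alloc-math.Max(used, phys))
	raw := (freePhys + overbookDiscount*freeOver) / phys
	return math.Pow(raw, saturationExponent)
}

// nodeScore combines per-resource scores; the weighted average here is an
// assumption, the PR may combine the partial scores differently.
func nodeScore(resources []resource) (score float64, fits bool) {
	var total, weightSum float64
	for _, r := range resources {
		if r.used >= r.alloc { // same hard-fit constraint for every resource
			return 0, false
		}
		total += r.weight * shapedScore(r.used, r.phys, r.alloc)
		weightSum += r.weight
	}
	if weightSum == 0 {
		return 0, true
	}
	return total / weightSum, true
}

func main() {
	score, fits := nodeScore([]resource{
		{name: "memory", used: 48, phys: 128, alloc: 128, weight: 1.0}, // tolerance 0
		{name: "cpu", used: 20, phys: 32, alloc: 64, weight: 0.0},      // tolerance 100
	})
	fmt.Printf("fits=%v score=%.3f\n", fits, score)
}
```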

    // primary: curve-based score (higher wins); tie-break: raw memory
    // headroom (kicks in when many nodes saturate under pending inflation).
    if candidates[i].scoreTotal != candidates[j].scoreTotal {
        return candidates[i].scoreTotal > candidates[j].scoreTotal

You've inverted the comparison (SliceStable takes a 'less' function but you are returning the greater value). This needs to be documented better since it's easy to miss, although I would prefer for this algo to work properly given the correct sorting direction.
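For illustration, one way to keep the intent readable is to give the comparator a name that says it sorts descending (or implement sort.Interface and wrap it in sort.Reverse); a minimal sketch:

```go
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	node       string
	scoreTotal float64
}

func main() {
	candidates := []candidate{{"n1", 0.41}, {"n2", 0.73}, {"n3", 0.73}, {"n4", 0.12}}

	// sort.SliceStable takes a "less" function; returning ">" yields a
	// descending order. Naming that explicitly keeps the inversion obvious.
	higherScoreFirst := func(i, j int) bool {
		return candidates[i].scoreTotal > candidates[j].scoreTotal
	}
	sort.SliceStable(candidates, higherScoreFirst)

	fmt.Println("best candidate:", candidates[0].node) // n2 (stable sort keeps tie order)
}
```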
Force-pushed from 4b68fd4 to a1d0e00.
Address upstream issue ionos-cloud#724: on heterogeneous Proxmox clusters with varying CPU core counts the round-robin scheduler causes severe CPU overcommit on smaller nodes.

When schedulerHints.cpuAdjustment is non-zero the scheduler switches from round-robin to a tolerance-weighted scoring that considers both CPU and memory headroom. Per-resource scores are shaped by a non-linear curve so already-saturated hosts are heavily penalized, and free capacity in the overcommit range is discounted versus the physical range. Operators tune preference via memoryTolerance and cpuTolerance, both inverted into weights internally (weight = (100 - tolerance) / 100). CPU and memory remain hard-fit constraints. Existing NodeLocations entries inflate a node's effective usage to avoid clumping when VMs are cloned in rapid succession (the Proxmox API takes seconds to reflect just-cloned VMs).

API additions to SchedulerHints:

- cpuAdjustment (existing, default 0): overcommit factor; non-zero enables CPU-aware scheduling.
- memoryTolerance (new, default 0): lower = more protective of memory.
- cpuTolerance (new, default 100): higher = more acceptable to saturate CPU.

These defaults are aligned with the memory-first priority stated by the maintainers in the issue thread (ionos-cloud#724) for HA workloads, and deliberately differ from the values proposed in #issuecomment-4259180950 — that proposal was inverted from the priority and would have caused the default CPU-aware mode to ignore memory headroom. The default cpuAdjustment: 0 reproduces today's memory-only round-robin, preserving full backward compatibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert all v1alpha1 SchedulerHints additions: proxmoxcluster_types.go is now byte-identical to upstream. The three v1alpha2-only fields (cpuAdjustment, memoryTolerance, cpuTolerance) are dropped on v1alpha2 → v1alpha1 down-conversion and restored from the conversion-data annotation via a shared restoreSchedulerHints helper used by both ProxmoxCluster and ProxmoxClusterTemplate ConvertTo paths.

Operators writing CRs with apiVersion: v1alpha1 receive a kubectl "unknown field" warning if they try to set the new fields, and the values are silently pruned by the apiserver before storage. This is the expected behaviour for fields that live only in the storage version (v1alpha2); the v1alpha1 schema stays untouched.

Also trim the memoryTolerance/cpuTolerance godoc to user-facing semantics only (no internal weighting formula, no scoring details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: keep the dereferenced int32 type of NumSockets and NumCores instead of casting to int at the multiplication site. Propagate int32 through selectNodeByToleranceScoring and buildInsufficientError; cast at the nodeInfo boundary where the resourceClient interface still uses int (Proxmox API type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grammar fix ("allows to allocate" -> "allows allocating") and mirror
the MemoryAdjustment example for sub-100 values (95 reserves capacity
for the host).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: collapse the round-robin and tolerance-weighted paths into a single algorithm. selectNode now always runs the same curve-based scoring; when cpuAdjustment is 0 CPU is not fetched, not enforced as a hard-fit constraint, and does not contribute to the score, so the algorithm degrades to memory-only curve scoring without a separate code path.

- Remove selectNodeByRoundRobin and the sortByReplicas/ScheduledVMs helpers; selectNode owns the entire scheduling logic.
- Move pending inflation from the hard-fit check into the score calculation only, so up-to-date Proxmox availability data is never double-counted; the score still uses pending inflation as a soft anti-clumping signal.
- Restore the original code comments that lived in selectNode upstream (convert to bytes, count placements per node, log construction).
- Update the cpuAdjustment godoc to describe behaviour without referring to a separate algorithm or to "round-robin".

The legacy round-robin tests are updated to reflect the new behaviour. Cumulative placement tallies and InsufficientMemoryError trigger points are unchanged; only the order within a pass differs because the new algo picks by memory headroom score rather than by replica count first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 1290a5f to 63698a4.
Issue #, if available: #724
Description of changes:
Map memory-first priority into a tolerance-weighted CPU/memory scheduler so heterogeneous Proxmox clusters stop concentrating VMs on smaller-CPU nodes by accident.
How it works
When `schedulerHints.cpuAdjustment` is set to a non-zero value, the scheduler switches from round-robin to a tolerance-weighted scoring that considers both CPU and memory headroom on every candidate node:

- Per-resource scores are shaped by a non-linear curve (`saturationExponent = 2.0`) so already-saturated hosts are heavily penalized.
- Free capacity in the overcommit range is discounted versus the physical range (`overbookDiscount = 0.5`).
- Operators tune preference via `memoryTolerance` and `cpuTolerance`, inverted into weights via `weight = (100 − tolerance) / 100`. High tolerance = operator OK with that resource saturating = less influence on the decision.
- CPU and memory remain hard-fit constraints; nodes that cannot fit the request are rejected with `InsufficientCPUError`/`InsufficientMemoryError`.
- Existing entries in `status.NodeLocations` inflate a node's effective usage, so VMs cloned in rapid succession (the Proxmox API takes seconds to reflect just-cloned VMs) don't all converge on the same target.
API changes (additive, backwards-compatible)

| Field | Type | Default |
|---|---|---|
| `cpuAdjustment` (existing) | `*int64` | `0` |
| `memoryTolerance` (new) | `*int64` (0..100) | `0` |
| `cpuTolerance` (new) | `*int64` (0..100) | `100` |

`cpuAdjustment: 0` (default) reproduces today's memory-only round-robin — full backward compatibility.

A note on the chosen defaults
The defaults ship aligned with the memory-first priority discussed in the issue thread for HA workloads: when CPU-aware scheduling is enabled, memory is fully protected and CPU acts only as a hard-fit constraint until the operator opts in to spreading.
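Spelled out with the weight formula from above (a tiny check, not code from the PR), the shipped defaults work out to:

```go
package main

import "fmt"

// weight inverts a tolerance per the PR's formula: weight = (100 - tolerance) / 100.
func weight(tolerance int64) float64 { return float64(100-tolerance) / 100 }

func main() {
	// Shipped defaults: memoryTolerance = 0, cpuTolerance = 100.
	fmt.Printf("memoryWeight: %.1f\n", weight(0))   // 1.0: memory fully drives the score
	fmt.Printf("cpuWeight:    %.1f\n", weight(100)) // 0.0: CPU acts only as a hard-fit constraint
}
```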
These values deliberately differ from those I proposed earlier in the thread (`memoryTolerance: 100`, `cpuTolerance: 0`). Re-checking that proposal end-to-end, those values are inverted from the priority discussed earlier and would have caused the default CPU-aware mode to ignore memory headroom. Happy to revisit if you'd rather keep my original proposed defaults — the change is two constants and a couple of test fixtures.

Testing performed:
- `make lint test verify` pass locally (envtest 1.33.0). The `pkg/convert/sentinel.go` `goconst`/`nolintlint` warnings are pre-existing on `main` and unrelated to this PR.
- Test coverage of `internal/service/scheduler` is 87.1%.
- Ran `MachineDeployment` rollouts on a 5-machine workload cluster with `cpuAdjustment: 200` and confirmed `selectNodeByToleranceScoring` runs on every new VM, applies the documented defaults (`memoryTolerance=0`, `cpuTolerance=100` → `cpuWeight=0`, memory drives), and the anti-clumping `NodeLocations` inflation correctly reduces a node's score after each placement (observed `scoreMem` drop from `0.807` to `0.762` between two consecutive decisions on the same node within ~1s).

Attribution
Developed with assistance from Claude (Opus 4.7), as per the AI attribution practice followed in previous PRs.