feat(scheduler): tolerance-weighted CPU/memory aware VM scheduling #747

emanuelebosetti wants to merge 6 commits into
Conversation
Force-pushed from 80253a5 to 35fea70.

    }

    requestedMemory := uint64(ptr.Deref(machine.Spec.MemoryMiB, 0)) * 1024 * 1024
    requestedCPUs := int(ptr.Deref(machine.Spec.NumSockets, 0)) * int(ptr.Deref(machine.Spec.NumCores, 0))

I wouldn't cast to int here, just use the dereferenced types of Num{Sockets,Cores}.
Done — requestedCPUs is now int32
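For context, a tiny standalone illustration of why the casts can go (assuming NumSockets/NumCores are *int32, which the int32 result implies; it uses k8s.io/utils/ptr, the same helper as the diff above):

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	// Illustration only: with *int32 inputs, ptr.Deref returns int32,
	// so the product stays int32 without an explicit cast.
	numSockets := ptr.To(int32(2))
	numCores := ptr.To(int32(4))
	requestedCPUs := ptr.Deref(numSockets, 0) * ptr.Deref(numCores, 0)
	fmt.Printf("%d (%T)\n", requestedCPUs, requestedCPUs) // 8 (int32)
}
```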

    // nodes closer to saturation score disproportionately less than empty nodes.
    saturationExponent = 2.0
    )

How did you arrive at these constants? They look pretty aggressive. I know that I said I wanted to punish overcommit when physical resources are available but maybe not that hard :)
These were picked against the heterogeneous cluster test (24 VMs across mixed 64/32-core nodes): legacy lands at 4.4x CPU spread, current constants bring it to 1.43x. What shape did you have in mind?
The way I read this algo, these consts produce an exponential decay curve (decreases aggressively early on and then tapers off). True, this does produce a very nice and evened-out allocation when the nodes are under-booked. However, as load increases, the contribution of any additional load has a weaker effect on the score.
I was thinking of something closer in shape to the first quarter of a cosine curve where the contribution of any additional load has a stronger effect on the score as load increases. This would mean a less even allocation when under-booked but any over-booked machines are very sharply scored down.
If anything, I'd flip these values.
I extended the math analysis with Claude to also evaluate a logarithmic family, because once you've moved off r² there's a practical issue worth weighing before committing to the cosine shape. Sharing the full picture below.
How raw is computed today
In resourceScore, before applying any curve, we compute a single "free capacity ratio" per resource:
`raw := (freePhys + overbookDiscount*freeOver) / phys`, with `freePhys = max(0, phys − used)` and `freeOver = max(0, alloc − max(used, phys))`. The split is what gives the physical zone its full point per unit and the overcommit zone only half a point per unit. A worked example for a 32-core node with `cpuAdjustment: 300` (so alloc = 96):
| used | freePhys | freeOver | raw |
|---|---|---|---|
| 0 (empty) | 32 | 64 | 2.0 |
| 16 | 16 | 64 | 1.5 |
| 32 (physical limit hit) | 0 | 64 | 1.0 |
| 64 (deep into overcommit) | 0 | 32 | 0.5 |
| 96 (full) | 0 | 0 | 0.0 |
raw can exceed 1. Without overcommit (adjustment = 100) we have alloc = phys, so freeOver = 0 and raw ∈ [0, 1]. The moment overcommit is enabled (adjustment > 100) the empty-node raw is 1 + 0.5·(adjustment/100 − 1), which is 1.5 at 2:1, 2.0 at 3:1, 3.0 at 5:1. The current r² curve doesn't care since it's monotonically increasing for any non-negative r, but it's a constraint to keep in mind when picking the replacement.
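To make the mechanics concrete, here is a small self-contained sketch of that computation (my own names, not the PR's exact code) that reproduces the table above:

```go
package main

import (
	"fmt"
	"math"
)

const overbookDiscount = 0.5

// rawFreeRatio sketches the free-capacity ratio described above: full credit
// for free physical capacity, half credit for free capacity in the overcommit zone.
func rawFreeRatio(used, phys, alloc float64) float64 {
	freePhys := math.Max(0, phys-used)
	freeOver := math.Max(0, alloc-math.Max(used, phys))
	return (freePhys + overbookDiscount*freeOver) / phys
}

func main() {
	// 32-core node with cpuAdjustment: 300, i.e. alloc = 96, matching the worked example above.
	for _, used := range []float64{0, 16, 32, 64, 96} {
		fmt.Printf("used=%2.0f  raw=%.2f\n", used, rawFreeRatio(used, 32, 96))
	}
}
```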
Curve candidates side by side
*Curve values from raw 2.0 (empty node) down to raw 0 (full node):*
| raw | r² (current) | sin(r·π/2) | log₂(1+r) | log(1+9r)/log(10) |
|---|---|---|---|---|
| 0.0 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.1 | 0.010 | 0.156 | 0.138 | 0.279 |
| 0.3 | 0.090 | 0.454 | 0.379 | 0.568 |
| 0.5 | 0.250 | 0.707 | 0.585 | 0.740 |
| 0.7 | 0.490 | 0.891 | 0.766 | 0.863 |
| 0.9 | 0.810 | 0.988 | 0.926 | 0.959 |
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1.5 | 2.250 | 0.707 | 1.322 | 1.161 |
| 2.0 | 4.000 | 0.000 | 1.585 | 1.279 |
The sin rows past r=1 are the issue. The sine peaks at r=1 and folds back, so a fully empty overcommit-enabled node ends up scoring worse than a half-loaded one. To use the sine we'd need to either clamp raw at 1 (which makes two empty nodes with different overcommit settings indistinguishable) or re-normalize the denominator so raw stays in [0,1] by construction. Both work, both are small changes, but they're extra cognitive load for anyone reading the algorithm.
The logarithm avoids the issue: log₂(1+r) and log(1+9r)/log(10) are monotonic for any non-negative raw — the curve just keeps climbing into the overcommit zone, no special casing needed. k = 9 controls how aggressive the curvature is.
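A quick self-contained way to reproduce the numbers above (the two log columns are the same log(1+k·r)/log(1+k) family, with k=1 giving log₂(1+r) and k=9 giving log(1+9r)/log(10)):

```go
package main

import (
	"fmt"
	"math"
)

// Candidate shaping curves from the table above. square is what the PR
// currently implements; the others are the alternatives under discussion.
func square(r float64) float64     { return r * r }
func quarterSin(r float64) float64 { return math.Sin(r * math.Pi / 2) }

// logCurve returns log(1+k*r)/log(1+k); k=1 is log2(1+r), k=9 is log(1+9r)/log(10).
func logCurve(k float64) func(float64) float64 {
	return func(r float64) float64 { return math.Log(1+k*r) / math.Log(1+k) }
}

func main() {
	log2, logK9 := logCurve(1), logCurve(9)
	for _, r := range []float64{0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.5, 2.0} {
		fmt.Printf("r=%.1f  r²=%.3f  sin=%.3f  log₂=%.3f  logk9=%.3f\n",
			r, square(r), quarterSin(r), log2(r), logK9(r))
	}
}
```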
Slope behaviour (the "discrimination" signal)
Slope at the two ends of the physical range tells you how aggressively the scheduler differentiates near-empty nodes vs. near-saturated nodes. Higher ratio = closer to the "sharply scored down when over-booked, less even when under-booked" property you described:
| Curve | slope at r=0 | slope at r=1 | ratio | shape |
|---|---|---|---|---|
| r² | 0.00 | 2.00 | inverted | concave up — wrong direction |
| sin(r·π/2) | 1.57 | 0.00 | ∞ | concave down, flat plateau at r=1 |
| log₂(1+r) | 1.44 | 0.72 | 2:1 | concave down, mild plateau |
| log(1+9r)/log(10) | 3.91 | 0.39 | 10:1 | concave down, sharper plateau |
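For transparency, the slope columns are just the derivatives of each candidate evaluated at r = 0 and r = 1:

$$
\frac{d}{dr}\,r^2 = 2r,\qquad
\frac{d}{dr}\,\sin\!\Big(\frac{\pi r}{2}\Big) = \frac{\pi}{2}\cos\!\Big(\frac{\pi r}{2}\Big),\qquad
\frac{d}{dr}\,\log_2(1+r) = \frac{1}{(1+r)\ln 2},\qquad
\frac{d}{dr}\,\frac{\ln(1+9r)}{\ln 10} = \frac{9}{(1+9r)\ln 10}
$$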
The cosine quarter is the most extreme implementation of the property you described (literal "first quarter of a cosine"). The logarithm with k=9 is close in spirit (10:1 ratio, plateau at the top) and removes the overcommit edge case entirely — at the cost of a less symmetric shape.
Two viable directions
- Cosine quarter, with the denominator re-normalized to keep `raw ∈ [0,1]`. Most faithful to your original description, requires a small touch-up in `resourceScore` to handle overcommit.
- Logarithm, `log(1 + k·r) / log(1 + k)` with `k=9` as default. Monotonic for any `raw`, no edge cases, tunable steepness via a single constant — but the shape diverges from "cosine".
Which way would you go?
I like where this is going but it needs some work. Please let's only have one scheduling algo.
Force-pushed from d39c411 to 70ad5fb.
With the defaults of …

Unified into a single algorithm — …

    byMemory := make(sortByAvailableMemory, len(allowedNodes))
    memoryAdjustment := schedulerHints.GetMemoryAdjustment()
    cpuAdjustment := schedulerHints.GetCPUAdjustment()
    cpuEnabled := cpuAdjustment > 0

I don't think cpu needs any special treatment. The algo should just do the right thing for both memory and cpu and work the same way for both resources.
In the ideal world, right after reading the counts and adjustments, the algo should be generic. This will make it easy to add more kinds of resources to SchedulerHints, e.g. storage.
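To make that concrete, here is one possible shape, purely illustrative and under the assumption that each resource can be reduced to the same used/phys/alloc/weight tuple before scoring (the names and the weighted-average combination are mine, not the PR's):

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative sketch only (not the PR's code): reduce every resource to the
// same tuple up front, then hard-fit check and score them in one generic loop,
// so adding another resource kind (e.g. storage) is just one more slice element.
type resource struct {
	name   string
	used   float64 // current usage on the node
	phys   float64 // physical capacity
	alloc  float64 // allocatable capacity after the overcommit adjustment
	weight float64 // (100 - tolerance) / 100
}

const (
	overbookDiscount   = 0.5
	saturationExponent = 2.0
)

// shapedScore mirrors the curve described in this PR: free physical capacity
// counts fully, overcommit headroom is discounted, then the ratio is shaped.
func shapedScore(used, phys, alloc float64) float64 {
	freePhys := math.Max(0, phys-used)
	freeOver := math.Max(0, alloc-math.Max(used, phys))
	raw := (freePhys + overbookDiscount*freeOver) / phys
	return math.Pow(raw, saturationExponent)
}

// nodeScore combines per-resource scores; the weighted average here is an
// assumption, the PR may combine the partial scores differently.
func nodeScore(resources []resource) (score float64, fits bool) {
	var total, weightSum float64
	for _, r := range resources {
		if r.used >= r.alloc { // same hard-fit constraint for every resource
			return 0, false
		}
		total += r.weight * shapedScore(r.used, r.phys, r.alloc)
		weightSum += r.weight
	}
	if weightSum == 0 {
		return 0, true
	}
	return total / weightSum, true
}

func main() {
	score, fits := nodeScore([]resource{
		{name: "memory", used: 48, phys: 128, alloc: 128, weight: 1.0}, // tolerance 0
		{name: "cpu", used: 20, phys: 32, alloc: 64, weight: 0.0},      // tolerance 100
	})
	fmt.Printf("fits=%v score=%.3f\n", fits, score)
}
```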

    // primary: curve-based score (higher wins); tie-break: raw memory
    // headroom (kicks in when many nodes saturate under pending inflation).
    if candidates[i].scoreTotal != candidates[j].scoreTotal {
        return candidates[i].scoreTotal > candidates[j].scoreTotal

You've inverted the comparison (SliceStable takes a 'less' function but you are returning the greater value). This needs to be documented better since it's easy to miss, although I would prefer for this algo to work properly given the correct sorting direction.
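For illustration, one way to keep the intent readable is to give the comparator a name that says it sorts descending (or implement sort.Interface and wrap it in sort.Reverse); a minimal sketch:

```go
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	node       string
	scoreTotal float64
}

func main() {
	candidates := []candidate{{"n1", 0.41}, {"n2", 0.73}, {"n3", 0.73}, {"n4", 0.12}}

	// sort.SliceStable takes a "less" function; returning ">" yields a
	// descending order. Naming that explicitly keeps the inversion obvious.
	higherScoreFirst := func(i, j int) bool {
		return candidates[i].scoreTotal > candidates[j].scoreTotal
	}
	sort.SliceStable(candidates, higherScoreFirst)

	fmt.Println("best candidate:", candidates[0].node) // n2 (stable sort keeps tie order)
}
```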
Force-pushed from 4b68fd4 to a1d0e00.
Address upstream issue ionos-cloud#724: on heterogeneous Proxmox clusters with varying CPU core counts the round-robin scheduler causes severe CPU overcommit on smaller nodes.

When schedulerHints.cpuAdjustment is non-zero the scheduler switches from round-robin to a tolerance-weighted scoring that considers both CPU and memory headroom. Per-resource scores are shaped by a non-linear curve so already-saturated hosts are heavily penalized, and free capacity in the overcommit range is discounted versus the physical range. Operators tune preference via memoryTolerance and cpuTolerance, both inverted into weights internally (weight = (100 - tolerance) / 100). CPU and memory remain hard-fit constraints. Existing NodeLocations entries inflate a node's effective usage to avoid clumping when VMs are cloned in rapid succession (the Proxmox API takes seconds to reflect just-cloned VMs).

API additions to SchedulerHints:

- cpuAdjustment (existing, default 0): overcommit factor; non-zero enables CPU-aware scheduling.
- memoryTolerance (new, default 0): lower = more protective of memory.
- cpuTolerance (new, default 100): higher = more acceptable to saturate CPU.

These defaults are aligned with the memory-first priority stated by the maintainers in the issue thread (ionos-cloud#724) for HA workloads, and deliberately differ from the values proposed in #issuecomment-4259180950 — that proposal was inverted from the priority and would have caused the default CPU-aware mode to ignore memory headroom. The default cpuAdjustment: 0 reproduces today's memory-only round-robin, preserving full backward compatibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert all v1alpha1 SchedulerHints additions: proxmoxcluster_types.go is now byte-identical to upstream. The three v1alpha2-only fields (cpuAdjustment, memoryTolerance, cpuTolerance) are dropped on v1alpha2 → v1alpha1 down-conversion and restored from the conversion-data annotation via a shared restoreSchedulerHints helper used by both ProxmoxCluster and ProxmoxClusterTemplate ConvertTo paths.

Operators writing CRs with apiVersion: v1alpha1 receive a kubectl "unknown field" warning if they try to set the new fields, and the values are silently pruned by the apiserver before storage. This is the expected behaviour for fields that live only in the storage version (v1alpha2); the v1alpha1 schema stays untouched.

Also trim the memoryTolerance/cpuTolerance godoc to user-facing semantics only (no internal weighting formula, no scoring details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: keep the dereferenced int32 type of NumSockets and NumCores instead of casting to int at the multiplication site. Propagate int32 through selectNodeByToleranceScoring and buildInsufficientError; cast at the nodeInfo boundary where the resourceClient interface still uses int (Proxmox API type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grammar fix ("allows to allocate" -> "allows allocating") and mirror
the MemoryAdjustment example for sub-100 values (95 reserves capacity
for the host).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: collapse the round-robin and tolerance-weighted paths into a single algorithm. selectNode now always runs the same curve-based scoring; when cpuAdjustment is 0 CPU is not fetched, not enforced as a hard-fit constraint, and does not contribute to the score, so the algorithm degrades to memory-only curve scoring without a separate code path.

- Remove selectNodeByRoundRobin and the sortByReplicas/ScheduledVMs helpers; selectNode owns the entire scheduling logic.
- Move pending inflation from the hard-fit check into the score calculation only, so up-to-date Proxmox availability data is never double-counted; the score still uses pending inflation as a soft anti-clumping signal.
- Restore the original code comments that lived in selectNode upstream (convert to bytes, count placements per node, log construction).
- Update the cpuAdjustment godoc to describe behaviour without referring to a separate algorithm or to "round-robin".

The legacy round-robin tests are updated to reflect the new behaviour. Cumulative placement tallies and InsufficientMemoryError trigger points are unchanged; only the order within a pass differs because the new algo picks by memory headroom score rather than by replica count first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 1290a5f to 63698a4.
Issue #, if available: #724
Description of changes:
Map memory-first priority into a tolerance-weighted CPU/memory scheduler so heterogeneous Proxmox clusters stop concentrating VMs on smaller-CPU nodes by accident.
How it works
When `schedulerHints.cpuAdjustment` is set to a non-zero value, the scheduler switches from round-robin to a tolerance-weighted scoring that considers both CPU and memory headroom on every candidate node:

- Per-resource scores are shaped by a non-linear curve (`saturationExponent = 2.0`) so already-saturated hosts are heavily penalized.
- Free capacity in the overcommit range is discounted versus the physical range (`overbookDiscount = 0.5`).
- Operators tune preference via `memoryTolerance` and `cpuTolerance`, inverted into weights via `weight = (100 − tolerance) / 100`. High tolerance = operator OK with that resource saturating = less influence on the decision.
- CPU and memory remain hard-fit constraints; nodes that cannot fit the request are rejected with `InsufficientCPUError`/`InsufficientMemoryError`.
- Existing entries in `status.NodeLocations` inflate a node's effective usage, so VMs cloned in rapid succession (the Proxmox API takes seconds to reflect just-cloned VMs) don't all converge on the same target.
API changes (additive, backwards-compatible)

| Field | Type | Default |
|---|---|---|
| `cpuAdjustment` (existing) | `*int64` | `0` |
| `memoryTolerance` (new) | `*int64` (0..100) | `0` |
| `cpuTolerance` (new) | `*int64` (0..100) | `100` |

`cpuAdjustment: 0` (default) reproduces today's memory-only round-robin — full backward compatibility.

A note on the chosen defaults
The defaults ship aligned with the memory-first priority discussed in the issue thread for HA workloads: when CPU-aware scheduling is enabled, memory is fully protected and CPU acts only as a hard-fit constraint until the operator opts in to spreading.
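Spelled out with the weight formula from above (a tiny check, not code from the PR), the shipped defaults work out to:

```go
package main

import "fmt"

// weight inverts a tolerance per the PR's formula: weight = (100 - tolerance) / 100.
func weight(tolerance int64) float64 { return float64(100-tolerance) / 100 }

func main() {
	// Shipped defaults: memoryTolerance = 0, cpuTolerance = 100.
	fmt.Printf("memoryWeight: %.1f\n", weight(0))   // 1.0: memory fully drives the score
	fmt.Printf("cpuWeight:    %.1f\n", weight(100)) // 0.0: CPU acts only as a hard-fit constraint
}
```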
These values deliberately differ from those I proposed earlier in the thread (`memoryTolerance: 100`, `cpuTolerance: 0`). Re-checking that proposal end-to-end, those values are inverted from the priority discussed earlier and would have caused the default CPU-aware mode to ignore memory headroom. Happy to revisit if you'd rather keep my original proposed defaults — the change is two constants and a couple of test fixtures.

Testing performed:
- `make lint test verify` pass locally (envtest 1.33.0). The `pkg/convert/sentinel.go` `goconst`/`nolintlint` warnings are pre-existing on `main` and unrelated to this PR.
- Test coverage of `internal/service/scheduler` is 87.1%.
- Ran `MachineDeployment` rollouts on a 5-machine workload cluster with `cpuAdjustment: 200` and confirmed `selectNodeByToleranceScoring` runs on every new VM, applies the documented defaults (`memoryTolerance=0`, `cpuTolerance=100` → `cpuWeight=0`, memory drives), and the anti-clumping `NodeLocations` inflation correctly reduces a node's score after each placement (observed `scoreMem` drop from `0.807` to `0.762` between two consecutive decisions on the same node within ~1s).

Attribution
Developed with assistance from Claude (Opus 4.7), as per the AI attribution practice followed in previous PRs.