feat(scheduler): tolerance-weighted CPU/memory aware VM scheduling #747

Open
emanuelebosetti wants to merge 6 commits into ionos-cloud:main from ElmecOSS:feat/scheduler-saturation

Conversation

@emanuelebosetti

Issue #, if available: #724

Description of changes:

Carry the memory-first priority into a tolerance-weighted CPU/memory scheduler, so heterogeneous Proxmox clusters stop accidentally concentrating VMs on nodes with fewer CPU cores.

How it works

When schedulerHints.cpuAdjustment is set to a non-zero value, the scheduler switches from round-robin to tolerance-weighted scoring that considers both CPU and memory headroom on every candidate node:

  • Per-resource score is shaped by a non-linear curve (saturationExponent = 2.0) so already-saturated hosts are heavily penalized.
  • Free capacity in the physical range counts fully (1 point/unit); free capacity in the overcommit range is discounted (overbookDiscount = 0.5).
  • Per-resource weights are derived from memoryTolerance and cpuTolerance via weight = (100 − tolerance) / 100: a high tolerance means the operator is fine with that resource saturating, so it carries less influence on the decision (see the sketch after this list).
  • CPU and memory remain hard-fit constraints; nodes that cannot fit the request are excluded upfront with InsufficientCPUError / InsufficientMemoryError.
  • Anti-clumping: existing placements recorded in status.NodeLocations inflate a node's effective usage, so VMs cloned in rapid succession (Proxmox API takes seconds to reflect just-cloned VMs) don't all converge on the same target.
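
A minimal sketch of the scoring path described above. The constants and formulas are taken from this description and the diff; the exact signatures are illustrative, not the code in the PR:

```go
package scheduler

import "math"

const (
	saturationExponent = 2.0 // non-linear penalty: saturated hosts score disproportionately low
	overbookDiscount   = 0.5 // free capacity in the overcommit range counts half
)

// resourceScore rates one resource on one node: free physical capacity counts
// fully, free capacity in the overcommit range is discounted, and the result
// is shaped by the saturation curve.
func resourceScore(used, phys, alloc float64) float64 {
	freePhys := math.Max(0, phys-used)
	freeOver := math.Max(0, alloc-math.Max(used, phys))
	raw := (freePhys + overbookDiscount*freeOver) / phys
	return math.Pow(raw, saturationExponent)
}

// toleranceWeight turns an operator-facing tolerance (0..100) into a weight:
// high tolerance = OK with saturating that resource = less influence.
func toleranceWeight(tolerance int64) float64 {
	return float64(100-tolerance) / 100
}

// nodeScore combines the weighted per-resource scores. With the defaults
// (memoryTolerance=0, cpuTolerance=100) the CPU weight is 0, so memory
// headroom alone drives the decision.
func nodeScore(memScore, cpuScore float64, memoryTolerance, cpuTolerance int64) float64 {
	return toleranceWeight(memoryTolerance)*memScore + toleranceWeight(cpuTolerance)*cpuScore
}
```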

API changes (additive, backwards-compatible)

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| cpuAdjustment (existing) | *int64 | 0 | Overcommit factor; non-zero enables CPU-aware scheduling |
| memoryTolerance (new) | *int64 (0..100) | 0 | Lower = more protective of memory |
| cpuTolerance (new) | *int64 (0..100) | 100 | Higher = more acceptable to saturate CPU |

cpuAdjustment: 0 (default) reproduces today's memory-only round-robin — full backward compatibility.
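
For illustration only, the three hints as Go values; the struct here merely mirrors the table (the real SchedulerHints type lives in the provider's v1alpha2 API package, and its exported field names may differ):

```go
package example

import "k8s.io/utils/ptr"

// SchedulerHints mirrors the table above for the example's sake.
type SchedulerHints struct {
	CPUAdjustment   *int64 // overcommit factor; non-zero enables CPU-aware scheduling
	MemoryTolerance *int64 // 0..100, lower = more protective of memory
	CPUTolerance    *int64 // 0..100, higher = more acceptable to saturate CPU
}

// A 2:1 CPU overcommit with the shipped defaults for both tolerances.
var exampleHints = SchedulerHints{
	CPUAdjustment:   ptr.To[int64](200),
	MemoryTolerance: ptr.To[int64](0),
	CPUTolerance:    ptr.To[int64](100),
}
```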

A note on the chosen defaults

The defaults ship aligned with the memory-first priority discussed in the issue thread for HA workloads: when CPU-aware scheduling is enabled, memory is fully protected and CPU acts only as a hard-fit constraint until the operator opts in to spreading.

These values deliberately differ from those I proposed earlier in the thread (memoryTolerance: 100, cpuTolerance: 0). Re-checking that proposal end-to-end showed those values were inverted relative to the priority discussed above and would have caused the default CPU-aware mode to ignore memory headroom. Happy to revisit if you'd rather keep my originally proposed defaults; the change is two constants and a couple of test fixtures.

Testing performed:

  • Unit tests for tolerance scoring covering: defaults follow memory, opposite polarity follows CPU, overbook penalty, hard-fit constraints (CPU and memory), heterogeneous-cluster regression, anti-clumping under rapid scale.
  • make lint test verify passes locally (envtest 1.33.0). The pkg/convert/sentinel.go goconst/nolintlint warnings are pre-existing on main and unrelated to this PR.
  • Algorithm coverage on internal/service/scheduler is 87.1%.
  • Validated on a real Proxmox cluster (10 nodes across two physical hosts): triggered MachineDeployment rollouts on a 5-machine workload cluster with cpuAdjustment: 200 and confirmed that selectNodeByToleranceScoring runs on every new VM, that it applies the documented defaults (memoryTolerance=0, cpuTolerance=100, so cpuWeight=0 and memory drives the score), and that the anti-clumping NodeLocations inflation correctly reduces a node's score after each placement (observed scoreMem drop from 0.807 to 0.762 between two consecutive decisions on the same node within ~1s).

Attribution

Developed with assistance from Claude (Opus 4.7), as per the AI attribution practice followed in previous PRs.

Comment thread api/v1alpha1/proxmoxcluster_types.go
Comment thread api/v1alpha2/proxmoxcluster_types.go Outdated
}

requestedMemory := uint64(ptr.Deref(machine.Spec.MemoryMiB, 0)) * 1024 * 1024
requestedCPUs := int(ptr.Deref(machine.Spec.NumSockets, 0)) * int(ptr.Deref(machine.Spec.NumCores, 0))
Collaborator

I wouldn't cast to int here, just use the dereferenced types of Num{Sockets,Cores}.

Author

Done — requestedCPUs is now int32

Comment thread internal/service/scheduler/vmscheduler.go Outdated
// nodes closer to saturation score disproportionately less than empty nodes.
saturationExponent = 2.0
)

Collaborator

How did you arrive at these constants? They look pretty aggressive. I know that I said I wanted to punish overcommit when physical resources are available but maybe not that hard :)

Author

These were picked against the heterogeneous cluster test (24 VMs across mixed 64/32-core nodes): legacy lands at 4.4x CPU spread, current constants bring it to 1.43x. What shape did you have in mind?

Collaborator

The way I read this algo, these consts produce an exponential decay curve (decreases aggressively early on and then tapers off). True, this does produce a very nice and evened-out allocation when the nodes are under-booked. However, as load increases, the contribution of any additional load has a weaker effect on the score.
I was thinking of something closer in shape to the first quarter of a cosine curve where the contribution of any additional load has a stronger effect on the score as load increases. This would mean a less even allocation when under-booked but any over-booked machines are very sharply scored down.
If anything, I'd flip these values.

Author

I extended the math analysis with Claude to also evaluate a logarithmic family, because once you move off the current curve there's a practical issue worth weighing before committing to the cosine shape. Sharing the full picture below.

How raw is computed today

In resourceScore, before applying any curve, we compute a single "free capacity ratio" per resource:

raw := (freePhys + overbookDiscount*freeOver) / phys

with freePhys = max(0, phys − used) and freeOver = max(0, alloc − max(used, phys)). The split is what gives the physical zone its full point per unit and the overcommit zone only half a point per unit. A worked example for a 32-core node with cpuAdjustment: 300 (so alloc = 96):

| used | freePhys | freeOver | raw |
| --- | --- | --- | --- |
| 0 (empty) | 32 | 64 | 2.0 |
| 16 | 16 | 64 | 1.5 |
| 32 (physical limit hit) | 0 | 64 | 1.0 |
| 64 (deep into overcommit) | 0 | 32 | 0.5 |
| 96 (full) | 0 | 0 | 0.0 |

raw can exceed 1. Without overcommit (adjustment = 100) we have alloc = phys, so freeOver = 0 and raw ∈ [0, 1]. The moment overcommit is enabled (adjustment > 100) the empty-node raw is 1 + 0.5·(adjustment/100 − 1), which is 1.5 at 2:1, 2.0 at 3:1, 3.0 at 5:1. The current curve doesn't care since it's monotonically increasing for any non-negative r, but it's a constraint to keep in mind when picking the replacement.
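
As a tiny illustrative fragment (not code from the PR), the empty-node values quoted above fall out of one line:

```go
// emptyNodeRaw gives raw for a node with used = 0:
// raw = (phys + overbookDiscount*(alloc - phys)) / phys = 1 + 0.5*(adjustment/100 - 1).
func emptyNodeRaw(adjustmentPercent float64) float64 {
	return 1 + 0.5*(adjustmentPercent/100-1)
}

// emptyNodeRaw(200) == 1.5, emptyNodeRaw(300) == 2.0, emptyNodeRaw(500) == 3.0
```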

Curve candidates side by side
[Plot: the four curve candidates over raw ∈ [0, 2] — raw 2.0 = empty node, raw 0 = full node]
| raw | current (raw²) | sin(raw·π/2) | log₂(1+raw) | log(1+9·raw)/log(10) |
| --- | --- | --- | --- | --- |
| 0.0 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.1 | 0.010 | 0.156 | 0.138 | 0.279 |
| 0.3 | 0.090 | 0.454 | 0.379 | 0.568 |
| 0.5 | 0.250 | 0.707 | 0.585 | 0.740 |
| 0.7 | 0.490 | 0.891 | 0.766 | 0.863 |
| 0.9 | 0.810 | 0.988 | 0.926 | 0.959 |
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1.5 | 2.250 | 0.707 | 1.322 | 1.161 |
| 2.0 | 4.000 | 0.000 | 1.585 | 1.279 |

The sin rows past r=1 are the issue. The sine peaks at r=1 and folds back, so a fully empty overcommit-enabled node ends up scoring worse than a half-loaded one. To use the sine we'd need to either clamp raw at 1 (which makes two empty nodes with different overcommit settings indistinguishable) or re-normalize the denominator so raw stays in [0,1] by construction. Both work, both are small changes, but they're extra cognitive load for anyone reading the algorithm.

The logarithm avoids the issue: log₂(1+r) and log(1+9r)/log(10) are monotonic on the whole real line — the plot just keeps climbing into the overcommit zone, no special casing needed. k=9 controls how aggressive the curvature is.

Slope behaviour (the "discrimination" signal)

Slope at the two ends of the physical range tells you how aggressively the scheduler differentiates near-empty nodes vs. near-saturated nodes. Higher ratio = closer to the "sharply scored down when over-booked, less even when under-booked" property you described:

| Curve | slope at r=0 | slope at r=1 | ratio | shape |
| --- | --- | --- | --- | --- |
| raw² (current) | 0.00 | 2.00 | inverted | concave up, wrong direction |
| sin(r·π/2) | 1.57 | 0.00 | ∞ | concave down, flat plateau at r=1 |
| log₂(1+r) | 1.44 | 0.72 | 2:1 | concave down, mild plateau |
| log(1+9r)/log(10) | 3.91 | 0.39 | 10:1 | concave down, sharper plateau |

The cosine quarter is the most extreme implementation of the property you described (literal "first quarter of a cosine"). The logarithm with k=9 is close in spirit (10:1 ratio, plateau at the top) and removes the overcommit edge case entirely — at the cost of a less symmetric shape.
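
To make the two shapes concrete, hypothetical implementations of the candidates (not code from this PR; the sine variant assumes raw has already been re-normalized into [0, 1]):

```go
package scheduler

import "math"

// sineScore is the cosine-quarter shape expressed as sin(r·π/2); it peaks at
// r = 1 and folds back past it, hence the need to clamp or re-normalize raw.
func sineScore(r float64) float64 {
	return math.Sin(r * math.Pi / 2)
}

// logScore is log(1 + k·r) / log(1 + k): monotonic for any non-negative raw,
// with k tuning how sharply near-saturated nodes are scored down
// (k = 9 reproduces the log(1+9r)/log(10) column and the 10:1 slope ratio).
func logScore(r, k float64) float64 {
	return math.Log(1+k*r) / math.Log(1+k)
}
```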

Two viable directions

  1. Cosine quarter, with the denominator re-normalized to keep raw ∈ [0,1]. Most faithful to your original description, requires a small touch-up in resourceScore to handle overcommit.
  2. Logarithm, log(1 + k·r) / log(1 + k) with k=9 as default. Monotonic for any raw, no edge cases, tunable steepness via a single constant — but the shape diverges from "cosine".

Which way would you go?

@wikkyk (Collaborator) commented May 11, 2026

I like where this is going but it needs some work.

Please let's only have one scheduling algo.
You can still use the new algo when CPU-aware placement is disabled. Overcommit handling being on a curve rather than linear is useful: for small overcommit it's hardly a visible change, and for high overcommit it's an improvement. You shouldn't need to enable CPU-aware placement to get the benefit.
Plus, two algos are more expensive to maintain than one.
Behaviour is more difficult to reason about when there are two behaviours to consider, which is extra brain load for ops.

Comment thread api/v1alpha1/conversion.go Outdated
Comment thread api/v1alpha2/proxmoxcluster_types.go Outdated
Comment thread api/v1alpha2/proxmoxcluster_types.go Outdated
@wikkyk (Collaborator) commented May 11, 2026

With the defaults of MemoryTolerance=0 and CPUTolerance=100 plus the curve-based overcommit scoring, we don't really need to switch tolerance on overcommit. Instead, we can just always use the curve-based algo, which would also simplify it as there is no need for the cpuAware branches then.

@emanuelebosetti (Author)

Unified into a single algorithm — selectNodeByRoundRobin is gone, selectNode always runs the curve-based scoring. With cpuAdjustment=0 CPU is not fetched, not enforced, and contributes 0 to the score, so the algorithm naturally degrades to memory-only without a separate code path. Unit tests cover the heterogeneous case and the rapid-scale anti-clumping scenario; I'd like to validate it on a real Proxmox cluster tomorrow before considering this fully done.

byMemory := make(sortByAvailableMemory, len(allowedNodes))
memoryAdjustment := schedulerHints.GetMemoryAdjustment()
cpuAdjustment := schedulerHints.GetCPUAdjustment()
cpuEnabled := cpuAdjustment > 0
Collaborator

I don't think cpu needs any special treatment. The algo should just do the right thing for both memory and cpu and work the same way for both resources.

Collaborator

In the ideal world, right after reading the counts and adjustments, the algo should be generic. This will make it easy to add more kinds of resources to SchedulerHints, e.g. storage.
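
One possible shape for that genericity, as a sketch (the type and names are hypothetical, not from this PR):

```go
package scheduler

import "math"

// resourceDemand is one entry per resource kind; adding storage would mean
// appending another element rather than another branch in the scheduler.
type resourceDemand struct {
	name      string  // "memory", "cpu", later perhaps "storage"
	requested float64 // what the VM asks for
	used      float64 // current usage on the candidate node, incl. pending inflation
	phys      float64 // physical capacity
	alloc     float64 // capacity after the overcommit adjustment
	weight    float64 // (100 - tolerance) / 100
}

// scoreNode treats every resource identically: hard-fit check first, then the
// weighted curve-based score described in the PR text.
func scoreNode(resources []resourceDemand) (total float64, fits bool) {
	for _, r := range resources {
		if r.used+r.requested > r.alloc {
			return 0, false // hard-fit constraint violated
		}
		freePhys := math.Max(0, r.phys-(r.used+r.requested))
		freeOver := math.Max(0, r.alloc-math.Max(r.used+r.requested, r.phys))
		raw := (freePhys + 0.5*freeOver) / r.phys
		total += r.weight * math.Pow(raw, 2.0)
	}
	return total, true
}
```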

// primary: curve-based score (higher wins); tie-break: raw memory
// headroom (kicks in when many nodes saturate under pending inflation).
if candidates[i].scoreTotal != candidates[j].scoreTotal {
return candidates[i].scoreTotal > candidates[j].scoreTotal
Collaborator

You've inverted the comparison (SliceStable takes a 'less' function but you are returning the greater value). This needs to be documented better, it's easy to miss; although I would prefer for this algo to work properly given the correct sorting direction.
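
A sketch of one way to keep the direction obvious (assuming the candidates slice from the diff above and the standard library sort package; freeMemory stands in for whatever field holds the raw memory headroom tie-break):

```go
// less in the conventional sense: lower score sorts first, so the best
// candidate ends up at the end of the slice.
sort.SliceStable(candidates, func(i, j int) bool {
	if candidates[i].scoreTotal != candidates[j].scoreTotal {
		return candidates[i].scoreTotal < candidates[j].scoreTotal
	}
	return candidates[i].freeMemory < candidates[j].freeMemory
})
best := candidates[len(candidates)-1] // highest curve-based score wins
```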

wikkyk force-pushed the main branch 2 times, most recently from 4b68fd4 to a1d0e00 on May 13, 2026 at 12:26
Address upstream issue ionos-cloud#724: on heterogeneous Proxmox clusters with
varying CPU core counts the round-robin scheduler causes severe CPU
overcommit on smaller nodes.

When schedulerHints.cpuAdjustment is non-zero the scheduler switches
from round-robin to a tolerance-weighted scoring that considers both
CPU and memory headroom. Per-resource scores are shaped by a non-linear
curve so already-saturated hosts are heavily penalized and free
capacity in the overcommit range is discounted versus the physical
range. Operators tune preference via memoryTolerance and cpuTolerance,
both inverted into weights internally (weight = (100 - tolerance) /
100). CPU and memory remain hard-fit constraints. Existing
NodeLocations entries inflate a node's effective usage to avoid
clumping when VMs are cloned in rapid succession (Proxmox API takes
seconds to reflect just-cloned VMs).

API additions to SchedulerHints:
- cpuAdjustment (existing, default 0): overcommit factor; non-zero
  enables CPU-aware scheduling.
- memoryTolerance (new, default 0): lower = more protective of memory.
- cpuTolerance (new, default 100): higher = more acceptable to
  saturate CPU.

These defaults are aligned with the memory-first priority stated by
the maintainers in the issue thread (ionos-cloud#724) for HA workloads, and
deliberately differ from the values proposed in
#issuecomment-4259180950 — that proposal was inverted from the
priority and would have caused the default CPU-aware mode to ignore
memory headroom.

The default cpuAdjustment: 0 reproduces today's memory-only round-robin,
preserving full backward compatibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
emanuelebosetti and others added 5 commits May 13, 2026 15:17
Revert all v1alpha1 SchedulerHints additions: proxmoxcluster_types.go
is now byte-identical to upstream. The three v1alpha2-only fields
(cpuAdjustment, memoryTolerance, cpuTolerance) are dropped on
v1alpha2 → v1alpha1 down-conversion and restored from the
conversion-data annotation via a shared restoreSchedulerHints helper
used by both ProxmoxCluster and ProxmoxClusterTemplate ConvertTo paths.

Operators writing CRs with apiVersion: v1alpha1 receive a kubectl
"unknown field" warning if they try to set the new fields, and the
values are silently pruned by the apiserver before storage. This is
the expected behaviour for fields that live only in the storage
version (v1alpha2); the v1alpha1 schema stays untouched.

Also trim the memoryTolerance/cpuTolerance godoc to user-facing
semantics only (no internal weighting formula, no scoring details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: keep the dereferenced int32 type of NumSockets
and NumCores instead of casting to int at the multiplication site.
Propagate int32 through selectNodeByToleranceScoring and
buildInsufficientError; cast at the nodeInfo boundary where the
resourceClient interface still uses int (Proxmox API type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grammar fix ("allows to allocate" -> "allows allocating") and mirror
the MemoryAdjustment example for sub-100 values (95 reserves capacity
for the host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: collapse the round-robin and tolerance-weighted
paths into a single algorithm. selectNode now always runs the same
curve-based scoring; when cpuAdjustment is 0 CPU is not fetched, not
enforced as a hard-fit constraint, and does not contribute to the
score, so the algorithm degrades to memory-only curve scoring without
a separate code path.

- Remove selectNodeByRoundRobin and the sortByReplicas/ScheduledVMs
  helpers; selectNode owns the entire scheduling logic.
- Move pending inflation from the hard-fit check into the score
  calculation only, so up-to-date Proxmox availability data is never
  double-counted; the score still uses pending inflation as a soft
  anti-clumping signal.
- Restore the original code comments that lived in selectNode upstream
  (convert to bytes, count placements per node, log construction).
- Update cpuAdjustment godoc to describe behaviour without referring to
  a separate algorithm or to "round-robin".

The legacy round-robin tests are updated to reflect the new behaviour.
Cumulative placement tallies and InsufficientMemoryError trigger points
are unchanged; only the order within a pass differs because the new
algo picks by memory headroom score rather than by replica count first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
68.3% Coverage on New Code (required ≥ 81%)

See analysis details on SonarQube Cloud


Labels

feature New feature
