feat: Capabilities should match between InstanceType and Machine when creating or updating Instance#387
feat: Capabilities should match between InstanceType and Machine when creating or updating Instance#387hwadekar-nv wants to merge 8 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughCreate and Update instance flows now perform capability-compatibility checks between the selected Machine and the InstanceType. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Handler as API Handler
participant Common as util/common
participant DB as Database
Client->>Handler: Create/Update instance request
Handler->>DB: load InstanceType, load Machine (if specified)
Handler->>Common: VerifyInstanceTypeMachineCapabilitiesMatch(ctx, logger, dbSession, InstanceTypeID, MachineID)
Common->>DB: MatchInstanceTypeCapabilitiesForMachines(...)
DB-->>Common: capabilities match / mismatch
alt capabilities match
Common-->>Handler: nil
Handler->>DB: proceed to allocate/update instance
DB-->>Handler: success
Handler-->>Client: 200/201 OK
else capabilities mismatch or matcher error
Common-->>Handler: *cutil.APIError (400 or matcher error)
Handler-->>Client: error response (propagated API error)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-04-14 19:08:53 UTC | Commit: 3832749 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
api/pkg/api/handler/instance.go (1)
2485-2496: Consider relocating capability validation earlier in the update flow.The capability check is currently positioned after extensive interface validation (lines 2271–2481). If the capability mismatch is detected, all preceding validation work is discarded. Moving this check immediately after the
machineassignment at line 2174 would provide a fail-fast behavior, reducing unnecessary computation for requests that will ultimately be rejected.Additionally, this validation executes for all update requests where
instance.InstanceTypeIDandmachineare both non-nil, including metadata-only updates (name, description, labels). If capability drift between Instance Type and Machine is not expected post-creation, consider whether this check is warranted on every update or only when specific fields (e.g., interfaces) are modified.♻️ Suggested relocation for fail-fast behavior
Move the capability check to immediately after line 2174 where
machineis assigned:tenant := instance.Tenant site := instance.Site vpc := instance.Vpc machine := instance.Machine +// Verify here if Instance Type and Machine capabilities match +if instance.InstanceTypeID != nil && machine != nil { + isMatch, _, apiErr := common.MatchInstanceTypeCapabilitiesForMachines(ctx, logger, uih.dbSession, *instance.InstanceTypeID, []string{machine.ID}) + if apiErr != nil { + return cutil.NewAPIErrorResponse(c, apiErr.Code, apiErr.Message, apiErr.Data) + } + + if !isMatch { + return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("Capabilities for Machine: %v do not match Instance Type's Capabilities", machine.ID), nil) + } +} + // Confirm that the Instance's org matches the org sent in the requestThen remove the duplicate block at lines 2485–2496.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/instance.go` around lines 2485 - 2496, Relocate the capability validation (call to common.MatchInstanceTypeCapabilitiesForMachines) to run immediately after the machine is resolved (after the assignment to machine) so it fails fast; ensure you keep the same parameters (ctx, logger, uih.dbSession, *instance.InstanceTypeID, []string{machine.ID}) and return the same API error responses on failure. Also gate this check so it only runs when it matters (e.g., when InstanceTypeID or machine or interface-related fields are being changed in the update request) rather than for metadata-only updates, and remove the duplicate block currently present later in the function (the block that checks instance.InstanceTypeID != nil && machine != nil around the previous MatchInstanceTypeCapabilitiesForMachines call).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@api/pkg/api/handler/instance.go`:
- Around line 2485-2496: Relocate the capability validation (call to
common.MatchInstanceTypeCapabilitiesForMachines) to run immediately after the
machine is resolved (after the assignment to machine) so it fails fast; ensure
you keep the same parameters (ctx, logger, uih.dbSession,
*instance.InstanceTypeID, []string{machine.ID}) and return the same API error
responses on failure. Also gate this check so it only runs when it matters
(e.g., when InstanceTypeID or machine or interface-related fields are being
changed in the update request) rather than for metadata-only updates, and remove
the duplicate block currently present later in the function (the block that
checks instance.InstanceTypeID != nil && machine != nil around the previous
MatchInstanceTypeCapabilitiesForMachines call).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7f16406f-ff24-400d-862d-67e5e6a8e861
📒 Files selected for processing (2)
api/pkg/api/handler/instance.goapi/pkg/api/handler/instance_test.go
Test Results6 546 tests - 2 123 6 545 ✅ - 2 124 6m 24s ⏱️ - 2m 24s For more details on these errors, see this check. Results for commit 1553775. ± Comparison against base commit d4e1638. This pull request removes 2128 and adds 5 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
|
Did the caps of a machine change after it was associated with an instance type and trigger a bug report or something? Would be good to add more details to the description of the PR since we already compare machine caps and instance type caps when associating machines with instance types. This PR is clearly trying to catch something. I'd add the details to the description. |
|
Thanks @bcavnvidia, updated the description, as we saw in the past when creating an instance fails at CORE due to InactiveDevices. However, the fix also good to check other capabilities as well. |
61268ac to
d85ba62
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
api/pkg/api/handler/instance_test.go (1)
3989-3993: Extract repeated capability seeding blocks to local helpers.Lines 3989-3993, 3999-4001, 4075-4077, and 4141-4143 duplicate nearly identical capability inserts, which increases fixture drift risk.
♻️ Suggested refactor
+addNetworkDPUCap := func(m *cdbm.Machine) { + common.TestBuildMachineCapability(t, dbSession, &m.ID, nil, cdbm.MachineCapabilityTypeNetwork, + "MT42822 BlueField-2 integrated ConnectX-6 Dx network controller", nil, nil, + cdb.GetStrPtr("Mellanox Technologies"), cdb.GetIntPtr(2), cdb.GetStrPtr("DPU"), nil) +} +addNVLinkGPUCap := func(m *cdbm.Machine) { + common.TestBuildMachineCapability(t, dbSession, &m.ID, nil, cdbm.MachineCapabilityTypeGPU, + "NVIDIA GB200", nil, nil, + cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) +} + +for _, m := range []*cdbm.Machine{mc5, mc7, mc8, mc9} { + addNetworkDPUCap(m) + addNVLinkGPUCap(m) +}As per coding guidelines: "
**/*.go: Review the Go code, point out issues relative to principles of clean code, expressiveness, and performance."Also applies to: 3999-4001, 4075-4077, 4141-4143
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/instance_test.go` around lines 3989 - 3993, The test duplicates repeated TestBuildMachineCapability calls for multiple machine IDs (e.g., mc5.ID, mc7.ID, mc8.ID, mc9.ID) with identical parameters; extract these into a local helper in instance_test.go such as seedNetworkCapability(dbSession, machineID) or seedMachineCapabilities(dbSession, []machineIDs) that calls TestBuildMachineCapability with the common args (cdbm.MachineCapabilityTypeNetwork, "MT42822 BlueField-2 integrated ConnectX-6 Dx network controller", cdb.GetStrPtr("Mellanox Technologies"), cdb.GetIntPtr(2), cdb.GetStrPtr("DPU"), etc.), then replace the repeated direct TestBuildMachineCapability invocations (the blocks around mc5.ID, mc7.ID, mc8.ID, mc9.ID and the similar groups at the other locations) with calls to this helper to reduce duplication and fixture drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 3402-3463: Add explicit table-driven test cases in
instance_test.go to cover the inactiveDevices mismatch path: add one
APIInstanceCreateRequest case and one APIInstanceUpdateRequest case where the
Machine under test has an inactiveDevices list containing the interface
index(es) referenced by the request's Interfaces; use the same test table
patterns as the existing create/update tests (the test structs using
fields{dbSession, tc, cfg} and args{reqData/reqUpdate, reqMachine, reqOrg,
reqUser, respCode, respMessage}), set the Machine fixture's inactiveDevices to
include the target index, and assert the handler rejects the request with the
expected HTTP error and message indicating requested device indices are inactive
(use the same respCode/respMessage style as the other cases so verification and
verifyChildSpanner logic still applies).
In `@api/pkg/api/handler/instance.go`:
- Around line 1037-1042: The current check calling
common.ResolveInstanceTypeMachineCapabilitiesMatch(ctx, logger, cih.dbSession,
*instanceTypeID, machine.ID) returns a hard 400 when a single machine candidate
is incompatible; instead move this capability validation into the
machine-selection loop used by GetUnallocatedMachineForInstanceType(...) so each
candidate is validated and incompatible machines are skipped (and optionally
logged) rather than failing the whole create; only return an error after the
loop if no compatible unallocated machine is found for the given instanceTypeID;
update any references to instanceTypeID and machine/ machine.ID accordingly and
remove the early return in the current pre-loop location.
In `@api/pkg/api/handler/util/common/common.go`:
- Around line 955-963: ResolveInstanceTypeMachineCapabilitiesMatch currently
cannot join the caller's transaction so MatchInstanceTypeCapabilitiesForMachines
reads capabilities with nil tx, leaving a TOCTOU window; modify
ResolveInstanceTypeMachineCapabilitiesMatch signature to accept a
transaction/lock parameter (e.g., tx *cdb.Session or ctxTx) and pass it through
to MatchInstanceTypeCapabilitiesForMachines (and any downstream capability
reads) so the capability lookup runs under the caller's transaction/lock; ensure
all capability reads/locks inside MatchInstanceTypeCapabilitiesForMachines (or
its helpers) accept and use the forwarded tx instead of nil to guarantee the
check is performed under the same guard used for machine-capability updates.
---
Nitpick comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 3989-3993: The test duplicates repeated TestBuildMachineCapability
calls for multiple machine IDs (e.g., mc5.ID, mc7.ID, mc8.ID, mc9.ID) with
identical parameters; extract these into a local helper in instance_test.go such
as seedNetworkCapability(dbSession, machineID) or
seedMachineCapabilities(dbSession, []machineIDs) that calls
TestBuildMachineCapability with the common args
(cdbm.MachineCapabilityTypeNetwork, "MT42822 BlueField-2 integrated ConnectX-6
Dx network controller", cdb.GetStrPtr("Mellanox Technologies"),
cdb.GetIntPtr(2), cdb.GetStrPtr("DPU"), etc.), then replace the repeated direct
TestBuildMachineCapability invocations (the blocks around mc5.ID, mc7.ID,
mc8.ID, mc9.ID and the similar groups at the other locations) with calls to this
helper to reduce duplication and fixture drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 4aec791b-885b-450a-8e10-ef7407e984fd
📒 Files selected for processing (3)
api/pkg/api/handler/instance.goapi/pkg/api/handler/instance_test.goapi/pkg/api/handler/util/common/common.go
39066e3 to
281c5e5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/pkg/api/handler/util/common/common.go (1)
332-344:⚠️ Potential issue | 🟠 MajorRemove the shuffle before capability fallback.
Capability-aware auto-selection now depends on candidate order, but this still randomizes the machine list. That makes both the chosen machine and the final mismatch reported to the caller non-deterministic. Please switch this path to a stable ordering before iterating candidates.
Based on learnings, machine auto-selection must filter by capability match inside
GetUnallocatedMachineForInstanceType: iterate candidates in deterministic order, callcommon.ResolveInstanceTypeMachineCapabilitiesMatchper machine, pick the first that passes, and fall back to the next candidate on mismatch.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/util/common/common.go` around lines 332 - 344, The current randomization of the machines slice (rand.Shuffle) makes capability-aware selection non-deterministic; remove the shuffle and instead enforce a stable deterministic ordering of the machines slice (e.g., sort by a stable key such as machine ID or creation timestamp) before iterating. In GetUnallocatedMachineForInstanceType, iterate the deterministically-ordered candidates and for each candidate call common.ResolveInstanceTypeMachineCapabilitiesMatch; select the first machine that returns a match and on mismatch continue to the next candidate so the function falls back deterministically. Ensure references to the machines slice and the GetUnallocatedMachineForInstanceType and ResolveInstanceTypeMachineCapabilitiesMatch symbols are updated accordingly.
♻️ Duplicate comments (1)
api/pkg/api/handler/util/common/common.go (1)
966-967:⚠️ Potential issue | 🟠 MajorCapability checks still bypass the caller transaction.
This helper delegates to
MatchInstanceTypeCapabilitiesForMachines(...), which reads capability rows withniltx. That reopens the same TOCTOU window the PR is trying to close:inactiveDevicescan change after this check and before the create/update flow reaches CORE. Please thread the caller transaction through this helper and the downstream DAO reads.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/util/common/common.go` around lines 966 - 967, VerifyInstanceTypeMachineCapabilitiesMatch currently calls MatchInstanceTypeCapabilitiesForMachines which performs DAO reads with a nil transaction, allowing a TOCTOU race; update VerifyInstanceTypeMachineCapabilitiesMatch to accept and propagate the caller transaction (e.g., pass dbSession.Tx or a *sql.Tx/tx interface) into MatchInstanceTypeCapabilitiesForMachines and update that function and any downstream DAO read helpers to use the provided tx instead of nil so capability rows are read within the caller's transaction context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pkg/api/handler/util/common/common_test.go`:
- Around line 897-920: Add a test fixture pair that mirrors the existing
capability-match cases but where the only differing field is InactiveDevices to
exercise the inactiveDevices-only mismatch path: create a new instance type via
testCommonBuildInstanceType (e.g., instCapInactiveMismatch), register a machine
bound to it with testCommonBuildMachine (e.g., mcInactiveBad) and link with
testCommonBuildMachineInstanceType, then call TestBuildMachineCapability
twice—once for the instance type capability (nil machine ID,
&instCapInactiveMismatch.ID) and once for the machine capability
(&mcInactiveBad.ID, nil) ensuring all capability fields (Type, Name, Vendor,
DeviceType, Count, etc.) match except InactiveDevices which differs—this will
validate VerifyInstanceTypeMachineCapabilitiesMatch handles inactiveDevices-only
mismatches; follow the same pattern used for
instCapPair/instCapSingleBad/instCapTwoBad and reuse the helper names in that
block.
---
Outside diff comments:
In `@api/pkg/api/handler/util/common/common.go`:
- Around line 332-344: The current randomization of the machines slice
(rand.Shuffle) makes capability-aware selection non-deterministic; remove the
shuffle and instead enforce a stable deterministic ordering of the machines
slice (e.g., sort by a stable key such as machine ID or creation timestamp)
before iterating. In GetUnallocatedMachineForInstanceType, iterate the
deterministically-ordered candidates and for each candidate call
common.ResolveInstanceTypeMachineCapabilitiesMatch; select the first machine
that returns a match and on mismatch continue to the next candidate so the
function falls back deterministically. Ensure references to the machines slice
and the GetUnallocatedMachineForInstanceType and
ResolveInstanceTypeMachineCapabilitiesMatch symbols are updated accordingly.
---
Duplicate comments:
In `@api/pkg/api/handler/util/common/common.go`:
- Around line 966-967: VerifyInstanceTypeMachineCapabilitiesMatch currently
calls MatchInstanceTypeCapabilitiesForMachines which performs DAO reads with a
nil transaction, allowing a TOCTOU race; update
VerifyInstanceTypeMachineCapabilitiesMatch to accept and propagate the caller
transaction (e.g., pass dbSession.Tx or a *sql.Tx/tx interface) into
MatchInstanceTypeCapabilitiesForMachines and update that function and any
downstream DAO read helpers to use the provided tx instead of nil so capability
rows are read within the caller's transaction context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ac4f5ff8-326a-41c0-8233-3139a8620dff
📒 Files selected for processing (5)
api/pkg/api/handler/instance.goapi/pkg/api/handler/instance_test.goapi/pkg/api/handler/util/common/common.goapi/pkg/api/handler/util/common/common_test.goapi/pkg/api/model/instance.go
🚧 Files skipped from review as they are similar to previous changes (2)
- api/pkg/api/handler/instance_test.go
- api/pkg/api/handler/instance.go
| // Instance types + machines for VerifyInstanceTypeMachineCapabilitiesMatch in GetUnallocatedMachineForInstanceType | ||
| // (skip machines that fail the check while more candidates exist; return API error on last failure). | ||
| instCapPair := testCommonBuildInstanceType(t, dbSession, "it-cap-pair", site1, ip, tnuser) | ||
| TestBuildMachineCapability(t, dbSession, nil, &instCapPair.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
| mcCapBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapPair.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady) | ||
| testCommonBuildMachineInstanceType(t, dbSession, mcCapBad.ID, instCapPair.ID) | ||
| TestBuildMachineCapability(t, dbSession, &mcCapBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(2), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
| mcCapGood := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapPair.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady) | ||
| testCommonBuildMachineInstanceType(t, dbSession, mcCapGood.ID, instCapPair.ID) | ||
| TestBuildMachineCapability(t, dbSession, &mcCapGood.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
|
|
||
| instCapSingleBad := testCommonBuildInstanceType(t, dbSession, "it-cap-single-bad", site1, ip, tnuser) | ||
| TestBuildMachineCapability(t, dbSession, nil, &instCapSingleBad.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-SINGLE", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
| mcCapOnlyBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapSingleBad.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady) | ||
| testCommonBuildMachineInstanceType(t, dbSession, mcCapOnlyBad.ID, instCapSingleBad.ID) | ||
| TestBuildMachineCapability(t, dbSession, &mcCapOnlyBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-SINGLE", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(1), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
|
|
||
| instCapTwoBad := testCommonBuildInstanceType(t, dbSession, "it-cap-two-bad", site1, ip, tnuser) | ||
| TestBuildMachineCapability(t, dbSession, nil, &instCapTwoBad.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-TWO", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
| for range 2 { | ||
| mcTwoBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapTwoBad.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady) | ||
| testCommonBuildMachineInstanceType(t, dbSession, mcTwoBad.ID, instCapTwoBad.ID) | ||
| TestBuildMachineCapability(t, dbSession, &mcTwoBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-TWO", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(2), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil) | ||
| } |
There was a problem hiding this comment.
Add an inactiveDevices-only mismatch case here.
These new fixtures only prove the fallback logic for a Count mismatch. The regression in the PR description is specifically about inactiveDevices, so this suite should also include a case where all other capability fields match and only InactiveDevices differs.
Also applies to: 922-954
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/util/common/common_test.go` around lines 897 - 920, Add a
test fixture pair that mirrors the existing capability-match cases but where the
only differing field is InactiveDevices to exercise the inactiveDevices-only
mismatch path: create a new instance type via testCommonBuildInstanceType (e.g.,
instCapInactiveMismatch), register a machine bound to it with
testCommonBuildMachine (e.g., mcInactiveBad) and link with
testCommonBuildMachineInstanceType, then call TestBuildMachineCapability
twice—once for the instance type capability (nil machine ID,
&instCapInactiveMismatch.ID) and once for the machine capability
(&mcInactiveBad.ID, nil) ensuring all capability fields (Type, Name, Vendor,
DeviceType, Count, etc.) match except InactiveDevices which differs—this will
validate VerifyInstanceTypeMachineCapabilitiesMatch handles inactiveDevices-only
mismatches; follow the same pattern used for
instCapPair/instCapSingleBad/instCapTwoBad and reuse the helper names in that
block.
281c5e5 to
8807b32
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/pkg/api/handler/util/common/common.go (1)
346-394:⚠️ Potential issue | 🟠 MajorSeparate capability mismatches from operational failures.
lastCapErrcurrently records every API error coming back from capability validation and can be returned after later candidates failed for unrelated reasons such as lock contention, refresh failure, or update failure. That makes a transient backend problem look like a user-facing 400, and it can also mask a real 5xx from the capability read path. Please only retain the explicit “capabilities do not match” case here, and return immediately for operational/API failures from the validator.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/util/common/common.go` around lines 346 - 394, The loop currently stores every API error from VerifyInstanceTypeMachineCapabilitiesMatch into lastCapErr and continues, which conflates capability-mismatch (client) errors with operational/server errors; change the handling of the return value from VerifyInstanceTypeMachineCapabilitiesMatch(ctx, logger, dbSession, instancetype.ID, mc.ID) so that you only set lastCapErr and continue when the error represents an explicit capability-mismatch (e.g., a client/400 or specific capability error code/flag on the returned *cutil.APIError), but for any other API/operational failures (server errors, timeouts, unexpected APIError kinds) return immediately (return nil, apiErr, nil); keep the rest of the loop (locks via tx.TryAcquireAdvisoryLock, re-fetch via mcDAO.GetByID, and update via mcDAO.Update) unchanged.
♻️ Duplicate comments (2)
api/pkg/api/handler/util/common/common_test.go (1)
897-920:⚠️ Potential issue | 🟡 MinorAdd an
InactiveDevices-only mismatch case.The new fixtures only cover count-based mismatches, but the regression described in the PR is specifically
inactiveDeviceschanging after submission. Please add a case where every other capability field matches and onlyInactiveDevicesdiffers, then assert the same rejection path.Also applies to: 922-954
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/util/common/common_test.go` around lines 897 - 920, Add a new test fixture similar to instCapSingleBad / mcCapOnlyBad but where all capability fields built by TestBuildMachineCapability match except InactiveDevices differs (e.g., required InactiveDevices X on instance type and actual machine capability has a different InactiveDevices Y). Use testCommonBuildInstanceType to create the instance type, testCommonBuildMachine to create the machine, testCommonBuildMachineInstanceType to bind them, and TestBuildMachineCapability to set the differing InactiveDevices on the machine capability; then assert the same rejection path as the other mismatch cases (the GetUnallocatedMachineForInstanceType flow that skips machines and returns API error on last failure).api/pkg/api/handler/util/common/common.go (1)
966-967:⚠️ Potential issue | 🟠 MajorCapability validation still runs outside the caller’s transaction.
This helper still cannot join the allocation transaction, so
MatchInstanceTypeCapabilitiesForMachinesperforms its reads withniltx while machine selection is happening under a separate lock. That leaves the same capability TOCTOU window this change is intended to close.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/util/common/common.go` around lines 966 - 967, VerifyInstanceTypeMachineCapabilitiesMatch currently calls MatchInstanceTypeCapabilitiesForMachines without joining the caller's allocation transaction, causing capability checks to run outside the transactional context; change the helper to accept and use the caller's transaction/session (e.g., a transactional *cdb.Session or tx parameter) and pass that transactional context into MatchInstanceTypeCapabilitiesForMachines so the capability reads occur within the same transaction/lock as allocation; update function signature of VerifyInstanceTypeMachineCapabilitiesMatch and all callers accordingly and ensure MatchInstanceTypeCapabilitiesForMachines reads use the provided transaction/session.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pkg/api/handler/util/common/common_test.go`:
- Around line 938-942: The test case "success when one machine fails capability
match but another machine matches" currently only asserts no error; update it to
also assert that the returned machine ID equals mcCapGood.ID (from instCapPair
scenario) to ensure mcCapBad was skipped despite candidate shuffle; locate the
call under test (the function that returns the selected machine in this test)
and add an assertion comparing the returned machine's ID to mcCapGood.ID. Apply
the same change to the similar cases around lines 958-970 that use instCapPair
and mcCapBad/mcCapGood to verify selection rather than just success.
---
Outside diff comments:
In `@api/pkg/api/handler/util/common/common.go`:
- Around line 346-394: The loop currently stores every API error from
VerifyInstanceTypeMachineCapabilitiesMatch into lastCapErr and continues, which
conflates capability-mismatch (client) errors with operational/server errors;
change the handling of the return value from
VerifyInstanceTypeMachineCapabilitiesMatch(ctx, logger, dbSession,
instancetype.ID, mc.ID) so that you only set lastCapErr and continue when the
error represents an explicit capability-mismatch (e.g., a client/400 or specific
capability error code/flag on the returned *cutil.APIError), but for any other
API/operational failures (server errors, timeouts, unexpected APIError kinds)
return immediately (return nil, apiErr, nil); keep the rest of the loop (locks
via tx.TryAcquireAdvisoryLock, re-fetch via mcDAO.GetByID, and update via
mcDAO.Update) unchanged.
---
Duplicate comments:
In `@api/pkg/api/handler/util/common/common_test.go`:
- Around line 897-920: Add a new test fixture similar to instCapSingleBad /
mcCapOnlyBad but where all capability fields built by TestBuildMachineCapability
match except InactiveDevices differs (e.g., required InactiveDevices X on
instance type and actual machine capability has a different InactiveDevices Y).
Use testCommonBuildInstanceType to create the instance type,
testCommonBuildMachine to create the machine, testCommonBuildMachineInstanceType
to bind them, and TestBuildMachineCapability to set the differing
InactiveDevices on the machine capability; then assert the same rejection path
as the other mismatch cases (the GetUnallocatedMachineForInstanceType flow that
skips machines and returns API error on last failure).
In `@api/pkg/api/handler/util/common/common.go`:
- Around line 966-967: VerifyInstanceTypeMachineCapabilitiesMatch currently
calls MatchInstanceTypeCapabilitiesForMachines without joining the caller's
allocation transaction, causing capability checks to run outside the
transactional context; change the helper to accept and use the caller's
transaction/session (e.g., a transactional *cdb.Session or tx parameter) and
pass that transactional context into MatchInstanceTypeCapabilitiesForMachines so
the capability reads occur within the same transaction/lock as allocation;
update function signature of VerifyInstanceTypeMachineCapabilitiesMatch and all
callers accordingly and ensure MatchInstanceTypeCapabilitiesForMachines reads
use the provided transaction/session.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9298b89a-77b3-4ea4-978f-fcbeef7292fe
📒 Files selected for processing (5)
api/pkg/api/handler/instance.goapi/pkg/api/handler/instance_test.goapi/pkg/api/handler/util/common/common.goapi/pkg/api/handler/util/common/common_test.goapi/pkg/api/model/instance.go
🚧 Files skipped from review as they are similar to previous changes (2)
- api/pkg/api/handler/instance.go
- api/pkg/api/handler/instance_test.go
| { | ||
| name: "success when one machine fails capability match but another machine matches", | ||
| instancetype: instCapPair, | ||
| expectErr: false, | ||
| }, |
There was a problem hiding this comment.
Assert the selected machine, not just success.
This case only checks that the call succeeds. Because candidate order is shuffled, the test still passes if mcCapBad is allocated and the skip logic regresses. Please assert that the returned machine is mcCapGood.ID so the test proves the mismatching candidate was actually skipped.
Also applies to: 958-970
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/util/common/common_test.go` around lines 938 - 942, The
test case "success when one machine fails capability match but another machine
matches" currently only asserts no error; update it to also assert that the
returned machine ID equals mcCapGood.ID (from instCapPair scenario) to ensure
mcCapBad was skipped despite candidate shuffle; locate the call under test (the
function that returns the selected machine in this test) and add an assertion
comparing the returned machine's ID to mcCapGood.ID. Apply the same change to
the similar cases around lines 958-970 that use instCapPair and
mcCapBad/mcCapGood to verify selection rather than just success.
8807b32 to
78a740e
Compare
… creating or updating Instance
78a740e to
ce9c002
Compare
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
|
/ok test 736bfbe |
|
We don't want to block the user unless they've specified a requirement that we can't meet with any of the Machines in the pool. So it won't be a check of whether the Machine is compliant with Instance Type, rather whether the Machine can satisfy the user request. |
|
Thanks @thossain-nv. Since the user will be blocked if no machine matches the InstanceType capabilities, we can hold off merging this PR. |
Description
Root Cause
Recommendation / Fix to prevent this issue:
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes