feat: Capabilities should match between InstanceType and Machine when creating or updating Instance #387

Open
hwadekar-nv wants to merge 8 commits into main from feat/instance-type-machine-cap

@hwadekar-nv
Contributor

@hwadekar-nv hwadekar-nv commented Apr 14, 2026

Description

Root Cause

  • In the InfiniBand scenario, inventory updates modified the inactiveDevices field in the Machine capabilities after the request was submitted.
  • The instance creation request attempted to provision InfiniBand interfaces on those (now inactive) ports, leading to a mismatch.
  • As a result, CORE discarded the request and raised an error.

Recommendation / Fix to prevent this issue:

  • Ensure Machine capabilities and InstanceType capabilities are validated and aligned before sending the request to CORE.
  • Specifically, validate fields like inactiveDevices to confirm required interfaces are active and usable.
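
A minimal Go sketch of the kind of pre-flight check described above. The `MachineCapability` and `InterfaceRequest` types and the `validateActiveDevices` helper are hypothetical and are not the project's real models; the sketch also assumes that `inactiveDevices` holds zero-based device indices.

```go
package main

import "fmt"

// MachineCapability is a hypothetical, trimmed-down view of a Machine
// capability row; only the fields needed for this sketch are included.
type MachineCapability struct {
	Name            string
	Count           int
	InactiveDevices []int // assumed: zero-based indices reported inactive by inventory
}

// InterfaceRequest is a hypothetical InfiniBand interface request that
// targets one device index on a named capability.
type InterfaceRequest struct {
	Device      string
	DeviceIndex int
}

// validateActiveDevices returns an error for the first requested interface
// that points at an unknown device, an out-of-range index, or an index the
// Machine currently lists as inactive. It returns nil when every request is usable.
func validateActiveDevices(caps []MachineCapability, reqs []InterfaceRequest) error {
	byName := make(map[string]MachineCapability, len(caps))
	for _, c := range caps {
		byName[c.Name] = c
	}
	for _, r := range reqs {
		c, ok := byName[r.Device]
		if !ok {
			return fmt.Errorf("device %q is not a capability of the machine", r.Device)
		}
		if r.DeviceIndex < 0 || r.DeviceIndex >= c.Count {
			return fmt.Errorf("device %q index %d is out of range (count %d)", r.Device, r.DeviceIndex, c.Count)
		}
		for _, inactive := range c.InactiveDevices {
			if inactive == r.DeviceIndex {
				return fmt.Errorf("device %q index %d is inactive", r.Device, r.DeviceIndex)
			}
		}
	}
	return nil
}
```

Running a check like this in the API layer would let the request be rejected with a clear message before it reaches CORE, instead of being discarded there.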

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hwadekar-nv hwadekar-nv self-assigned this Apr 14, 2026
@hwadekar-nv hwadekar-nv requested a review from a team as a code owner April 14, 2026 19:08
@coderabbitai
Contributor

coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Create and Update instance flows now perform capability-compatibility checks between the selected Machine and the InstanceType. GetUnallocatedMachineForInstanceType was changed to accept a logger and may return an API error when no candidate machine matches capabilities. Update requests skip capability validation for metadata-only updates.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Instance handlers: api/pkg/api/handler/instance.go | Create handler performs early VerifyInstanceTypeMachineCapabilitiesMatch(...) when a specific Machine (with InstanceTypeID) is supplied; call to GetUnallocatedMachineForInstanceType updated to pass logger and handle returned *cutil.APIError. Update handler invokes capability validation only when apiRequest.NeedsCapabilityValidation() is true and both instance.InstanceTypeID and machine exist; failures return the API error. |
| Handler tests: api/pkg/api/handler/instance_test.go | Added assertMachineInstanceTypeAssociation and testAddMachineCapabilitiesMatchingIST1. Expanded fixtures to cover matching, mismatching, and missing capability scenarios; strengthened assertions on MachineInstanceType rows; added PATCH JSON override support to force explicit interface payloads; added update error cases for capability mismatches and inactive IB device index. |
| Utility capability matcher: api/pkg/api/handler/util/common/common.go | GetUnallocatedMachineForInstanceType signature changed to accept logger zerolog.Logger and return (*cdbm.Machine, *cutil.APIError, error). Candidate machines failing capability checks are skipped and the last capability-related *cutil.APIError is retained and returned if no machine is allocatable. Added exported VerifyInstanceTypeMachineCapabilitiesMatch(...) wrapper that calls the matcher and returns appropriate *cutil.APIError for mismatches. |
| Utility tests: api/pkg/api/handler/util/common/common_test.go | Updated tests to pass a zerolog logger and assert both regular error and returned apiErr. Added fixtures for scenarios where some or all candidate machines fail capability matching and assert API error content when appropriate. |
| Instance model: api/pkg/api/model/instance.go | Added (*APIInstanceUpdateRequest) NeedsCapabilityValidation() bool to indicate when updates require machine/instance-type capability validation (interface changes or SecondaryVpcIDs present). |
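
The gating described for NeedsCapabilityValidation() could look roughly like the sketch below. `InstanceUpdateRequest` and `InterfaceUpdate` are hypothetical, reduced stand-ins for the real API model, and the exact rule (any interfaces or any SecondaryVpcIDs in the payload) is an assumption based only on the summary above.

```go
package main

// InterfaceUpdate is a hypothetical stand-in for one interface entry in an
// update payload.
type InterfaceUpdate struct {
	SubnetID string
}

// InstanceUpdateRequest is a hypothetical, reduced update request. Pointer,
// map, and slice fields are nil when the client omitted them.
type InstanceUpdateRequest struct {
	Name            *string
	Description     *string
	Labels          map[string]string
	Interfaces      []InterfaceUpdate
	SecondaryVpcIDs []string
}

// NeedsCapabilityValidation reports whether the update touches anything that
// depends on Machine hardware. Metadata-only updates (name, description,
// labels) return false so the capability check can be skipped.
func (r *InstanceUpdateRequest) NeedsCapabilityValidation() bool {
	if r == nil {
		return false
	}
	return len(r.Interfaces) > 0 || len(r.SecondaryVpcIDs) > 0
}
```

A handler would call this once, right after binding the request, and only run the Machine/InstanceType comparison when it returns true.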

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Handler as API Handler
  participant Common as util/common
  participant DB as Database
  Client->>Handler: Create/Update instance request
  Handler->>DB: load InstanceType, load Machine (if specified)
  Handler->>Common: VerifyInstanceTypeMachineCapabilitiesMatch(ctx, logger, dbSession, InstanceTypeID, MachineID)
  Common->>DB: MatchInstanceTypeCapabilitiesForMachines(...)
  DB-->>Common: capabilities match / mismatch
  alt capabilities match
    Common-->>Handler: nil
    Handler->>DB: proceed to allocate/update instance
    DB-->>Handler: success
    Handler-->>Client: 200/201 OK
  else capabilities mismatch or matcher error
    Common-->>Handler: *cutil.APIError (400 or matcher error)
    Handler-->>Client: error response (propagated API error)
  end
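
The match/mismatch branches in the diagram can be sketched in Go as follows. `APIError` stands in for the project's *cutil.APIError, and `matcherFunc` is a hypothetical signature loosely modeled on the MatchInstanceTypeCapabilitiesForMachines call quoted in the review; neither is the real definition.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError is a hypothetical stand-in for the project's *cutil.APIError.
type APIError struct {
	Code    int
	Message string
}

// matcherFunc is a hypothetical signature for the underlying matcher: it
// reports whether all given machines satisfy the instance type, or returns
// an API error if the lookup itself failed.
type matcherFunc func(instanceTypeID string, machineIDs []string) (bool, *APIError)

// verifyCapabilitiesMatch returns nil when the machine matches, the matcher's
// own error when the lookup failed, and a 400 error on a plain mismatch.
func verifyCapabilitiesMatch(match matcherFunc, instanceTypeID, machineID string) *APIError {
	ok, apiErr := match(instanceTypeID, []string{machineID})
	if apiErr != nil {
		return apiErr
	}
	if !ok {
		return &APIError{
			Code:    http.StatusBadRequest,
			Message: fmt.Sprintf("Capabilities for Machine: %v do not match Instance Type's Capabilities", machineID),
		}
	}
	return nil
}
```

Keeping the mapping from matcher result to API error in one wrapper means the create and update handlers can share it and simply propagate whatever it returns.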

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately captures the primary change: introducing capability validation between InstanceType and Machine during instance creation and updates. |
| Description check | ✅ Passed | The description clearly relates to the changeset, explaining the root cause (InfiniBand capability mismatch) and the fix (capability validation before CORE processing). |

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-04-14 19:08:53 UTC | Commit: 3832749

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
api/pkg/api/handler/instance.go (1)

2485-2496: Consider relocating capability validation earlier in the update flow.

The capability check is currently positioned after extensive interface validation (lines 2271–2481). If the capability mismatch is detected, all preceding validation work is discarded. Moving this check immediately after the machine assignment at line 2174 would provide a fail-fast behavior, reducing unnecessary computation for requests that will ultimately be rejected.

Additionally, this validation executes for all update requests where instance.InstanceTypeID and machine are both non-nil, including metadata-only updates (name, description, labels). If capability drift between Instance Type and Machine is not expected post-creation, consider whether this check is warranted on every update or only when specific fields (e.g., interfaces) are modified.

♻️ Suggested relocation for fail-fast behavior

Move the capability check to immediately after line 2174 where machine is assigned:

 tenant := instance.Tenant
 site := instance.Site
 vpc := instance.Vpc
 machine := instance.Machine

+// Verify here if Instance Type and Machine capabilities match
+if instance.InstanceTypeID != nil && machine != nil {
+    isMatch, _, apiErr := common.MatchInstanceTypeCapabilitiesForMachines(ctx, logger, uih.dbSession, *instance.InstanceTypeID, []string{machine.ID})
+    if apiErr != nil {
+        return cutil.NewAPIErrorResponse(c, apiErr.Code, apiErr.Message, apiErr.Data)
+    }
+
+    if !isMatch {
+        return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("Capabilities for Machine: %v do not match Instance Type's Capabilities", machine.ID), nil)
+    }
+}
+
 // Confirm that the Instance's org matches the org sent in the request

Then remove the duplicate block at lines 2485–2496.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 2485 - 2496, Relocate the
capability validation (call to common.MatchInstanceTypeCapabilitiesForMachines)
to run immediately after the machine is resolved (after the assignment to
machine) so it fails fast; ensure you keep the same parameters (ctx, logger,
uih.dbSession, *instance.InstanceTypeID, []string{machine.ID}) and return the
same API error responses on failure. Also gate this check so it only runs when
it matters (e.g., when InstanceTypeID or machine or interface-related fields are
being changed in the update request) rather than for metadata-only updates, and
remove the duplicate block currently present later in the function (the block
that checks instance.InstanceTypeID != nil && machine != nil around the previous
MatchInstanceTypeCapabilitiesForMachines call).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7f16406f-ff24-400d-862d-67e5e6a8e861

📥 Commits

Reviewing files that changed from the base of the PR and between fcf7893 and 3832749.

📒 Files selected for processing (2)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go

@github-actions

github-actions Bot commented Apr 14, 2026

Test Results

6 546 tests (-2 123): 6 545 ✅ (-2 124), 0 💤 (±0), 0 ❌ (±0), 1 🔥 (+1)
144 suites (+1), 14 files (±0), 6m 24s ⏱️ (-2m 24s)

For more details on these errors, see this check.

Results for commit 1553775. ± Comparison against base commit d4e1638.

This pull request removes 2128 and adds 5 tests. Note that renamed tests count towards both.
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_update_IPBlock_allocation_constraint_value,_ipam_error
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_allocation_attached_to_subnet_
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_machines_not_available_for_updating_allocation_constraint
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_reqBody_doesnt_bind
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_specified_allocation_and_allocation_constraint_doesnt_match
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_specified_allocation_constraint_doesnt_exist
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_specified_allocation_constraint_id_is_invalid_uuid
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_specified_org_does_not_have_infrastructure_provider
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ TestAllocationConstraintHandler_Update/error_when_specified_org_does_not_have_infrastructure_provider_matching_the_one_in_allocation
…
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler ‑ [build failed]
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler/util/common ‑ TestGetUnallocatedMachineForInstanceType/NotAcceptable_when_every_candidate_machine_fails_capability_match
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler/util/common ‑ TestGetUnallocatedMachineForInstanceType/NotAcceptable_when_only_candidate_machine_fails_capability_match
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler/util/common ‑ TestGetUnallocatedMachineForInstanceType/error_when_instance_type_is_empty
github.com/NVIDIA/ncx-infra-controller-rest/api/pkg/api/handler/util/common ‑ TestGetUnallocatedMachineForInstanceType/success_when_one_machine_fails_capability_match_but_another_machine_matches

♻️ This comment has been updated with latest results.

@bcavnvidia
Contributor

Did the caps of a machine change after it was associated with an instance type and trigger a bug report or something?

Would be good to add more details to the description of the PR since we already compare machine caps and instance type caps when associating machines with instance types. This PR is clearly trying to catch something. I'd add the details to the description.

Comment thread api/pkg/api/handler/instance.go Outdated
@hwadekar-nv
Contributor Author

Thanks @bcavnvidia, I've updated the description. In the past we saw instance creation fail at CORE because of InactiveDevices. The fix is also useful for checking the other capabilities.

@hwadekar-nv hwadekar-nv force-pushed the feat/instance-type-machine-cap branch from 61268ac to d85ba62 Compare April 15, 2026 17:59
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
api/pkg/api/handler/instance_test.go (1)

3989-3993: Extract repeated capability seeding blocks to local helpers.

Lines 3989-3993, 3999-4001, 4075-4077, and 4141-4143 duplicate nearly identical capability inserts, which increases fixture drift risk.

♻️ Suggested refactor
+addNetworkDPUCap := func(m *cdbm.Machine) {
+	common.TestBuildMachineCapability(t, dbSession, &m.ID, nil, cdbm.MachineCapabilityTypeNetwork,
+		"MT42822 BlueField-2 integrated ConnectX-6 Dx network controller", nil, nil,
+		cdb.GetStrPtr("Mellanox Technologies"), cdb.GetIntPtr(2), cdb.GetStrPtr("DPU"), nil)
+}
+addNVLinkGPUCap := func(m *cdbm.Machine) {
+	common.TestBuildMachineCapability(t, dbSession, &m.ID, nil, cdbm.MachineCapabilityTypeGPU,
+		"NVIDIA GB200", nil, nil,
+		cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
+}
+
+for _, m := range []*cdbm.Machine{mc5, mc7, mc8, mc9} {
+	addNetworkDPUCap(m)
+	addNVLinkGPUCap(m)
+}

As per coding guidelines: "**/*.go: Review the Go code, point out issues relative to principles of clean code, expressiveness, and performance."

Also applies to: 3999-4001, 4075-4077, 4141-4143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance_test.go` around lines 3989 - 3993, The test
duplicates repeated TestBuildMachineCapability calls for multiple machine IDs
(e.g., mc5.ID, mc7.ID, mc8.ID, mc9.ID) with identical parameters; extract these
into a local helper in instance_test.go such as seedNetworkCapability(dbSession,
machineID) or seedMachineCapabilities(dbSession, []machineIDs) that calls
TestBuildMachineCapability with the common args
(cdbm.MachineCapabilityTypeNetwork, "MT42822 BlueField-2 integrated ConnectX-6
Dx network controller", cdb.GetStrPtr("Mellanox Technologies"),
cdb.GetIntPtr(2), cdb.GetStrPtr("DPU"), etc.), then replace the repeated direct
TestBuildMachineCapability invocations (the blocks around mc5.ID, mc7.ID,
mc8.ID, mc9.ID and the similar groups at the other locations) with calls to this
helper to reduce duplication and fixture drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 3402-3463: Add explicit table-driven test cases in
instance_test.go to cover the inactiveDevices mismatch path: add one
APIInstanceCreateRequest case and one APIInstanceUpdateRequest case where the
Machine under test has an inactiveDevices list containing the interface
index(es) referenced by the request's Interfaces; use the same test table
patterns as the existing create/update tests (the test structs using
fields{dbSession, tc, cfg} and args{reqData/reqUpdate, reqMachine, reqOrg,
reqUser, respCode, respMessage}), set the Machine fixture's inactiveDevices to
include the target index, and assert the handler rejects the request with the
expected HTTP error and message indicating requested device indices are inactive
(use the same respCode/respMessage style as the other cases so verification and
verifyChildSpanner logic still applies).

In `@api/pkg/api/handler/instance.go`:
- Around line 1037-1042: The current check calling
common.ResolveInstanceTypeMachineCapabilitiesMatch(ctx, logger, cih.dbSession,
*instanceTypeID, machine.ID) returns a hard 400 when a single machine candidate
is incompatible; instead move this capability validation into the
machine-selection loop used by GetUnallocatedMachineForInstanceType(...) so each
candidate is validated and incompatible machines are skipped (and optionally
logged) rather than failing the whole create; only return an error after the
loop if no compatible unallocated machine is found for the given instanceTypeID;
update any references to instanceTypeID and machine/ machine.ID accordingly and
remove the early return in the current pre-loop location.

In `@api/pkg/api/handler/util/common/common.go`:
- Around line 955-963: ResolveInstanceTypeMachineCapabilitiesMatch currently
cannot join the caller's transaction so MatchInstanceTypeCapabilitiesForMachines
reads capabilities with nil tx, leaving a TOCTOU window; modify
ResolveInstanceTypeMachineCapabilitiesMatch signature to accept a
transaction/lock parameter (e.g., tx *cdb.Session or ctxTx) and pass it through
to MatchInstanceTypeCapabilitiesForMachines (and any downstream capability
reads) so the capability lookup runs under the caller's transaction/lock; ensure
all capability reads/locks inside MatchInstanceTypeCapabilitiesForMachines (or
its helpers) accept and use the forwarded tx instead of nil to guarantee the
check is performed under the same guard used for machine-capability updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4aec791b-885b-450a-8e10-ef7407e984fd

📥 Commits

Reviewing files that changed from the base of the PR and between 3832749 and 61268ac.

📒 Files selected for processing (3)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go
  • api/pkg/api/handler/util/common/common.go

Comment thread api/pkg/api/handler/instance_test.go
Comment thread api/pkg/api/handler/instance.go Outdated
Comment thread api/pkg/api/handler/util/common/common.go Outdated
@hwadekar-nv hwadekar-nv force-pushed the feat/instance-type-machine-cap branch from 39066e3 to 281c5e5 Compare April 16, 2026 00:01
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/pkg/api/handler/util/common/common.go (1)

332-344: ⚠️ Potential issue | 🟠 Major

Remove the shuffle before capability fallback.

Capability-aware auto-selection now depends on candidate order, but this still randomizes the machine list. That makes both the chosen machine and the final mismatch reported to the caller non-deterministic. Please switch this path to a stable ordering before iterating candidates.

Based on learnings, machine auto-selection must filter by capability match inside GetUnallocatedMachineForInstanceType: iterate candidates in deterministic order, call common.ResolveInstanceTypeMachineCapabilitiesMatch per machine, pick the first that passes, and fall back to the next candidate on mismatch.
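
A rough Go sketch of the selection loop described here: order candidates deterministically, skip machines that fail the check, and keep the last failure to report if nothing matches. `Machine`, `APIError`, `verifyFunc`, and `pickFirstCompatible` are hypothetical names for illustration, and sorting by machine ID is only one possible stable key.

```go
package main

import "sort"

// APIError is a hypothetical stand-in for the project's *cutil.APIError.
type APIError struct {
	Code    int
	Message string
}

// Machine is a hypothetical, reduced machine record.
type Machine struct {
	ID string
}

// verifyFunc is a hypothetical per-machine capability check; a nil result
// means the machine is compatible with the instance type.
type verifyFunc func(machineID string) *APIError

// pickFirstCompatible walks the candidates in ascending ID order and returns
// the first one that passes verify. If every candidate fails, it returns the
// last failure seen. A (nil, nil) result means there were no candidates.
func pickFirstCompatible(candidates []Machine, verify verifyFunc) (*Machine, *APIError) {
	// Sort a copy so the caller's slice is left untouched.
	ordered := make([]Machine, len(candidates))
	copy(ordered, candidates)
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].ID < ordered[j].ID })

	var lastErr *APIError
	for i := range ordered {
		if apiErr := verify(ordered[i].ID); apiErr != nil {
			lastErr = apiErr // remember the most recent mismatch and try the next machine
			continue
		}
		return &ordered[i], nil
	}
	return nil, lastErr
}
```

With a fixed order, the same inputs always yield the same machine and the same reported mismatch, which also makes the test fixtures easier to reason about.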

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common.go` around lines 332 - 344, The
current randomization of the machines slice (rand.Shuffle) makes
capability-aware selection non-deterministic; remove the shuffle and instead
enforce a stable deterministic ordering of the machines slice (e.g., sort by a
stable key such as machine ID or creation timestamp) before iterating. In
GetUnallocatedMachineForInstanceType, iterate the deterministically-ordered
candidates and for each candidate call
common.ResolveInstanceTypeMachineCapabilitiesMatch; select the first machine
that returns a match and on mismatch continue to the next candidate so the
function falls back deterministically. Ensure references to the machines slice
and the GetUnallocatedMachineForInstanceType and
ResolveInstanceTypeMachineCapabilitiesMatch symbols are updated accordingly.
♻️ Duplicate comments (1)
api/pkg/api/handler/util/common/common.go (1)

966-967: ⚠️ Potential issue | 🟠 Major

Capability checks still bypass the caller transaction.

This helper delegates to MatchInstanceTypeCapabilitiesForMachines(...), which reads capability rows with nil tx. That reopens the same TOCTOU window the PR is trying to close: inactiveDevices can change after this check and before the create/update flow reaches CORE. Please thread the caller transaction through this helper and the downstream DAO reads.
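
To illustrate why the transaction needs to be threaded through, here is a toy Go sketch in which a coarse mutex stands in for a database transaction with row locking. The `store`, `tx`, and `checkAndProvision` names are hypothetical, and a real fix would pass the project's own transaction type to the DAO reads instead.

```go
package main

import "sync"

// store is a hypothetical in-memory stand-in for the capability table.
type store struct {
	mu       sync.Mutex
	inactive map[string][]int // machine ID -> inactive device indices
}

// tx is a hypothetical transaction handle. While it is open it holds the
// store lock, so writers cannot change rows between check and use.
type tx struct {
	s *store
}

func (s *store) begin() *tx { s.mu.Lock(); return &tx{s: s} }
func (t *tx) commit()       { t.s.mu.Unlock() }

// readInactive reads under the caller's transaction when t is non-nil. With
// a nil t it locks only for the duration of the read, so the result may be
// stale by the time the caller acts on it (the TOCTOU window).
func (s *store) readInactive(t *tx, machineID string) []int {
	if t == nil {
		s.mu.Lock()
		defer s.mu.Unlock()
	}
	return append([]int(nil), s.inactive[machineID]...)
}

// setInactive models an inventory update; it blocks while a tx is open.
func (s *store) setInactive(machineID string, indices []int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inactive[machineID] = indices
}

// checkAndProvision checks the device index and runs provision inside the
// same transaction, so an inventory update cannot slip in between.
func checkAndProvision(s *store, machineID string, index int, provision func()) bool {
	t := s.begin()
	defer t.commit()
	for _, i := range s.readInactive(t, machineID) {
		if i == index {
			return false
		}
	}
	provision()
	return true
}
```

Note that this only closes the window inside the API's own data store; whether CORE sees the same state still depends on how the request is handed off.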

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common.go` around lines 966 - 967,
VerifyInstanceTypeMachineCapabilitiesMatch currently calls
MatchInstanceTypeCapabilitiesForMachines which performs DAO reads with a nil
transaction, allowing a TOCTOU race; update
VerifyInstanceTypeMachineCapabilitiesMatch to accept and propagate the caller
transaction (e.g., pass dbSession.Tx or a *sql.Tx/tx interface) into
MatchInstanceTypeCapabilitiesForMachines and update that function and any
downstream DAO read helpers to use the provided tx instead of nil so capability
rows are read within the caller's transaction context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/util/common/common_test.go`:
- Around line 897-920: Add a test fixture pair that mirrors the existing
capability-match cases but where the only differing field is InactiveDevices to
exercise the inactiveDevices-only mismatch path: create a new instance type via
testCommonBuildInstanceType (e.g., instCapInactiveMismatch), register a machine
bound to it with testCommonBuildMachine (e.g., mcInactiveBad) and link with
testCommonBuildMachineInstanceType, then call TestBuildMachineCapability
twice—once for the instance type capability (nil machine ID,
&instCapInactiveMismatch.ID) and once for the machine capability
(&mcInactiveBad.ID, nil) ensuring all capability fields (Type, Name, Vendor,
DeviceType, Count, etc.) match except InactiveDevices which differs—this will
validate VerifyInstanceTypeMachineCapabilitiesMatch handles inactiveDevices-only
mismatches; follow the same pattern used for
instCapPair/instCapSingleBad/instCapTwoBad and reuse the helper names in that
block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ac4f5ff8-326a-41c0-8233-3139a8620dff

📥 Commits

Reviewing files that changed from the base of the PR and between d85ba62 and 281c5e5.

📒 Files selected for processing (5)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go
  • api/pkg/api/handler/util/common/common.go
  • api/pkg/api/handler/util/common/common_test.go
  • api/pkg/api/model/instance.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/pkg/api/handler/instance_test.go
  • api/pkg/api/handler/instance.go

Comment on lines +897 to +920
// Instance types + machines for VerifyInstanceTypeMachineCapabilitiesMatch in GetUnallocatedMachineForInstanceType
// (skip machines that fail the check while more candidates exist; return API error on last failure).
instCapPair := testCommonBuildInstanceType(t, dbSession, "it-cap-pair", site1, ip, tnuser)
TestBuildMachineCapability(t, dbSession, nil, &instCapPair.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
mcCapBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapPair.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady)
testCommonBuildMachineInstanceType(t, dbSession, mcCapBad.ID, instCapPair.ID)
TestBuildMachineCapability(t, dbSession, &mcCapBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(2), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
mcCapGood := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapPair.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady)
testCommonBuildMachineInstanceType(t, dbSession, mcCapGood.ID, instCapPair.ID)
TestBuildMachineCapability(t, dbSession, &mcCapGood.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-UNALLOC", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)

instCapSingleBad := testCommonBuildInstanceType(t, dbSession, "it-cap-single-bad", site1, ip, tnuser)
TestBuildMachineCapability(t, dbSession, nil, &instCapSingleBad.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-SINGLE", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
mcCapOnlyBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapSingleBad.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady)
testCommonBuildMachineInstanceType(t, dbSession, mcCapOnlyBad.ID, instCapSingleBad.ID)
TestBuildMachineCapability(t, dbSession, &mcCapOnlyBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-SINGLE", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(1), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)

instCapTwoBad := testCommonBuildInstanceType(t, dbSession, "it-cap-two-bad", site1, ip, tnuser)
TestBuildMachineCapability(t, dbSession, nil, &instCapTwoBad.ID, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-TWO", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(4), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
for range 2 {
	mcTwoBad := testCommonBuildMachine(t, dbSession, ip.ID, site1.ID, cdb.GetUUIDPtr(instCapTwoBad.ID), uuid.New(), nil, nil, nil, cdbm.MachineStatusReady)
	testCommonBuildMachineInstanceType(t, dbSession, mcTwoBad.ID, instCapTwoBad.ID)
	TestBuildMachineCapability(t, dbSession, &mcTwoBad.ID, nil, cdbm.MachineCapabilityTypeGPU, "GPU-CAP-TWO", nil, nil, cdb.GetStrPtr("NVIDIA"), cdb.GetIntPtr(2), cdb.GetStrPtr(cdbm.MachineCapabilityDeviceTypeNVLink), nil)
}

⚠️ Potential issue | 🟡 Minor

Add an inactiveDevices-only mismatch case here.

These new fixtures only prove the fallback logic for a Count mismatch. The regression in the PR description is specifically about inactiveDevices, so this suite should also include a case where all other capability fields match and only InactiveDevices differs.

Also applies to: 922-954

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common_test.go` around lines 897 - 920, Add a
test fixture pair that mirrors the existing capability-match cases but where the
only differing field is InactiveDevices to exercise the inactiveDevices-only
mismatch path: create a new instance type via testCommonBuildInstanceType (e.g.,
instCapInactiveMismatch), register a machine bound to it with
testCommonBuildMachine (e.g., mcInactiveBad) and link with
testCommonBuildMachineInstanceType, then call TestBuildMachineCapability
twice—once for the instance type capability (nil machine ID,
&instCapInactiveMismatch.ID) and once for the machine capability
(&mcInactiveBad.ID, nil) ensuring all capability fields (Type, Name, Vendor,
DeviceType, Count, etc.) match except InactiveDevices which differs—this will
validate VerifyInstanceTypeMachineCapabilitiesMatch handles inactiveDevices-only
mismatches; follow the same pattern used for
instCapPair/instCapSingleBad/instCapTwoBad and reuse the helper names in that
block.
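
The requested case can be pictured with a small self-contained sketch. `Capability` and `capabilitiesMatch` below are hypothetical stand-ins, not the repository's `cdbm.MachineCapability` model or `VerifyInstanceTypeMachineCapabilitiesMatch`, and representing `InactiveDevices` as a list of device indices is an assumption. It only shows the shape of a fixture pair in which every field agrees except `InactiveDevices`.

```go
package main

import (
	"fmt"
	"sort"
)

// Capability is a hypothetical, simplified capability record.
// The field set and the []int type of InactiveDevices are assumptions.
type Capability struct {
	Type            string
	Name            string
	Vendor          string
	DeviceType      string
	Count           int
	InactiveDevices []int
}

// sortedCopy returns a sorted copy so callers' slices are never mutated.
func sortedCopy(in []int) []int {
	out := append([]int(nil), in...)
	sort.Ints(out)
	return out
}

// capabilitiesMatch reports whether two capabilities agree on every field,
// comparing InactiveDevices without regard to order.
func capabilitiesMatch(want, got Capability) bool {
	if want.Type != got.Type || want.Name != got.Name ||
		want.Vendor != got.Vendor || want.DeviceType != got.DeviceType ||
		want.Count != got.Count {
		return false
	}
	a, b := sortedCopy(want.InactiveDevices), sortedCopy(got.InactiveDevices)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	instanceTypeCap := Capability{Type: "InfiniBand", Name: "IB-CAP", Vendor: "NVIDIA", Count: 4}

	// Machine capability: identical except for InactiveDevices.
	machineCap := instanceTypeCap
	machineCap.InactiveDevices = []int{1, 3}

	fmt.Println("identical:", capabilitiesMatch(instanceTypeCap, instanceTypeCap))
	fmt.Println("inactiveDevices differs:", capabilitiesMatch(instanceTypeCap, machineCap))
}
```

A fixture built this way isolates the `InactiveDevices` comparison: if the matcher ignored that field, the second comparison would report a match and the test would fail.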

@hwadekar-nv hwadekar-nv force-pushed the feat/instance-type-machine-cap branch from 281c5e5 to 8807b32 Compare April 16, 2026 17:10
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/pkg/api/handler/util/common/common.go (1)

346-394: ⚠️ Potential issue | 🟠 Major

Separate capability mismatches from operational failures.

lastCapErr currently records every API error coming back from capability validation and can be returned after later candidates failed for unrelated reasons such as lock contention, refresh failure, or update failure. That makes a transient backend problem look like a user-facing 400, and it can also mask a real 5xx from the capability read path. Please only retain the explicit “capabilities do not match” case here, and return immediately for operational/API failures from the validator.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common.go` around lines 346 - 394, The loop
currently stores every API error from VerifyInstanceTypeMachineCapabilitiesMatch
into lastCapErr and continues, which conflates capability-mismatch (client)
errors with operational/server errors; change the handling of the return value
from VerifyInstanceTypeMachineCapabilitiesMatch(ctx, logger, dbSession,
instancetype.ID, mc.ID) so that you only set lastCapErr and continue when the
error represents an explicit capability-mismatch (e.g., a client/400 or specific
capability error code/flag on the returned *cutil.APIError), but for any other
API/operational failures (server errors, timeouts, unexpected APIError kinds)
return immediately (return nil, apiErr, nil); keep the rest of the loop (locks
via tx.TryAcquireAdvisoryLock, re-fetch via mcDAO.GetByID, and update via
mcDAO.Update) unchanged.
♻️ Duplicate comments (2)
api/pkg/api/handler/util/common/common_test.go (1)

897-920: ⚠️ Potential issue | 🟡 Minor

Add an InactiveDevices-only mismatch case.

The new fixtures only cover count-based mismatches, but the regression described in the PR is specifically inactiveDevices changing after submission. Please add a case where every other capability field matches and only InactiveDevices differs, then assert the same rejection path.

Also applies to: 922-954

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common_test.go` around lines 897 - 920, Add a
new test fixture similar to instCapSingleBad / mcCapOnlyBad but where all
capability fields built by TestBuildMachineCapability match except
InactiveDevices differs (e.g., required InactiveDevices X on instance type and
actual machine capability has a different InactiveDevices Y). Use
testCommonBuildInstanceType to create the instance type, testCommonBuildMachine
to create the machine, testCommonBuildMachineInstanceType to bind them, and
TestBuildMachineCapability to set the differing InactiveDevices on the machine
capability; then assert the same rejection path as the other mismatch cases (the
GetUnallocatedMachineForInstanceType flow that skips machines and returns API
error on last failure).
api/pkg/api/handler/util/common/common.go (1)

966-967: ⚠️ Potential issue | 🟠 Major

Capability validation still runs outside the caller’s transaction.

This helper still cannot join the allocation transaction, so MatchInstanceTypeCapabilitiesForMachines performs its reads with nil tx while machine selection is happening under a separate lock. That leaves the same capability TOCTOU window this change is intended to close.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common.go` around lines 966 - 967,
VerifyInstanceTypeMachineCapabilitiesMatch currently calls
MatchInstanceTypeCapabilitiesForMachines without joining the caller's allocation
transaction, causing capability checks to run outside the transactional context;
change the helper to accept and use the caller's transaction/session (e.g., a
transactional *cdb.Session or tx parameter) and pass that transactional context
into MatchInstanceTypeCapabilitiesForMachines so the capability reads occur
within the same transaction/lock as allocation; update function signature of
VerifyInstanceTypeMachineCapabilitiesMatch and all callers accordingly and
ensure MatchInstanceTypeCapabilitiesForMachines reads use the provided
transaction/session.
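
One way to picture the suggested signature change is a helper that receives its reader from the caller instead of opening its own. Everything in this sketch is hypothetical (`capabilityReader`, `memSession`, `memTx`, `verifyCapabilities`), and the snapshot taken at `Begin` is only a toy model of transactional isolation, not how `cdb.Session` or the advisory lock behaves.

```go
package main

import "fmt"

// capabilityReader is a hypothetical abstraction satisfied by both a plain
// session and an open transaction.
type capabilityReader interface {
	GPUCount(machineID string) (int, error)
}

// memSession is an in-memory stand-in for a database session.
type memSession struct{ gpu map[string]int }

func (s *memSession) GPUCount(id string) (int, error) {
	n, ok := s.gpu[id]
	if !ok {
		return 0, fmt.Errorf("machine %q not found", id)
	}
	return n, nil
}

// memTx is a toy transaction that reads from a snapshot taken at Begin,
// so later writes to the session are not visible through it.
type memTx struct{ snapshot map[string]int }

func (s *memSession) Begin() *memTx {
	snap := make(map[string]int, len(s.gpu))
	for k, v := range s.gpu {
		snap[k] = v
	}
	return &memTx{snapshot: snap}
}

func (t *memTx) GPUCount(id string) (int, error) {
	n, ok := t.snapshot[id]
	if !ok {
		return 0, fmt.Errorf("machine %q not found", id)
	}
	return n, nil
}

// verifyCapabilities reads through whatever reader the caller passes in,
// so the check can share the caller's transaction.
func verifyCapabilities(r capabilityReader, machineID string, wantGPUs int) (bool, error) {
	got, err := r.GPUCount(machineID)
	if err != nil {
		return false, err
	}
	return got == wantGPUs, nil
}

func main() {
	sess := &memSession{gpu: map[string]int{"m1": 4}}
	txn := sess.Begin()
	sess.gpu["m1"] = 2 // inventory update arriving after the transaction began

	inTx, _ := verifyCapabilities(txn, "m1", 4)
	outside, _ := verifyCapabilities(sess, "m1", 4)
	fmt.Println("inside tx:", inTx, "outside tx:", outside)
}
```

The two calls disagree because they read different views of the data, which is the window the review comment describes when the validator is called with a `nil` transaction.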
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9298b89a-77b3-4ea4-978f-fcbeef7292fe

📥 Commits

Reviewing files that changed from the base of the PR and between 281c5e5 and 8807b32.

📒 Files selected for processing (5)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go
  • api/pkg/api/handler/util/common/common.go
  • api/pkg/api/handler/util/common/common_test.go
  • api/pkg/api/model/instance.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go

Comment on lines +938 to +942
{
	name:         "success when one machine fails capability match but another machine matches",
	instancetype: instCapPair,
	expectErr:    false,
},

⚠️ Potential issue | 🟡 Minor

Assert the selected machine, not just success.

This case only checks that the call succeeds. Because candidate order is shuffled, the test still passes if mcCapBad is allocated and the skip logic regresses. Please assert that the returned machine is mcCapGood.ID so the test proves the mismatching candidate was actually skipped.

Also applies to: 958-970

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/util/common/common_test.go` around lines 938 - 942, The
test case "success when one machine fails capability match but another machine
matches" currently only asserts no error; update it to also assert that the
returned machine ID equals mcCapGood.ID (from instCapPair scenario) to ensure
mcCapBad was skipped despite candidate shuffle; locate the call under test (the
function that returns the selected machine in this test) and add an assertion
comparing the returned machine's ID to mcCapGood.ID. Apply the same change to
the similar cases around lines 958-970 that use instCapPair and
mcCapBad/mcCapGood to verify selection rather than just success.

@hwadekar-nv hwadekar-nv force-pushed the feat/instance-type-machine-cap branch from 8807b32 to 78a740e Compare April 17, 2026 19:55
@hwadekar-nv hwadekar-nv force-pushed the feat/instance-type-machine-cap branch from 78a740e to ce9c002 Compare April 20, 2026 16:28
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hwadekar-nv
Contributor Author

/ok test 736bfbe

@thossain-nv
Contributor

We don't want to block the user unless they've specified a requirement that we can't meet with any of the Machines in the pool. So the check should not be whether the Machine is compliant with the Instance Type, but rather whether the Machine can satisfy the user's request.
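
One possible reading of that distinction, as a sketch only: the check is driven by what the user asked for, and the request is rejected only when no Machine in the pool can serve it. `ibCapability`, `activeCount`, and `poolCanSatisfy` are hypothetical names, and modelling a request as a number of InfiniBand interfaces is an assumption.

```go
package main

import "fmt"

// ibCapability is a hypothetical, simplified view of a machine's InfiniBand
// capability. InactiveDevices is assumed to hold distinct device indices.
type ibCapability struct {
	Count           int
	InactiveDevices []int
}

// activeCount returns how many devices are currently usable.
func (c ibCapability) activeCount() int {
	n := c.Count - len(c.InactiveDevices)
	if n < 0 {
		return 0
	}
	return n
}

// poolCanSatisfy reports whether at least one machine in the pool has enough
// active devices for the number of interfaces the user requested.
func poolCanSatisfy(pool []ibCapability, requested int) bool {
	for _, c := range pool {
		if c.activeCount() >= requested {
			return true
		}
	}
	return false
}

func main() {
	pool := []ibCapability{
		{Count: 4, InactiveDevices: []int{0, 1}}, // two ports inactive, still usable for small requests
		{Count: 4},
	}
	for _, requested := range []int{2, 4, 5} {
		fmt.Printf("requested=%d satisfiable=%t\n", requested, poolCanSatisfy(pool, requested))
	}
}
```

Under this reading, a Machine whose `InactiveDevices` differ from the Instance Type is still eligible as long as it has enough active devices for the request.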

@hwadekar-nv
Contributor Author

Thanks @thossain-nv. Since the user will be blocked if no machine matches the InstanceType capabilities, we can hold off merging this PR.

4 participants