chore: Update Core proto snapshot for REST components by thossain-nv · Pull Request #251 · NVIDIA/infra-controller-rest

thossain-nv · 2026-03-13T01:33:03Z

Description

Snapshotting Core proto files have been a complex process with hand edits until now. This PR introduces an idempotent script in workflow-schema/cmd/core-proto-fmt/main.go that can be run anytime to fetch latest snapshot of Core proto files. The script accommodates various idiosyncrasies:

Backwards incompatible changes made in the past
Attributes added to REST snapshot but not in Core

Run make core-proto to take latest snapshot

PR also updates REST snapshot to latest proto files from Core

Several non-paginated object retrieval methods were removed from proto, PR also removes the corresponding fallback methods from Site Agent. All objects now have paginated inventory methods in gRPC so fallbacks are no longer needed.

Type of Change

Chore - Modification or removal of existing functionality (chore:)

Services Affected

API - API models or endpoints updated
Workflow - Workflow service updated
Site Agent - Site Agent updated

Related Issues (Optional)

None

Breaking Changes

This PR contains breaking changes

Testing

No testing required (docs, internal refactor, etc.)

Additional Notes

None

copy-pr-bot · 2026-03-13T01:33:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-03-13T01:36:16Z

Test Results

8 540 tests - 8 8 540 ✅ - 8 8m 0s ⏱️ -5s
143 suites - 2 0 💤 ±0
14 files ±0 0 ❌ ±0

Results for commit bdb66e0. ± Comparison against base commit 5fee842.

This pull request removes 8 tests.

github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageDpuExtensionServiceInventory_DiscoverDpuExtensionServiceInventory/test_collecting_and_publishing_dpu_extension_service_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageDpuExtensionServiceInventory_DiscoverDpuExtensionServiceInventory/test_collecting_and_publishing_dpu_extension_service_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageIBPartitionInventory_DiscoverInfiniBandPartitionInventory/test_collecting_and_publishing_ib_partition_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageIBPartitionInventory_DiscoverInfiniBandPartitionInventory/test_collecting_and_publishing_ib_partition_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSSHKeyGroupInventory_DiscoverSSHKeyGroupInventory/test_collecting_and_publishing_ssh_key_group_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSSHKeyGroupInventory_DiscoverSSHKeyGroupInventory/test_collecting_and_publishing_ssh_key_group_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSubnetInventory_DiscoverSubnetInventory/test_collecting_and_publishing_subnet_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSubnetInventory_DiscoverSubnetInventory/test_collecting_and_publishing_subnet_inventory_fallback,_normal_inventory

♻️ This comment has been updated with latest results.

coderabbitai · 2026-03-13T01:40:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Renamed Makefile proto targets and repo paths, migrated iPXE payloads from IpxeOperatingSystem to InlineIpxe, removed multiple deprecated gRPC query methods and their Unimplemented fallback handling, deleted exported workflow interface files, removed inventory fallback helpers, and updated tests and test-server mocks to ID-based endpoints and behaviors.

Changes

Cohort / File(s)	Summary
Build Configuration `Makefile`	Renamed Make targets `carbide-proto` → `core-proto`, `carbide-protogen` → `core-protogen`; switched proto repo/path (`nvidia/carbide-core` → `NVIDIA/ncx-infra-controller-core` / clone dir `carbide-core` → `nico-core`); updated copy/cleanup paths and workflow messages with a manual-update warning.
iPXE Payload Type Migration `api/pkg/api/handler/instance.go`, `api/pkg/api/handler/instancebatch.go`, `site-agent/pkg/components/instance_workflow_test.go`, `site-workflow/pkg/activity/instance_test.go`, `site-workflow/pkg/workflow/instance_test.go`	Replaced `cwssaws.IpxeOperatingSystem` usage with `cwssaws.InlineIpxe` for OS iPXE payloads in request builders and tests; `IpxeScript` values preserved; no control-flow or validation changes.
gRPC Client: Deprecated Query/Fallback Removal `site-workflow/pkg/grpc/client/infiniband_partition.go`, `.../instance.go`, `.../machine.go`, `.../sshkeygroup.go`, `.../subnet.go`, `.../vpc.go`, `.../testing.go`	Removed legacy query methods and their concrete implementations (e.g., `GetInfiniBandPartition`, `GetInstance`, `GetMachine`, `GetSSHKeyGroup`, `GetNetworkSegmentDeprecated`, `ListVPCs`) and deleted gRPC status-code fallback handling; interfaces trimmed to ID/Find/ByIDs patterns; mock client methods pruned/adjusted.
Inventory Discovery: Fallback Hook Removal `site-workflow/pkg/activity/ibpartition.go`, `.../sshkeygroup.go`, `.../subnet.go`	Unwired and deleted `internalFindFallback` hooks and corresponding fallback functions (`ibpFindFallback`, `sshKeyGroupFindFallback`, `subnetFindFallback`); inventory discovery now relies on primary ID/ByIDs/paged paths.
Inventory Tests: Fallback Cases Removed `site-workflow/pkg/activity/dpuextensionservice_test.go`, `.../ibpartition_test.go`, `.../sshkeygroup_test.go`, `.../subnet_test.go`	Deleted fallback-oriented test cases and `findIDsError`/wantError injection; removed unused gRPC error imports; tests simplified to non-fallback success scenarios.
Workflow Dispatch Change `site-agent/pkg/components/managers/sshkeygroup/workflow.go`	Removed `activityGet` handling from controller-operation dispatch and activity invocation; `GetSSHKeyGroup` no longer dispatched/invoked by this workflow metadata path.
Test Server & Test Infra Updates `site-workflow/pkg/grpc/server/forge_test_server.go`, `site-workflow/pkg/grpc/client/testing.go`, `site-agent/pkg/components/carbide_test.go`	Test server: removed query-based handlers (`FindNetworkSegments`, `FindInstances`, `FindIBPartitions`); added ID-centric endpoints/utilities (`InvokeInstancePower`, `FindMachineIds`, tenant keyset CRUD, `CreateIBPartition`, `LoadTestMachines`); mock client and tests updated to use IDs/ByIDs and list responses; adjusted test assertions to index list elements.
Instance Client OS Config Changes `site-workflow/pkg/grpc/client/instance.go`	CreateInstance mappings moved OS-scoped fields into `carbideRequest.Config.Os`: `CustomIpxe` → inline IPXE variant, `UserData` → `Config.Os.UserData` (if present), `AlwaysBootWithCustomIpxe` → `RunProvisioningInstructionsOnEveryBoot`; stopped populating tenant-level `UserData`/`PhoneHomeEnabled`.
Workflow Interface Deletions `workflow-schema/schema/cloud/interface/cloud_interface.go`, `workflow-schema/schema/site-agent/interface/workflow_interface.go`	Deleted entire exported interface files and public workflow interface type declarations for VPC, Subnet, Instance, Machine, SSHKeyGroup, InfiniBandPartition, and related methods.
Misc: Test Adjustments `site-agent/pkg/components/carbide_test.go`, other tests	Updated tests to call ID-based endpoints and adapted expectations to list-by-ids responses (e.g., assert on `NetworkSegments[0]`), removed deprecated endpoints usage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the primary change: updating the Core proto snapshot for REST components, which aligns with the Makefile changes and protobuf regeneration evident throughout the changeset.
Description check	✅ Passed	The pull request description accurately describes the changeset, covering the core-proto snapshot mechanism, removal of deprecated fallback methods, and the refactoring of instance attribute population.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/update-core-proto

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-03-13T01:41:05Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

_{🕐 Last updated: 2026-03-13 01:41:04 UTC | Commit: 600dd39}

github-actions · 2026-03-13T01:42:36Z

🛡️ Vulnerability Scan

🚨 Found 64 vulnerability(ies)
📊 vs main: 64 (no change)

Severity Breakdown:

🔴 Critical/High: 64
🟡 Medium: 0
🔵 Low/Info: 0

🔗 View full details in Security tab

_{🕐 Last updated: 2026-03-13 01:42:36 UTC | Commit: 600dd39}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

site-workflow/pkg/grpc/server/forge_test_server.go (1)

485-508: ⚠️ Potential issue | 🟠 Major

Incorrect map key check in CreateIBPartition - uses DefaultNetworkSegmentId instead of DefaultIBParitionId.

Line 492 checks f.ibp[DefaultNetworkSegmentId] when it should check f.ibp[DefaultIBParitionId]. This copy-paste error causes the function to always generate a new UUID because the network segment ID will never exist in the IBPartition map.

Proposed fix

 func (f *ForgeServerImpl) CreateIBPartition(c context.Context, req *cwssaws.IBPartitionCreationRequest) (*cwssaws.IBPartition, error) {
 	if req == nil || req.Config == nil {
 		return nil, status.Errorf(codes.InvalidArgument, "Invalid request argument")
 	}
 
 	nid := DefaultIBParitionId
-	_, ok := f.ibp[DefaultNetworkSegmentId]
+	_, ok := f.ibp[DefaultIBParitionId]
 	if ok {
 		// Default IBPartition already exists, create a new one with a different ID
 		nid = uuid.NewString()
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/server/forge_test_server.go` around lines 485 - 508,
In CreateIBPartition, the map key check uses DefaultNetworkSegmentId instead of
the intended DefaultIBParitionId, causing f.ibp lookups to miss the default
entry; update the lookup in the CreateIBPartition method (the check on
f.ibp[...]) to use DefaultIBParitionId so nid is only replaced with
uuid.NewString() when the default IB partition actually exists, leaving the rest
of the logic (nibp creation and f.ibp[nid] assignment) unchanged.

site-workflow/pkg/grpc/client/subnet.go (1)

148-161: ⚠️ Potential issue | 🟡 Minor

Incorrect log message and span name in GetNetworkSegment.

The log message on Line 149 and the span name on Line 150 both reference GetAllNetworkSegments, but this function is GetNetworkSegment. This will cause confusion in logs and distributed traces when debugging.

Proposed fix

 func (sub *network) GetNetworkSegment(ctx context.Context, request *wflows.UUID) (response *wflows.NetworkSegment, err error) {
-	log.Info().Interface("request", request).Msg("GetAllNetworkSegments: received request")
-	ctx, span := otel.Tracer(os.Getenv("LS_SERVICE_NAME")).Start(ctx, "CarbideClient-GetAllNetworkSegments")
+	log.Info().Interface("request", request).Msg("GetNetworkSegment: received request")
+	ctx, span := otel.Tracer(os.Getenv("LS_SERVICE_NAME")).Start(ctx, "CarbideClient-GetNetworkSegment")
 	defer span.End()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/client/subnet.go` around lines 148 - 161, The log and
tracing span in the network.GetNetworkSegment method are misnamed as
"GetAllNetworkSegments"; update the log.Info() call (currently
Msg("GetAllNetworkSegments: received request")) to Msg("GetNetworkSegment:
received request") and change the otel span name in otel.Tracer(...).Start(...)
from "CarbideClient-GetAllNetworkSegments" to "CarbideClient-GetNetworkSegment"
so logs and traces correctly reflect the GetNetworkSegment function.

🧹 Nitpick comments (5)

Makefile (2)

230-233: Minor inconsistency: echo vs @echo prefix.

Lines 230 and 233 use echo without the @ prefix, which causes Make to print both the command and its output. The majority of informational messages in this Makefile use @echo to suppress command echoing. While this is consistent with the existing rla-protogen target (line 246), aligning with the dominant pattern improves output clarity.

♻️ Proposed fix for consistency

 	rm -rf nico-core
-	echo "Successfully copied Core protobuf files. However, updating Core proto is a complex process and requires manual editing of copied files. Check for WARNING statements in the diff."
+	`@echo` "Successfully copied Core protobuf files. However, updating Core proto is a complex process and requires manual editing of copied files. Check for WARNING statements in the diff."

 core-protogen:
-	echo "Generating protobuf for Core"
+	`@echo` "Generating protobuf for Core"
 	cd workflow-schema && buf generate

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 230 - 233, Change the two informational echo commands
in the Makefile to use the suppressed form so the command itself isn't printed:
replace the plain echo invocations that emit "Successfully copied Core protobuf
files..." and the one under the core-protogen target that prints "Generating
protobuf for Core" with `@echo` to match the rest of the Makefile (e.g., the
rla-protogen pattern).

222-223: Git operations lack error handling; ls command serves no functional purpose.

The git pull operation on line 222 may fail silently (e.g., merge conflicts, authentication issues), and execution would continue with potentially stale or inconsistent state. Additionally, the ls command on line 223 outputs to stdout but the result is neither validated nor used—if the directory does not exist, the error message will be opaque.

Consider adding explicit validation:

♻️ Proposed improvement for robustness

 core-proto:
-	if [ -d "nico-core" ]; then cd nico-core && git pull; else git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core; fi
-	ls nico-core/crates/rpc/proto
+	`@if` [ -d "nico-core" ]; then \
+		echo "Updating existing nico-core repository..."; \
+		cd nico-core && git pull || { echo "ERROR: git pull failed"; exit 1; }; \
+	else \
+		echo "Cloning nico-core repository..."; \
+		git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core || { echo "ERROR: git clone failed"; exit 1; }; \
+	fi
+	`@if` [ ! -d "nico-core/crates/rpc/proto" ]; then \
+		echo "ERROR: Proto directory not found at nico-core/crates/rpc/proto"; \
+		exit 1; \
+	fi
 	`@for` file in nico-core/crates/rpc/proto/*.proto; do \

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 222 - 223, The Makefile currently runs conditional git
pull/clone for the "nico-core" directory and then runs `ls
nico-core/crates/rpc/proto` without validating outcomes; update this block to
add robust error handling by (1) checking the exit status of `git pull` and, on
failure, either abort with a clear error or attempt a fresh clone (delete &
reclone) to avoid stale state, (2) ensure `git clone` failures are detected and
cause the recipe to exit with a nonzero status, and (3) replace the `ls
nico-core/crates/rpc/proto` line with an explicit test like `if [ -d
"nico-core/crates/rpc/proto" ]; then :; else echo "error: missing
nico-core/crates/rpc/proto" >&2; exit 1; fi` so absence is reported and fails
the build; ensure all shell commands use `set -e` or explicit `||` checks to
prevent silent continuation.

site-workflow/pkg/grpc/server/forge_test_server.go (1)

423-440: Include keyset ID in error message for consistency and debuggability.

Line 439 returns an error without including the keyset ID, unlike other methods in this file (e.g., DeleteTenantKeyset at line 456). Including the ID aids in debugging.
Proposed fix
-	return nil, status.Errorf(codes.Internal, "TenantKeyset with ID not found")
+	return nil, status.Errorf(codes.NotFound, "TenantKeyset with ID %q not found", eid)
Note: The error code should also be codes.NotFound rather than codes.Internal to match the semantics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/server/forge_test_server.go` around lines 423 - 440,
The UpdateTenantKeyset function currently returns a generic internal error when
the keyset ID is missing; update the error to use codes.NotFound and include the
keyset identifier (eid) in the message for consistency with other methods (e.g.,
DeleteTenantKeyset). Locate UpdateTenantKeyset in ForgeServerImpl, use the
existing eid variable and f.tk map check, and change the final return to return
nil, status.Errorf(codes.NotFound, "TenantKeyset with ID %s not found", eid).

api/pkg/api/handler/instancebatch.go (2)

1858-1869: Consider removing unused variable noNvlinkDomainMachines.

The slice noNvlinkDomainMachines is populated at line 1867 but never referenced elsewhere in the function. This constitutes dead code that may cause confusion during future maintenance.

♻️ Suggested cleanup

 		nvlinkDomainMap := make(map[string][]*cdbm.Machine)
-		noNvlinkDomainMachines := []*cdbm.Machine{}
 
 		for idx := range machines {
 			machine := &machines[idx]
 			domainID := getNVLinkDomainID(machine)
 			if domainID != "" {
 				nvlinkDomainMap[domainID] = append(nvlinkDomainMap[domainID], machine)
-			} else {
-				noNvlinkDomainMachines = append(noNvlinkDomainMachines, machine)
 			}
 		}

Alternatively, if machines without NVLink domain information should be handled as a fallback scenario, consider adding a TODO comment documenting the intended future behavior.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instancebatch.go` around lines 1858 - 1869, The local
slice noNvlinkDomainMachines is populated but never used—remove the variable
declaration and the branch that appends to it (the else block that appends to
noNvlinkDomainMachines) to eliminate dead code in the loop over machines; if you
intended to handle machines without an NVLink domain, instead add a TODO comment
near the getNVLinkDomainID usage describing the desired fallback behavior and/or
add handling logic that uses noNvlinkDomainMachines elsewhere in the function.

1870-1876: Potential improvement: Guard against empty nvlinkDomainMap before selection.

When nvlinkDomainMap is empty (all machines lack NVLink domain metadata), bestDomainID remains an empty string, and nvlinkDomainMap[bestDomainID] returns nil. The subsequent length check at line 1881 handles this gracefully, but explicit handling would improve readability and provide a more informative error message.

♻️ Suggested improvement

+		if len(nvlinkDomainMap) == 0 {
+			logger.Warn().Int("totalMachines", len(machines)).
+				Msg("topology optimization enabled but no machines have NVLink domain metadata")
+			return nil, cutil.NewAPIError(http.StatusConflict,
+				"Topology optimization requires NVLink domain information, but no machines have domain metadata", nil)
+		}
+
 		// Find the NVLink domain with the most available machines
 		var bestDomainID string
 		for domainID, domainMachines := range nvlinkDomainMap {
 			if len(domainMachines) > len(nvlinkDomainMap[bestDomainID]) {
 				bestDomainID = domainID
 			}
 		}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instancebatch.go` around lines 1870 - 1876, The loop that
picks bestDomainID from nvlinkDomainMap doesn't guard against nvlinkDomainMap
being empty, leaving bestDomainID as "" and causing a nil lookup; update the
logic around nvlinkDomainMap and bestDomainID (the selection block that iterates
over nvlinkDomainMap) to first check if len(nvlinkDomainMap) == 0 and return or
handle an explicit error (or fallback behavior) with a clear message, otherwise
proceed to compute bestDomainID as before, ensuring all callers of the selection
handle the error/result appropriately.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@site-agent/pkg/components/carbide_test.go`:
- Around line 211-223: The test accesses response.NetworkSegments[0] without
checking the slice length, which can cause a panic; update the test in
carbide_test.go to assert the slice is non-empty before indexing (e.g., use
require.NotEmpty(t, response.NetworkSegments) or require.Greater(t,
len(response.NetworkSegments), 0)) and then assign responseSegment :=
response.NetworkSegments[0]; add the "github.com/stretchr/testify/require"
import if needed so the test fails with a clear assertion instead of panicking.

In `@site-workflow/pkg/grpc/client/instance.go`:
- Around line 144-160: carbideRequest.Config.Os is only created when
request.CustomIpxe is set, but later code unconditionally dereferences
carbideRequest.Config.Os for request.UserData and
request.AlwaysBootWithCustomIpxe, causing a nil-pointer panic; fix by ensuring
carbideRequest.Config.Os is initialized (create a new wflows.OperatingSystem{})
before assigning UserData or RunProvisioningInstructionsOnEveryBoot when either
request.UserData or request.AlwaysBootWithCustomIpxe is non-nil (reuse the same
creation logic used for request.CustomIpxe), then set
carbideRequest.Config.Os.UserData = request.UserData and
carbideRequest.Config.Os.RunProvisioningInstructionsOnEveryBoot =
*request.AlwaysBootWithCustomIpxe as appropriate.

---

Outside diff comments:
In `@site-workflow/pkg/grpc/client/subnet.go`:
- Around line 148-161: The log and tracing span in the network.GetNetworkSegment
method are misnamed as "GetAllNetworkSegments"; update the log.Info() call
(currently Msg("GetAllNetworkSegments: received request")) to
Msg("GetNetworkSegment: received request") and change the otel span name in
otel.Tracer(...).Start(...) from "CarbideClient-GetAllNetworkSegments" to
"CarbideClient-GetNetworkSegment" so logs and traces correctly reflect the
GetNetworkSegment function.

In `@site-workflow/pkg/grpc/server/forge_test_server.go`:
- Around line 485-508: In CreateIBPartition, the map key check uses
DefaultNetworkSegmentId instead of the intended DefaultIBParitionId, causing
f.ibp lookups to miss the default entry; update the lookup in the
CreateIBPartition method (the check on f.ibp[...]) to use DefaultIBParitionId so
nid is only replaced with uuid.NewString() when the default IB partition
actually exists, leaving the rest of the logic (nibp creation and f.ibp[nid]
assignment) unchanged.

---

Nitpick comments:
In `@api/pkg/api/handler/instancebatch.go`:
- Around line 1858-1869: The local slice noNvlinkDomainMachines is populated but
never used—remove the variable declaration and the branch that appends to it
(the else block that appends to noNvlinkDomainMachines) to eliminate dead code
in the loop over machines; if you intended to handle machines without an NVLink
domain, instead add a TODO comment near the getNVLinkDomainID usage describing
the desired fallback behavior and/or add handling logic that uses
noNvlinkDomainMachines elsewhere in the function.
- Around line 1870-1876: The loop that picks bestDomainID from nvlinkDomainMap
doesn't guard against nvlinkDomainMap being empty, leaving bestDomainID as ""
and causing a nil lookup; update the logic around nvlinkDomainMap and
bestDomainID (the selection block that iterates over nvlinkDomainMap) to first
check if len(nvlinkDomainMap) == 0 and return or handle an explicit error (or
fallback behavior) with a clear message, otherwise proceed to compute
bestDomainID as before, ensuring all callers of the selection handle the
error/result appropriately.

In `@Makefile`:
- Around line 230-233: Change the two informational echo commands in the
Makefile to use the suppressed form so the command itself isn't printed: replace
the plain echo invocations that emit "Successfully copied Core protobuf
files..." and the one under the core-protogen target that prints "Generating
protobuf for Core" with `@echo` to match the rest of the Makefile (e.g., the
rla-protogen pattern).
- Around line 222-223: The Makefile currently runs conditional git pull/clone
for the "nico-core" directory and then runs `ls nico-core/crates/rpc/proto`
without validating outcomes; update this block to add robust error handling by
(1) checking the exit status of `git pull` and, on failure, either abort with a
clear error or attempt a fresh clone (delete & reclone) to avoid stale state,
(2) ensure `git clone` failures are detected and cause the recipe to exit with a
nonzero status, and (3) replace the `ls nico-core/crates/rpc/proto` line with an
explicit test like `if [ -d "nico-core/crates/rpc/proto" ]; then :; else echo
"error: missing nico-core/crates/rpc/proto" >&2; exit 1; fi` so absence is
reported and fails the build; ensure all shell commands use `set -e` or explicit
`||` checks to prevent silent continuation.

In `@site-workflow/pkg/grpc/server/forge_test_server.go`:
- Around line 423-440: The UpdateTenantKeyset function currently returns a
generic internal error when the keyset ID is missing; update the error to use
codes.NotFound and include the keyset identifier (eid) in the message for
consistency with other methods (e.g., DeleteTenantKeyset). Locate
UpdateTenantKeyset in ForgeServerImpl, use the existing eid variable and f.tk
map check, and change the final return to return nil,
status.Errorf(codes.NotFound, "TenantKeyset with ID %s not found", eid).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b9054a9a-a694-494b-8182-c78201d86f34

📥 Commits

Reviewing files that changed from the base of the PR and between d99d660 and 7a38900.

⛔ Files ignored due to path filters (16)

workflow-schema/schema/site-agent/workflows/v1/common_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/dns_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/forge_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/forge_carbide_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/machine_discovery_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/mlx_device_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/nmx_c_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/nmx_c_carbide_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/schema/site-agent/workflows/v1/site_explorer_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
workflow-schema/site-agent/workflows/v1/common_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/dns_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/forge_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/machine_discovery_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/mlx_device_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/nmx_c_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
workflow-schema/site-agent/workflows/v1/site_explorer_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto

📒 Files selected for processing (21)

Makefile
api/pkg/api/handler/instance.go
api/pkg/api/handler/instancebatch.go
site-agent/pkg/components/carbide_test.go
site-agent/pkg/components/instance_workflow_test.go
site-agent/pkg/components/managers/sshkeygroup/workflow.go
site-workflow/pkg/activity/ibpartition.go
site-workflow/pkg/activity/instance_test.go
site-workflow/pkg/activity/sshkeygroup.go
site-workflow/pkg/activity/subnet.go
site-workflow/pkg/grpc/client/infiniband_partition.go
site-workflow/pkg/grpc/client/instance.go
site-workflow/pkg/grpc/client/machine.go
site-workflow/pkg/grpc/client/sshkeygroup.go
site-workflow/pkg/grpc/client/subnet.go
site-workflow/pkg/grpc/client/testing.go
site-workflow/pkg/grpc/client/vpc.go
site-workflow/pkg/grpc/server/forge_test_server.go
site-workflow/pkg/workflow/instance_test.go
workflow-schema/schema/cloud/interface/cloud_interface.go
workflow-schema/schema/site-agent/interface/workflow_interface.go

💤 Files with no reviewable changes (11)

site-workflow/pkg/activity/ibpartition.go
site-workflow/pkg/activity/subnet.go
site-agent/pkg/components/managers/sshkeygroup/workflow.go
site-workflow/pkg/grpc/client/machine.go
site-workflow/pkg/grpc/client/sshkeygroup.go
site-workflow/pkg/grpc/client/vpc.go
site-workflow/pkg/grpc/client/infiniband_partition.go
site-workflow/pkg/activity/sshkeygroup.go
workflow-schema/schema/site-agent/interface/workflow_interface.go
site-workflow/pkg/grpc/client/testing.go
workflow-schema/schema/cloud/interface/cloud_interface.go

coderabbitai · 2026-04-06T17:49:35Z

 				assert.Nil(t, err)
 				assert.NotNil(t, response)
-				assert.Equal(t, created.Name, response.Name)
-				assert.Equal(t, created.Id.Value, response.Id.Value)
-				assert.Equal(t, created.Mtu, response.Mtu)
-				assert.Equal(t, created.SubdomainId.Value, response.SubdomainId.Value)
-				assert.Equal(t, created.VpcId.Value, response.VpcId.Value)
-				assert.Equal(t, len(created.Prefixes), len(response.Prefixes))
-				assert.Equal(t, created.Prefixes[0].Prefix, response.Prefixes[0].Prefix)
-				assert.Equal(t, created.Prefixes[0].Gateway, response.Prefixes[0].Gateway)
-				assert.Equal(t, created.Prefixes[0].ReserveFirst, response.Prefixes[0].ReserveFirst)
+				responseSegment := response.NetworkSegments[0]
+				assert.Equal(t, created.Name, responseSegment.Name)
+				assert.Equal(t, created.Id.Value, responseSegment.Id.Value)
+				assert.Equal(t, created.Mtu, responseSegment.Mtu)
+				assert.Equal(t, created.SubdomainId.Value, responseSegment.SubdomainId.Value)
+				assert.Equal(t, created.VpcId.Value, responseSegment.VpcId.Value)
+				assert.Equal(t, len(created.Prefixes), len(responseSegment.Prefixes))
+				assert.Equal(t, created.Prefixes[0].Prefix, responseSegment.Prefixes[0].Prefix)
+				assert.Equal(t, created.Prefixes[0].Gateway, responseSegment.Prefixes[0].Gateway)
+				assert.Equal(t, created.Prefixes[0].ReserveFirst, responseSegment.Prefixes[0].ReserveFirst)
 				t.Log("GRPCResponse", response)


⚠️ Potential issue | 🟡 Minor

Add bounds check before accessing slice element to prevent panic.

Line 213 accesses response.NetworkSegments[0] without verifying the slice is non-empty. If the server returns an empty list (which would indicate a test regression), this will panic rather than fail gracefully with a descriptive assertion error.

Proposed fix using `require`

assert.Nil(t, err) assert.NotNil(t, response) + require.NotEmpty(t, response.NetworkSegments, "expected at least one network segment in response") responseSegment := response.NetworkSegments[0] assert.Equal(t, created.Name, responseSegment.Name)

Note: You may need to add "github.com/stretchr/testify/require" to imports.

As per coding guidelines, use testify (assert/require) for test assertions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@site-agent/pkg/components/carbide_test.go` around lines 211 - 223, The test accesses response.NetworkSegments[0] without checking the slice length, which can cause a panic; update the test in carbide_test.go to assert the slice is non-empty before indexing (e.g., use require.NotEmpty(t, response.NetworkSegments) or require.Greater(t, len(response.NetworkSegments), 0)) and then assign responseSegment := response.NetworkSegments[0]; add the "github.com/stretchr/testify/require" import if needed so the test fails with a clear assertion instead of panicking.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@site-workflow/pkg/grpc/client/instance.go`:
- Around line 146-160: The CreateInstance handler is missing mapping for
request.PhoneHomeEnabled which causes silent data loss; update the mapping in
the same block that sets OS fields (near where CustomIpxe, UserData, and
AlwaysBootWithCustomIpxe are handled) so that
carbideRequest.Config.Os.PhoneHomeEnabled is assigned from
request.PhoneHomeEnabled (and ensure the types match/unpack pointers as done
elsewhere) so the OperatingSystem.phone_home_enabled field is forwarded to the
carbide service.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3d4e83fb-2f66-44cc-a8ec-6ef068b3d9d0

📥 Commits

Reviewing files that changed from the base of the PR and between 7711d41 and a9c0eb8.

📒 Files selected for processing (1)

site-workflow/pkg/grpc/client/instance.go

thossain-nv · 2026-04-06T20:59:41Z

-  reserved 9; // host_nics in carbide-core, not used in REST
+  repeated ExpectedHostNic host_nics = 9;
  optional common.RackId rack_id = 10;
+  // WARNING: Conflict with Core - optional bool default_pause_ingestion_and_poweron = 11;


@chet Can we please resolve Core proto differences before these changes (v1.3.0) hit Production? Otherwise ExpectedMachine inventory might break in Prod.

pbreton · 2026-04-08T15:50:49Z

+	rm -rf workflow-schema/schema/site-agent/workflows/v1/*.pb.go
+
+core-proto-fetch:
+	if [ -d "nico-core" ]; then cd nico-core && git fetch origin && git reset --hard origin/main; else git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core; fi


What about allowing to use a custom 'nico-core' (using a link) with something like:
[ -h "nico-core" ] || if [ -d "nico-core" ]; then cd nico-core ... else git clone ...; fi

That seems helpful. Let me look into it.

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

…roup Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

…ed merge Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

…EST snapshot Signed-off-by: Tareque Hossain <thossain@nvidia.com>

thossain-nv marked this pull request as draft March 13, 2026 01:34

thossain-nv force-pushed the chore/update-core-proto branch from d592982 to 600dd39 Compare March 13, 2026 01:40

thossain-nv force-pushed the chore/update-core-proto branch from 600dd39 to 97ece6a Compare April 3, 2026 23:31

thossain-nv marked this pull request as ready for review April 6, 2026 17:40

thossain-nv requested a review from a team as a code owner April 6, 2026 17:40

thossain-nv force-pushed the chore/update-core-proto branch from 5f3a8c8 to 7a38900 Compare April 6, 2026 17:40

coderabbitai Bot reviewed Apr 6, 2026

View reviewed changes

Comment thread site-workflow/pkg/grpc/client/instance.go

thossain-nv force-pushed the chore/update-core-proto branch from a9c0eb8 to d854c97 Compare April 6, 2026 20:44

thossain-nv commented Apr 6, 2026

View reviewed changes

thossain-nv force-pushed the chore/update-core-proto branch from 5832527 to a76859e Compare April 7, 2026 22:58

thossain-nv requested review from hwadekar-nv, kfelternv, parmani-nv and pbreton April 7, 2026 22:58

pbreton reviewed Apr 8, 2026

View reviewed changes

thossain-nv force-pushed the chore/update-core-proto branch from 6177b94 to 9d0befc Compare April 8, 2026 16:51

thossain-nv commented Apr 8, 2026

View reviewed changes

Comment thread Makefile

thossain-nv added 7 commits April 8, 2026 13:16

chore: Update Core proto snapshot

4f38574

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Updated Go modules to remove deprecated proto methods

5be4872

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Manually resolve proto module import

a94f445

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Fix inconsistent proto package naming

e5c32e6

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Fix IB Partition inventory fallback tests

9bf0fbe

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Remove deprecated inventory fallback methods for Subnet and SSH Key G…

0d3f6f7

…roup Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Improve attribute population for legacy create Instance workflow

4677d5b

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

thossain-nv added 6 commits April 8, 2026 13:16

Added Core proto formatting, re-order forge proto messages for improv…

8a41616

…ed merge Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Added Core snapshot script, update to latest proto

f6fff8c

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Update Go formatter path

c029009

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Remove unnecessary proto/buf cleaning in make command

069ca12

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Allow snapshotting in forked repo

9f27207

Signed-off-by: Tareque Hossain <thossain@nvidia.com>

Use higher index for additional expected object attributes added to R…

bdb66e0

…EST snapshot Signed-off-by: Tareque Hossain <thossain@nvidia.com>

thossain-nv force-pushed the chore/update-core-proto branch from 9d0befc to bdb66e0 Compare April 8, 2026 20:16

hwadekar-nv approved these changes Apr 8, 2026

View reviewed changes

thossain-nv merged commit 371c219 into main Apr 8, 2026
51 checks passed

thossain-nv deleted the chore/update-core-proto branch April 8, 2026 23:18

Conversation

thossain-nv commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Services Affected

Related Issues (Optional)

Breaking Changes

Testing

Additional Notes

Uh oh!

copy-pr-bot Bot commented Mar 13, 2026

Uh oh!

github-actions Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

coderabbitai Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Mar 13, 2026

🔐 TruffleHog Secret Scan

Uh oh!

github-actions Bot commented Mar 13, 2026

🛡️ Vulnerability Scan

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

thossain-nv Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

pbreton Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

thossain-nv Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

thossain-nv commented Mar 13, 2026 •

edited

Loading

github-actions Bot commented Mar 13, 2026 •

edited

Loading

coderabbitai Bot commented Mar 13, 2026 •

edited

Loading