Skip to content

chore: Update Core proto snapshot for REST components#251

Merged
thossain-nv merged 13 commits intomainfrom
chore/update-core-proto
Apr 8, 2026
Merged

chore: Update Core proto snapshot for REST components#251
thossain-nv merged 13 commits intomainfrom
chore/update-core-proto

Conversation

@thossain-nv
Copy link
Copy Markdown
Contributor

@thossain-nv thossain-nv commented Mar 13, 2026

Description

Snapshotting Core proto files have been a complex process with hand edits until now. This PR introduces an idempotent script in workflow-schema/cmd/core-proto-fmt/main.go that can be run anytime to fetch latest snapshot of Core proto files. The script accommodates various idiosyncrasies:

  • Backwards incompatible changes made in the past
  • Attributes added to REST snapshot but not in Core

Run make core-proto to take latest snapshot

PR also updates REST snapshot to latest proto files from Core

Several non-paginated object retrieval methods were removed from proto, PR also removes the corresponding fallback methods from Site Agent. All objects now have paginated inventory methods in gRPC so fallbacks are no longer needed.

Type of Change

  • Chore - Modification or removal of existing functionality (chore:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • Site Agent - Site Agent updated

Related Issues (Optional)

None

Breaking Changes

  • This PR contains breaking changes

Testing

  • No testing required (docs, internal refactor, etc.)

Additional Notes

None

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@thossain-nv thossain-nv marked this pull request as draft March 13, 2026 01:34
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 13, 2026

Test Results

8 540 tests   - 8   8 540 ✅  - 8   8m 0s ⏱️ -5s
  143 suites  - 2       0 💤 ±0 
   14 files   ±0       0 ❌ ±0 

Results for commit bdb66e0. ± Comparison against base commit 5fee842.

This pull request removes 8 tests.
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageDpuExtensionServiceInventory_DiscoverDpuExtensionServiceInventory/test_collecting_and_publishing_dpu_extension_service_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageDpuExtensionServiceInventory_DiscoverDpuExtensionServiceInventory/test_collecting_and_publishing_dpu_extension_service_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageIBPartitionInventory_DiscoverInfiniBandPartitionInventory/test_collecting_and_publishing_ib_partition_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageIBPartitionInventory_DiscoverInfiniBandPartitionInventory/test_collecting_and_publishing_ib_partition_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSSHKeyGroupInventory_DiscoverSSHKeyGroupInventory/test_collecting_and_publishing_ssh_key_group_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSSHKeyGroupInventory_DiscoverSSHKeyGroupInventory/test_collecting_and_publishing_ssh_key_group_inventory_fallback,_normal_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSubnetInventory_DiscoverSubnetInventory/test_collecting_and_publishing_subnet_inventory_fallback,_empty_inventory
github.com/NVIDIA/ncx-infra-controller-rest/site-workflow/pkg/activity ‑ TestManageSubnetInventory_DiscoverSubnetInventory/test_collecting_and_publishing_subnet_inventory_fallback,_normal_inventory

♻️ This comment has been updated with latest results.

@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from d592982 to 600dd39 Compare March 13, 2026 01:40
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Renamed Makefile proto targets and repo paths, migrated iPXE payloads from IpxeOperatingSystem to InlineIpxe, removed multiple deprecated gRPC query methods and their Unimplemented fallback handling, deleted exported workflow interface files, removed inventory fallback helpers, and updated tests and test-server mocks to ID-based endpoints and behaviors.

Changes

Cohort / File(s) Summary
Build Configuration
Makefile
Renamed Make targets carbide-protocore-proto, carbide-protogencore-protogen; switched proto repo/path (nvidia/carbide-coreNVIDIA/ncx-infra-controller-core / clone dir carbide-corenico-core); updated copy/cleanup paths and workflow messages with a manual-update warning.
iPXE Payload Type Migration
api/pkg/api/handler/instance.go, api/pkg/api/handler/instancebatch.go, site-agent/pkg/components/instance_workflow_test.go, site-workflow/pkg/activity/instance_test.go, site-workflow/pkg/workflow/instance_test.go
Replaced cwssaws.IpxeOperatingSystem usage with cwssaws.InlineIpxe for OS iPXE payloads in request builders and tests; IpxeScript values preserved; no control-flow or validation changes.
gRPC Client: Deprecated Query/Fallback Removal
site-workflow/pkg/grpc/client/infiniband_partition.go, .../instance.go, .../machine.go, .../sshkeygroup.go, .../subnet.go, .../vpc.go, .../testing.go
Removed legacy query methods and their concrete implementations (e.g., GetInfiniBandPartition, GetInstance, GetMachine, GetSSHKeyGroup, GetNetworkSegmentDeprecated, ListVPCs) and deleted gRPC status-code fallback handling; interfaces trimmed to ID/Find/ByIDs patterns; mock client methods pruned/adjusted.
Inventory Discovery: Fallback Hook Removal
site-workflow/pkg/activity/ibpartition.go, .../sshkeygroup.go, .../subnet.go
Unwired and deleted internalFindFallback hooks and corresponding fallback functions (ibpFindFallback, sshKeyGroupFindFallback, subnetFindFallback); inventory discovery now relies on primary ID/ByIDs/paged paths.
Inventory Tests: Fallback Cases Removed
site-workflow/pkg/activity/dpuextensionservice_test.go, .../ibpartition_test.go, .../sshkeygroup_test.go, .../subnet_test.go
Deleted fallback-oriented test cases and findIDsError/wantError injection; removed unused gRPC error imports; tests simplified to non-fallback success scenarios.
Workflow Dispatch Change
site-agent/pkg/components/managers/sshkeygroup/workflow.go
Removed activityGet handling from controller-operation dispatch and activity invocation; GetSSHKeyGroup no longer dispatched/invoked by this workflow metadata path.
Test Server & Test Infra Updates
site-workflow/pkg/grpc/server/forge_test_server.go, site-workflow/pkg/grpc/client/testing.go, site-agent/pkg/components/carbide_test.go
Test server: removed query-based handlers (FindNetworkSegments, FindInstances, FindIBPartitions); added ID-centric endpoints/utilities (InvokeInstancePower, FindMachineIds, tenant keyset CRUD, CreateIBPartition, LoadTestMachines); mock client and tests updated to use IDs/ByIDs and list responses; adjusted test assertions to index list elements.
Instance Client OS Config Changes
site-workflow/pkg/grpc/client/instance.go
CreateInstance mappings moved OS-scoped fields into carbideRequest.Config.Os: CustomIpxe → inline IPXE variant, UserDataConfig.Os.UserData (if present), AlwaysBootWithCustomIpxeRunProvisioningInstructionsOnEveryBoot; stopped populating tenant-level UserData/PhoneHomeEnabled.
Workflow Interface Deletions
workflow-schema/schema/cloud/interface/cloud_interface.go, workflow-schema/schema/site-agent/interface/workflow_interface.go
Deleted entire exported interface files and public workflow interface type declarations for VPC, Subnet, Instance, Machine, SSHKeyGroup, InfiniBandPartition, and related methods.
Misc: Test Adjustments
site-agent/pkg/components/carbide_test.go, other tests
Updated tests to call ID-based endpoints and adapted expectations to list-by-ids responses (e.g., assert on NetworkSegments[0]), removed deprecated endpoints usage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the primary change: updating the Core proto snapshot for REST components, which aligns with the Makefile changes and protobuf regeneration evident throughout the changeset.
Description check ✅ Passed The pull request description accurately describes the changeset, covering the core-proto snapshot mechanism, removal of deprecated fallback methods, and the refactoring of instance attribute population.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/update-core-proto

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-13 01:41:04 UTC | Commit: 600dd39

@github-actions
Copy link
Copy Markdown

🛡️ Vulnerability Scan

🚨 Found 64 vulnerability(ies)
📊 vs main: 64 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 64
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-13 01:42:36 UTC | Commit: 600dd39

@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from 600dd39 to 97ece6a Compare April 3, 2026 23:31
@thossain-nv thossain-nv marked this pull request as ready for review April 6, 2026 17:40
@thossain-nv thossain-nv requested a review from a team as a code owner April 6, 2026 17:40
@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from 5f3a8c8 to 7a38900 Compare April 6, 2026 17:40
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
site-workflow/pkg/grpc/server/forge_test_server.go (1)

485-508: ⚠️ Potential issue | 🟠 Major

Incorrect map key check in CreateIBPartition - uses DefaultNetworkSegmentId instead of DefaultIBParitionId.

Line 492 checks f.ibp[DefaultNetworkSegmentId] when it should check f.ibp[DefaultIBParitionId]. This copy-paste error causes the function to always generate a new UUID because the network segment ID will never exist in the IBPartition map.

Proposed fix
 func (f *ForgeServerImpl) CreateIBPartition(c context.Context, req *cwssaws.IBPartitionCreationRequest) (*cwssaws.IBPartition, error) {
 	if req == nil || req.Config == nil {
 		return nil, status.Errorf(codes.InvalidArgument, "Invalid request argument")
 	}
 
 	nid := DefaultIBParitionId
-	_, ok := f.ibp[DefaultNetworkSegmentId]
+	_, ok := f.ibp[DefaultIBParitionId]
 	if ok {
 		// Default IBPartition already exists, create a new one with a different ID
 		nid = uuid.NewString()
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/server/forge_test_server.go` around lines 485 - 508,
In CreateIBPartition, the map key check uses DefaultNetworkSegmentId instead of
the intended DefaultIBParitionId, causing f.ibp lookups to miss the default
entry; update the lookup in the CreateIBPartition method (the check on
f.ibp[...]) to use DefaultIBParitionId so nid is only replaced with
uuid.NewString() when the default IB partition actually exists, leaving the rest
of the logic (nibp creation and f.ibp[nid] assignment) unchanged.
site-workflow/pkg/grpc/client/subnet.go (1)

148-161: ⚠️ Potential issue | 🟡 Minor

Incorrect log message and span name in GetNetworkSegment.

The log message on Line 149 and the span name on Line 150 both reference GetAllNetworkSegments, but this function is GetNetworkSegment. This will cause confusion in logs and distributed traces when debugging.

Proposed fix
 func (sub *network) GetNetworkSegment(ctx context.Context, request *wflows.UUID) (response *wflows.NetworkSegment, err error) {
-	log.Info().Interface("request", request).Msg("GetAllNetworkSegments: received request")
-	ctx, span := otel.Tracer(os.Getenv("LS_SERVICE_NAME")).Start(ctx, "CarbideClient-GetAllNetworkSegments")
+	log.Info().Interface("request", request).Msg("GetNetworkSegment: received request")
+	ctx, span := otel.Tracer(os.Getenv("LS_SERVICE_NAME")).Start(ctx, "CarbideClient-GetNetworkSegment")
 	defer span.End()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/client/subnet.go` around lines 148 - 161, The log and
tracing span in the network.GetNetworkSegment method are misnamed as
"GetAllNetworkSegments"; update the log.Info() call (currently
Msg("GetAllNetworkSegments: received request")) to Msg("GetNetworkSegment:
received request") and change the otel span name in otel.Tracer(...).Start(...)
from "CarbideClient-GetAllNetworkSegments" to "CarbideClient-GetNetworkSegment"
so logs and traces correctly reflect the GetNetworkSegment function.
🧹 Nitpick comments (5)
Makefile (2)

230-233: Minor inconsistency: echo vs @echo prefix.

Lines 230 and 233 use echo without the @ prefix, which causes Make to print both the command and its output. The majority of informational messages in this Makefile use @echo to suppress command echoing. While this is consistent with the existing rla-protogen target (line 246), aligning with the dominant pattern improves output clarity.

♻️ Proposed fix for consistency
 	rm -rf nico-core
-	echo "Successfully copied Core protobuf files. However, updating Core proto is a complex process and requires manual editing of copied files. Check for WARNING statements in the diff."
+	`@echo` "Successfully copied Core protobuf files. However, updating Core proto is a complex process and requires manual editing of copied files. Check for WARNING statements in the diff."

 core-protogen:
-	echo "Generating protobuf for Core"
+	`@echo` "Generating protobuf for Core"
 	cd workflow-schema && buf generate
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 230 - 233, Change the two informational echo commands
in the Makefile to use the suppressed form so the command itself isn't printed:
replace the plain echo invocations that emit "Successfully copied Core protobuf
files..." and the one under the core-protogen target that prints "Generating
protobuf for Core" with `@echo` to match the rest of the Makefile (e.g., the
rla-protogen pattern).

222-223: Git operations lack error handling; ls command serves no functional purpose.

The git pull operation on line 222 may fail silently (e.g., merge conflicts, authentication issues), and execution would continue with potentially stale or inconsistent state. Additionally, the ls command on line 223 outputs to stdout but the result is neither validated nor used—if the directory does not exist, the error message will be opaque.

Consider adding explicit validation:

♻️ Proposed improvement for robustness
 core-proto:
-	if [ -d "nico-core" ]; then cd nico-core && git pull; else git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core; fi
-	ls nico-core/crates/rpc/proto
+	`@if` [ -d "nico-core" ]; then \
+		echo "Updating existing nico-core repository..."; \
+		cd nico-core && git pull || { echo "ERROR: git pull failed"; exit 1; }; \
+	else \
+		echo "Cloning nico-core repository..."; \
+		git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core || { echo "ERROR: git clone failed"; exit 1; }; \
+	fi
+	`@if` [ ! -d "nico-core/crates/rpc/proto" ]; then \
+		echo "ERROR: Proto directory not found at nico-core/crates/rpc/proto"; \
+		exit 1; \
+	fi
 	`@for` file in nico-core/crates/rpc/proto/*.proto; do \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 222 - 223, The Makefile currently runs conditional git
pull/clone for the "nico-core" directory and then runs `ls
nico-core/crates/rpc/proto` without validating outcomes; update this block to
add robust error handling by (1) checking the exit status of `git pull` and, on
failure, either abort with a clear error or attempt a fresh clone (delete &
reclone) to avoid stale state, (2) ensure `git clone` failures are detected and
cause the recipe to exit with a nonzero status, and (3) replace the `ls
nico-core/crates/rpc/proto` line with an explicit test like `if [ -d
"nico-core/crates/rpc/proto" ]; then :; else echo "error: missing
nico-core/crates/rpc/proto" >&2; exit 1; fi` so absence is reported and fails
the build; ensure all shell commands use `set -e` or explicit `||` checks to
prevent silent continuation.
site-workflow/pkg/grpc/server/forge_test_server.go (1)

423-440: Include keyset ID in error message for consistency and debuggability.

Line 439 returns an error without including the keyset ID, unlike other methods in this file (e.g., DeleteTenantKeyset at line 456). Including the ID aids in debugging.

Proposed fix
-	return nil, status.Errorf(codes.Internal, "TenantKeyset with ID not found")
+	return nil, status.Errorf(codes.NotFound, "TenantKeyset with ID %q not found", eid)

Note: The error code should also be codes.NotFound rather than codes.Internal to match the semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/server/forge_test_server.go` around lines 423 - 440,
The UpdateTenantKeyset function currently returns a generic internal error when
the keyset ID is missing; update the error to use codes.NotFound and include the
keyset identifier (eid) in the message for consistency with other methods (e.g.,
DeleteTenantKeyset). Locate UpdateTenantKeyset in ForgeServerImpl, use the
existing eid variable and f.tk map check, and change the final return to return
nil, status.Errorf(codes.NotFound, "TenantKeyset with ID %s not found", eid).
api/pkg/api/handler/instancebatch.go (2)

1858-1869: Consider removing unused variable noNvlinkDomainMachines.

The slice noNvlinkDomainMachines is populated at line 1867 but never referenced elsewhere in the function. This constitutes dead code that may cause confusion during future maintenance.

♻️ Suggested cleanup
 		nvlinkDomainMap := make(map[string][]*cdbm.Machine)
-		noNvlinkDomainMachines := []*cdbm.Machine{}
 
 		for idx := range machines {
 			machine := &machines[idx]
 			domainID := getNVLinkDomainID(machine)
 			if domainID != "" {
 				nvlinkDomainMap[domainID] = append(nvlinkDomainMap[domainID], machine)
-			} else {
-				noNvlinkDomainMachines = append(noNvlinkDomainMachines, machine)
 			}
 		}

Alternatively, if machines without NVLink domain information should be handled as a fallback scenario, consider adding a TODO comment documenting the intended future behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instancebatch.go` around lines 1858 - 1869, The local
slice noNvlinkDomainMachines is populated but never used—remove the variable
declaration and the branch that appends to it (the else block that appends to
noNvlinkDomainMachines) to eliminate dead code in the loop over machines; if you
intended to handle machines without an NVLink domain, instead add a TODO comment
near the getNVLinkDomainID usage describing the desired fallback behavior and/or
add handling logic that uses noNvlinkDomainMachines elsewhere in the function.

1870-1876: Potential improvement: Guard against empty nvlinkDomainMap before selection.

When nvlinkDomainMap is empty (all machines lack NVLink domain metadata), bestDomainID remains an empty string, and nvlinkDomainMap[bestDomainID] returns nil. The subsequent length check at line 1881 handles this gracefully, but explicit handling would improve readability and provide a more informative error message.

♻️ Suggested improvement
+		if len(nvlinkDomainMap) == 0 {
+			logger.Warn().Int("totalMachines", len(machines)).
+				Msg("topology optimization enabled but no machines have NVLink domain metadata")
+			return nil, cutil.NewAPIError(http.StatusConflict,
+				"Topology optimization requires NVLink domain information, but no machines have domain metadata", nil)
+		}
+
 		// Find the NVLink domain with the most available machines
 		var bestDomainID string
 		for domainID, domainMachines := range nvlinkDomainMap {
 			if len(domainMachines) > len(nvlinkDomainMap[bestDomainID]) {
 				bestDomainID = domainID
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instancebatch.go` around lines 1870 - 1876, The loop that
picks bestDomainID from nvlinkDomainMap doesn't guard against nvlinkDomainMap
being empty, leaving bestDomainID as "" and causing a nil lookup; update the
logic around nvlinkDomainMap and bestDomainID (the selection block that iterates
over nvlinkDomainMap) to first check if len(nvlinkDomainMap) == 0 and return or
handle an explicit error (or fallback behavior) with a clear message, otherwise
proceed to compute bestDomainID as before, ensuring all callers of the selection
handle the error/result appropriately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@site-agent/pkg/components/carbide_test.go`:
- Around line 211-223: The test accesses response.NetworkSegments[0] without
checking the slice length, which can cause a panic; update the test in
carbide_test.go to assert the slice is non-empty before indexing (e.g., use
require.NotEmpty(t, response.NetworkSegments) or require.Greater(t,
len(response.NetworkSegments), 0)) and then assign responseSegment :=
response.NetworkSegments[0]; add the "github.com/stretchr/testify/require"
import if needed so the test fails with a clear assertion instead of panicking.

In `@site-workflow/pkg/grpc/client/instance.go`:
- Around line 144-160: carbideRequest.Config.Os is only created when
request.CustomIpxe is set, but later code unconditionally dereferences
carbideRequest.Config.Os for request.UserData and
request.AlwaysBootWithCustomIpxe, causing a nil-pointer panic; fix by ensuring
carbideRequest.Config.Os is initialized (create a new wflows.OperatingSystem{})
before assigning UserData or RunProvisioningInstructionsOnEveryBoot when either
request.UserData or request.AlwaysBootWithCustomIpxe is non-nil (reuse the same
creation logic used for request.CustomIpxe), then set
carbideRequest.Config.Os.UserData = request.UserData and
carbideRequest.Config.Os.RunProvisioningInstructionsOnEveryBoot =
*request.AlwaysBootWithCustomIpxe as appropriate.

---

Outside diff comments:
In `@site-workflow/pkg/grpc/client/subnet.go`:
- Around line 148-161: The log and tracing span in the network.GetNetworkSegment
method are misnamed as "GetAllNetworkSegments"; update the log.Info() call
(currently Msg("GetAllNetworkSegments: received request")) to
Msg("GetNetworkSegment: received request") and change the otel span name in
otel.Tracer(...).Start(...) from "CarbideClient-GetAllNetworkSegments" to
"CarbideClient-GetNetworkSegment" so logs and traces correctly reflect the
GetNetworkSegment function.

In `@site-workflow/pkg/grpc/server/forge_test_server.go`:
- Around line 485-508: In CreateIBPartition, the map key check uses
DefaultNetworkSegmentId instead of the intended DefaultIBParitionId, causing
f.ibp lookups to miss the default entry; update the lookup in the
CreateIBPartition method (the check on f.ibp[...]) to use DefaultIBParitionId so
nid is only replaced with uuid.NewString() when the default IB partition
actually exists, leaving the rest of the logic (nibp creation and f.ibp[nid]
assignment) unchanged.

---

Nitpick comments:
In `@api/pkg/api/handler/instancebatch.go`:
- Around line 1858-1869: The local slice noNvlinkDomainMachines is populated but
never used—remove the variable declaration and the branch that appends to it
(the else block that appends to noNvlinkDomainMachines) to eliminate dead code
in the loop over machines; if you intended to handle machines without an NVLink
domain, instead add a TODO comment near the getNVLinkDomainID usage describing
the desired fallback behavior and/or add handling logic that uses
noNvlinkDomainMachines elsewhere in the function.
- Around line 1870-1876: The loop that picks bestDomainID from nvlinkDomainMap
doesn't guard against nvlinkDomainMap being empty, leaving bestDomainID as ""
and causing a nil lookup; update the logic around nvlinkDomainMap and
bestDomainID (the selection block that iterates over nvlinkDomainMap) to first
check if len(nvlinkDomainMap) == 0 and return or handle an explicit error (or
fallback behavior) with a clear message, otherwise proceed to compute
bestDomainID as before, ensuring all callers of the selection handle the
error/result appropriately.

In `@Makefile`:
- Around line 230-233: Change the two informational echo commands in the
Makefile to use the suppressed form so the command itself isn't printed: replace
the plain echo invocations that emit "Successfully copied Core protobuf
files..." and the one under the core-protogen target that prints "Generating
protobuf for Core" with `@echo` to match the rest of the Makefile (e.g., the
rla-protogen pattern).
- Around line 222-223: The Makefile currently runs conditional git pull/clone
for the "nico-core" directory and then runs `ls nico-core/crates/rpc/proto`
without validating outcomes; update this block to add robust error handling by
(1) checking the exit status of `git pull` and, on failure, either abort with a
clear error or attempt a fresh clone (delete & reclone) to avoid stale state,
(2) ensure `git clone` failures are detected and cause the recipe to exit with a
nonzero status, and (3) replace the `ls nico-core/crates/rpc/proto` line with an
explicit test like `if [ -d "nico-core/crates/rpc/proto" ]; then :; else echo
"error: missing nico-core/crates/rpc/proto" >&2; exit 1; fi` so absence is
reported and fails the build; ensure all shell commands use `set -e` or explicit
`||` checks to prevent silent continuation.

In `@site-workflow/pkg/grpc/server/forge_test_server.go`:
- Around line 423-440: The UpdateTenantKeyset function currently returns a
generic internal error when the keyset ID is missing; update the error to use
codes.NotFound and include the keyset identifier (eid) in the message for
consistency with other methods (e.g., DeleteTenantKeyset). Locate
UpdateTenantKeyset in ForgeServerImpl, use the existing eid variable and f.tk
map check, and change the final return to return nil,
status.Errorf(codes.NotFound, "TenantKeyset with ID %s not found", eid).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b9054a9a-a694-494b-8182-c78201d86f34

📥 Commits

Reviewing files that changed from the base of the PR and between d99d660 and 7a38900.

⛔ Files ignored due to path filters (16)
  • workflow-schema/schema/site-agent/workflows/v1/common_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/dns_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/forge_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/forge_carbide_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/machine_discovery_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/mlx_device_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/nmx_c_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/nmx_c_carbide_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/schema/site-agent/workflows/v1/site_explorer_carbide.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • workflow-schema/site-agent/workflows/v1/common_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/dns_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/forge_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/machine_discovery_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/mlx_device_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/nmx_c_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
  • workflow-schema/site-agent/workflows/v1/site_explorer_carbide.proto is excluded by !workflow-schema/site-agent/workflows/v1/*_carbide.proto
📒 Files selected for processing (21)
  • Makefile
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instancebatch.go
  • site-agent/pkg/components/carbide_test.go
  • site-agent/pkg/components/instance_workflow_test.go
  • site-agent/pkg/components/managers/sshkeygroup/workflow.go
  • site-workflow/pkg/activity/ibpartition.go
  • site-workflow/pkg/activity/instance_test.go
  • site-workflow/pkg/activity/sshkeygroup.go
  • site-workflow/pkg/activity/subnet.go
  • site-workflow/pkg/grpc/client/infiniband_partition.go
  • site-workflow/pkg/grpc/client/instance.go
  • site-workflow/pkg/grpc/client/machine.go
  • site-workflow/pkg/grpc/client/sshkeygroup.go
  • site-workflow/pkg/grpc/client/subnet.go
  • site-workflow/pkg/grpc/client/testing.go
  • site-workflow/pkg/grpc/client/vpc.go
  • site-workflow/pkg/grpc/server/forge_test_server.go
  • site-workflow/pkg/workflow/instance_test.go
  • workflow-schema/schema/cloud/interface/cloud_interface.go
  • workflow-schema/schema/site-agent/interface/workflow_interface.go
💤 Files with no reviewable changes (11)
  • site-workflow/pkg/activity/ibpartition.go
  • site-workflow/pkg/activity/subnet.go
  • site-agent/pkg/components/managers/sshkeygroup/workflow.go
  • site-workflow/pkg/grpc/client/machine.go
  • site-workflow/pkg/grpc/client/sshkeygroup.go
  • site-workflow/pkg/grpc/client/vpc.go
  • site-workflow/pkg/grpc/client/infiniband_partition.go
  • site-workflow/pkg/activity/sshkeygroup.go
  • workflow-schema/schema/site-agent/interface/workflow_interface.go
  • site-workflow/pkg/grpc/client/testing.go
  • workflow-schema/schema/cloud/interface/cloud_interface.go

Comment on lines 211 to 223
assert.Nil(t, err)
assert.NotNil(t, response)
assert.Equal(t, created.Name, response.Name)
assert.Equal(t, created.Id.Value, response.Id.Value)
assert.Equal(t, created.Mtu, response.Mtu)
assert.Equal(t, created.SubdomainId.Value, response.SubdomainId.Value)
assert.Equal(t, created.VpcId.Value, response.VpcId.Value)
assert.Equal(t, len(created.Prefixes), len(response.Prefixes))
assert.Equal(t, created.Prefixes[0].Prefix, response.Prefixes[0].Prefix)
assert.Equal(t, created.Prefixes[0].Gateway, response.Prefixes[0].Gateway)
assert.Equal(t, created.Prefixes[0].ReserveFirst, response.Prefixes[0].ReserveFirst)
responseSegment := response.NetworkSegments[0]
assert.Equal(t, created.Name, responseSegment.Name)
assert.Equal(t, created.Id.Value, responseSegment.Id.Value)
assert.Equal(t, created.Mtu, responseSegment.Mtu)
assert.Equal(t, created.SubdomainId.Value, responseSegment.SubdomainId.Value)
assert.Equal(t, created.VpcId.Value, responseSegment.VpcId.Value)
assert.Equal(t, len(created.Prefixes), len(responseSegment.Prefixes))
assert.Equal(t, created.Prefixes[0].Prefix, responseSegment.Prefixes[0].Prefix)
assert.Equal(t, created.Prefixes[0].Gateway, responseSegment.Prefixes[0].Gateway)
assert.Equal(t, created.Prefixes[0].ReserveFirst, responseSegment.Prefixes[0].ReserveFirst)
t.Log("GRPCResponse", response)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add bounds check before accessing slice element to prevent panic.

Line 213 accesses response.NetworkSegments[0] without verifying the slice is non-empty. If the server returns an empty list (which would indicate a test regression), this will panic rather than fail gracefully with a descriptive assertion error.

Proposed fix using `require`
 				assert.Nil(t, err)
 				assert.NotNil(t, response)
+				require.NotEmpty(t, response.NetworkSegments, "expected at least one network segment in response")
 				responseSegment := response.NetworkSegments[0]
 				assert.Equal(t, created.Name, responseSegment.Name)

Note: You may need to add "github.com/stretchr/testify/require" to imports.

As per coding guidelines, use testify (assert/require) for test assertions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-agent/pkg/components/carbide_test.go` around lines 211 - 223, The test
accesses response.NetworkSegments[0] without checking the slice length, which
can cause a panic; update the test in carbide_test.go to assert the slice is
non-empty before indexing (e.g., use require.NotEmpty(t,
response.NetworkSegments) or require.Greater(t, len(response.NetworkSegments),
0)) and then assign responseSegment := response.NetworkSegments[0]; add the
"github.com/stretchr/testify/require" import if needed so the test fails with a
clear assertion instead of panicking.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@site-workflow/pkg/grpc/client/instance.go`:
- Around line 146-160: The CreateInstance handler is missing mapping for
request.PhoneHomeEnabled which causes silent data loss; update the mapping in
the same block that sets OS fields (near where CustomIpxe, UserData, and
AlwaysBootWithCustomIpxe are handled) so that
carbideRequest.Config.Os.PhoneHomeEnabled is assigned from
request.PhoneHomeEnabled (and ensure the types match/unpack pointers as done
elsewhere) so the OperatingSystem.phone_home_enabled field is forwarded to the
carbide service.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3d4e83fb-2f66-44cc-a8ec-6ef068b3d9d0

📥 Commits

Reviewing files that changed from the base of the PR and between 7711d41 and a9c0eb8.

📒 Files selected for processing (1)
  • site-workflow/pkg/grpc/client/instance.go

Comment thread site-workflow/pkg/grpc/client/instance.go
@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from a9c0eb8 to d854c97 Compare April 6, 2026 20:44
reserved 9; // host_nics in carbide-core, not used in REST
repeated ExpectedHostNic host_nics = 9;
optional common.RackId rack_id = 10;
// WARNING: Conflict with Core - optional bool default_pause_ingestion_and_poweron = 11;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chet Can we please resolve Core proto differences before these changes (v1.3.0) hit Production? Otherwise ExpectedMachine inventory might break in Prod.

Comment thread Makefile Outdated
rm -rf workflow-schema/schema/site-agent/workflows/v1/*.pb.go

core-proto-fetch:
if [ -d "nico-core" ]; then cd nico-core && git fetch origin && git reset --hard origin/main; else git clone ssh://git@github.com/NVIDIA/ncx-infra-controller-core.git nico-core; fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about allowing to use a custom 'nico-core' (using a link) with something like:
[ -h "nico-core" ] || if [ -d "nico-core" ]; then cd nico-core ... else git clone ...; fi

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems helpful. Let me look into it.

@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from 6177b94 to 9d0befc Compare April 8, 2026 16:51
Comment thread Makefile
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
…roup

Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
…ed merge

Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
Signed-off-by: Tareque Hossain <thossain@nvidia.com>
…EST snapshot

Signed-off-by: Tareque Hossain <thossain@nvidia.com>
@thossain-nv thossain-nv force-pushed the chore/update-core-proto branch from 9d0befc to bdb66e0 Compare April 8, 2026 20:16
@thossain-nv thossain-nv merged commit 371c219 into main Apr 8, 2026
51 checks passed
@thossain-nv thossain-nv deleted the chore/update-core-proto branch April 8, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants