All notable changes to NCX Infra Controller REST are documented in this file. Each release lists pull requests grouped by category, with the most recent version first.
-
Add support for filtering VPCs by NVLink Logical Partition (#380) VPCs can now be filtered by their associated default NVLink Logical Partition ID. Other VPC filters have also been enhanced to accept multiple values, and network security group filter validation has been improved. See the updated query parameters on Retrieve all VPCs.
-
Allow setting routing profile when creating VPCs (#350) Callers can now include `routingProfile` when creating a VPC with FNN network virtualization. The field is only accepted for FNN-type VPCs and is reflected in the API response accordingly. See the `routingProfile` field on Create VPC.
-
Allow power control to NSM and PSM without registration (#368) NVSwitch Manager and PowerShelf Manager can now receive power control commands (on/off/cycle) without requiring prior component registration, simplifying initial rack bring-up workflows.
-
Update Site Agent Helm chart to adopt Core prereqs for installation (#416) Updates the README and Helm chart to reference the `helm-prereqs` chart from infra-controller-core as the recommended installation path for bare-metal cluster setup. Also adds username keys to the common DB credentials secret template.
-
Hint at label filter syntax after TUI list output (#406) After running a list command in the TUI, a context-sensitive hint now displays available label keys and the syntax for `--label`, `--sort-label`, and `scope label` filtering. The hint is suppressed once any label filter is active.
-
Add user-defined task schedules (#392) Extends the RLA scheduler to support user-defined task schedules, allowing operators to configure custom recurring jobs beyond the built-in inventory sync and leak detection schedules.
-
Add support for Delta PMC vendor in Powershelf Manager (#331) PSM now supports Delta as a power shelf vendor alongside the existing Liteon support, broadening hardware compatibility for power management.
-
Support explicit rule ID override in RLA sequence requests (#404) Callers can now specify an operation rule by ID when submitting rack operations, bypassing the normal priority chain (rack association, default, hardcoded) and using the requested rule directly.
-
Reject DB connection when no encryption (#428) Restores the PostgreSQL SSL mode from `disable` (introduced accidentally in v1.1.0 during a DSN builder refactor) back to `prefer`, so that the client attempts TLS first and falls back gracefully. This fixes connection failures against Postgres servers that only accept encrypted connections via `hostssl` rules.
-
Add Site Agent manager for VPC Peering and fix workflows (#424) Adds the missing VPC Peering manager to the Site Agent and fixes VPC Peering workflows to correctly require the VPC Peering ID during creation on Site.
-
Mark ipBlockId as required in VPC Prefix create request (#414) Corrects the OpenAPI schema and CLI/TUI to reflect that `ipBlockId` is a required field when creating a VPC Prefix, matching the server-side validation that already enforced this. See Create VPC Prefix.
-
Prompt for allocation constraints in TUI allocation create (#413) Fixes the TUI allocation creation flow to properly prompt for the required constraint (resource type, IP block selection, constraint type, and value), which was previously missing.
-
Send protocolVersion and routingType on IP block create (#412) Fixes IP Block creation in the CLI/TUI by including the `protocolVersion` and `routingType` parameters that were previously omitted from the request, causing creation failures.
-
Fix Expected Machine OpenAPI misnamed fields for BMC default credentials (#421) Corrects field names for BMC default username and password in the Expected Machine OpenAPI schema, resolving mismatches between the spec and the actual API behavior. See Expected Machine endpoints.
-
Resolve RLA inventory component manager ID sync issue (#409) Ensures machine IDs are synced on every inventory loop iteration, removing a conditional skip that could leave `external_id` stale and cause leak detection to fail. Also updates default operation timeouts and fixes misleading error messages in component target resolution.
-
Update SSH Key Group status after successful sync to Site (#411) Fixes a bug where the overall SSH Key Group status was not updated after a successful sync to a Site — only the per-site association status was being set, leaving the parent resource in a stale state.
-
Prevent duplicate --data flag panic on carbidecli create commands (#401) Fixes a panic in `carbidecli dpu-extension-service create` (and similar commands) caused by a flag name collision when a request body property is named `data`. Colliding properties are now registered under a `body-` prefix.
-
Resolve required query param for tenant-account list in TUI (#408) Fixes the TUI tenant-account list command that was failing due to a missing required query parameter.
-
Error when flags are placed after a positional argument in carbidecli (#400) Adds detection of misordered flags placed after positional arguments in `carbidecli`, providing a clear error message instead of sending a malformed HTTP request.
-
Remove deprecated Instance/Allocation relationships (#371) Completes the transition from per-instance allocation linkage to aggregate allocation enforcement. Instance creation now validates against total reserved capacity for a tenant's instance type at a site, and allocation constraint updates check against total instance usage. See the updated Allocation schema.
-
Move machine ID lock acquisition before record pull (#405) Fixes a race condition in instance creation where the machine record could become stale between initial read and lock acquisition. The lock is now acquired before pulling the record, ensuring all subsequent checks operate on current data.
-
Fix OpenAPI URL for IP Block, SSH Key/Group and Tenant Account in TUI/CLI (#397) Corrects 20 URL path segments in the TUI that were using hyphenated display names instead of the actual API paths (e.g., `ip-block` instead of `ipblock`), fixing silent 404s on list, get, create, update, and delete operations for these four resource types.
-
Revise Allocation status enum and attribute descriptions in OpenAPI spec (#395) Aligns the Allocation status enum values in the OpenAPI schema with database constants (e.g., `Registered` was missing), fixing deserialization errors when listing Allocations. Adds comprehensive attribute descriptions across Allocation models. See the updated Allocation schema.
-
Infer Provider/Tenant from org for Site update and Fabric retrieval endpoints (#372) Extends org-based identity inference to Site update and Fabric retrieval endpoints, removing the need to pass `infrastructureProviderId` or `tenantId` query parameters. See updated parameters on Update Site and Retrieve all Sites.
-
Support reflashing the same firmware version in PSM (#393) Allows PowerShelf Manager to re-apply the same firmware version that is already installed, enabling recovery scenarios where a re-flash is needed without a version change.
-
Include IP block flag in VPC prefix create log (#426) Updates the CLI hint text for VPC Prefix creation to include the required IP Block flag.
- Create generic Execute interface and workflow/activity registries in RLA (#419) Introduces a generic `Execute` interface and type-safe workflow/activity registries in the RLA task executor, replacing ad-hoc registration with a structured pattern. Consolidates shared execution logic, adds comprehensive registry tests, and improves discoverability of available actions.
- Add Getting Started section in OpenAPI schema (#402) Adds a Getting Started section to the API documentation, providing a clear onboarding path for new users. HTTP 200 and 201 responses are now auto-expanded for better discoverability of response schemas.
-
Reduce tests duration by up to 36% (#431) Optimizes PostgreSQL test container configuration by trading durability for speed (disabling fsync, full page writes, and synchronous commit), reducing local test runs from ~10:38 to ~6:46.
-
Add Grype container vulnerability scan to build-push-service (#418) Integrates Grype as an additional container vulnerability scanner in the Docker build-and-push workflow, complementing existing security scanning.
-
Add GitHub workflow to ensure Core protobuf is up to date (#391) Adds a CI check that verifies generated protobuf code matches the current proto files, preventing drift between proto definitions and generated Go code.
-
Clean up deprecated workflows/activities in Site Agent (#375) Removes legacy workflows and activities from Site Agent that have been superseded by the Site Workflow module. Deletes custom proto objects that had drifted from Core. Retained workflows for Temporal CLI based deletion now have a `ByID` suffix for clarity.
-
Switch from deprecated attributes for VPC Prefix create/update request to Site (#423) Migrates VPC Prefix create and update API handlers from deprecated Core proto attributes to their current replacements.
-
Add back standard SDK module after repo rename (#265) Re-adds the standard SDK Go module that was removed during the repository rename, fixing import paths. SDK consumers should update imports to `github.com/NVIDIA/infra-controller-rest/sdk/standard`.
-
Configure NSM and PSM to run in-memory mode by default (#410) NVSwitch Manager and PowerShelf Manager now default to in-memory firmware storage mode, eliminating the PostgreSQL dependency for these services in standard deployments.
-
Update Instance creation API test, validate unhealthy Machine flag sent to Core (#407) Strengthens Instance creation tests to verify the `allowUnhealthyMachine` flag is correctly forwarded to Core.
-
Fix idempotency issue for warning comments in Core proto format script (#389) Makes `make core-proto-fmt` fully idempotent by preventing duplicate warning comment and block insertion on repeated runs.
-
Re-generate Core protobuf to align with latest proto files (#390) Runs `make core-protogen` to sync generated Go code with the current `*_carbide.proto` definitions on the main branch.
-
Add system job scheduler in RLA with trigger and overlap policies (#352) Replaces ad-hoc inventory sync and leak detection go-routines with a structured scheduling framework. Each job is defined with a configurable trigger (timer, cron, trigger-once, or event-driven), an overlap policy, and a worker, providing graceful and forceful shutdown support.
-
Add support for updating InfiniBand Partition data on Site (#334) Implements end-to-end InfiniBand Partition update propagation to the Site Controller. The API handler now starts a site workflow after a successful update to REST DB cache, wiring through proto definitions, Temporal workflows, and activities consistent with the existing create/delete patterns.
-
Add net.HardwareAddr wrapper for BMC MAC JSON marshaling (#369) Introduces a `net.HardwareAddr` wrapper type that provides proper JSON marshaling and unmarshaling for BMC MAC addresses, replacing raw byte-slice serialization with human-readable colon-separated format.
-
Include name in update request of NVLink Partition Update (#373) Ensures new or existing partition name is included in NVLink Logical Partition update requests to Site, since Site expects the update request to reflect the full data.
-
Require TLS certs by default for RLA/PSM/NVSM and IPAM server (#333) RLA, PSM, and NSM now refuse to start without TLS certificates unless `ALLOW_INSECURE_GRPC=true` is explicitly set, hardening the default security posture. The IPAM gRPC server now also supports (and requires) a TLS specification.
-
Update default firmware update sequence for NSM to only include BMC and BIOS updates (#376) Narrows the default NSM firmware update sequence to BMC and BIOS components only, excluding unnecessary sub-component updates that could cause longer maintenance windows.
-
Prepare for Machine/InstanceType Association ID deprecation (#367) Adds Machine ID as a replacement for the Instance Type/Machine Association ID when removing an assignment, and introduces a dated deprecation window for association IDs so that clients can migrate smoothly.
-
Include NVLink and InfiniBand Interfaces while cleaning up Instance resources (#366) Fixes instance termination cleanup to also delete associated NVLink and InfiniBand interfaces, preventing orphaned network interface records. Includes a DB migration to remove previously orphaned interfaces.
-
Harden scheduler dispatcher correctness and unit tests of RLA (#364) Eliminates shared-state race conditions in the scheduler dispatcher by using `forceCtx` as the parent for all job contexts, fixes event-draining on queue exhaustion, and replaces timing-sensitive tests with deterministic assertions.
-
Added status field in NVLink Interface summary API model (#363) Adds the missing `status` field to the NVLink Interface summary API model, allowing consumers to view interface status when listing NVLink Interfaces within an NVLink Partition.
-
Fix bringup sequence, NSM stale records, and unify tray/rack type enums (#377) Addresses several issues found during rack bring-up and firmware update testing: replaces the default BringUp rule's ingestion-based power-on with standard `PowerControl` to avoid BMC MAC lookup failures, restructures firmware upgrade sequencing from parallel to staged execution (compute then NVLSwitch then power recycle), fixes Temporal serialization loss of `FirmwareControlTaskInfo` across child workflow boundaries, filters stale firmware update records in NSM's `GetUpdatesForSwitch` to prevent old failures from masking current successes, and unifies component type enum naming across Tray and Rack API endpoints to PascalCase (`Compute`, `NVLSwitch`, `PowerShelf`, etc.).
-
Add dev mode for RLA service (#360) Introduces an `RLA_ENV` environment variable that gates development-only features: gRPC reflection is enabled only in dev mode, and the log level defaults to debug in dev mode versus info in production, preventing accidental exposure of diagnostic interfaces in deployed environments.
-
Skip config filter in DB if no config query params are set when retrieving all Sites (#379) Fixes a bug where the Site list handler unconditionally applied an empty JSONB config filter, causing sites with a NULL config column to be silently excluded from results. Site listing now only applies config filtering when at least one config query parameter is explicitly provided.
-
Maintain association record when Instance Type is updated in Machine inventory (#383) When a Machine's Instance Type changes during inventory sync, the Machine/InstanceType association record is now updated alongside the Machine attribute itself, keeping both representations consistent until the association ID is fully deprecated.
-
Have PSM read firmware files at startup time rather than using an embedded filesystem (#385) Switches PowerShelf Manager from compile-time embedded firmware binaries to runtime file loading at startup, allowing firmware images to be updated by replacing files on disk without recompiling the service.
- Require Ready status for targeted machine instance creation (#357) Targeted instance creation now enforces that the specified machine must be in `Ready` status, or in `Error` (health alerts) or `Maintenance` status with the Core state being `Ready` (when the `allowUnhealthyMachine` flag is set).
-
Replace hardcoded API name in path in TUI using helper (#356) Replaces all 86 hardcoded `/v2/org/{org}/nico/...` path strings in the TUI with calls to a new `apiPath` helper, making path construction consistent with the SDK's configurable API name support.
-
Rename Site Agent and mock Core/RLA server binary (#365) Renames Site Agent and mock server binaries as part of the Site Agent v2 preparation, and removes residual database references from the stateless agent.
-
Update Core proto and improve firmware update sequencing in RLA (#361) Aligns RLA snapshot of Core proto, improves firmware version matching between input requests and observed state, and enables RLA to update the `firmware_autoupdate` flag for machines.
-
Update Core proto snapshot for REST components (#251) Introduces an idempotent `make core-proto` script that automates Core proto file snapshotting with handling for backwards-incompatible changes and REST-specific additions. Also removes deprecated non-paginated object retrieval fallback methods from Site Agent.
-
Add changelog with detailed history of released tags up to v1.2.1 (#359) Adds a comprehensive CHANGELOG.md with professional descriptions for every pull request across all 12 released versions.
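
To illustrate the BMC MAC change in #369, a wrapper of this kind might look like the minimal sketch below. The type name `MACAddress` and the error wording are assumptions for illustration; the actual wrapper in the repository may differ in name and detail.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// MACAddress is a hypothetical wrapper name (not taken from the repository).
// Embedding net.HardwareAddr keeps its methods, such as String, available.
type MACAddress struct {
	net.HardwareAddr
}

// MarshalJSON emits the colon-separated text form instead of the
// base64 string that encoding/json produces for a plain byte slice.
func (m MACAddress) MarshalJSON() ([]byte, error) {
	return json.Marshal(m.HardwareAddr.String())
}

// UnmarshalJSON parses a JSON string such as "aa:bb:cc:00:11:22".
func (m *MACAddress) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return fmt.Errorf("MAC must be a JSON string: %w", err)
	}
	hw, err := net.ParseMAC(s)
	if err != nil {
		return fmt.Errorf("invalid MAC %q: %w", s, err)
	}
	m.HardwareAddr = hw
	return nil
}
```

With this shape, a struct field of type `MACAddress` round-trips through `json.Marshal` and `json.Unmarshal` as a readable string, and upper-case input is normalized to lower case by `net.HardwareAddr.String`.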
v1.2.1 — 2026-04-07
-
Allow Instance Interfaces to span multiple VPCs (#300) Instances can now attach interfaces to VPC prefixes from different VPCs, enabling multi-VPC networking per instance. The primary interface must still belong to the instance's primary VPC, and NSG propagation status now reflects the aggregate state across all attached VPCs.
-
Infer Tenant/Provider context from org when retrieving Allocation/Instance Type (#217) Removes the requirement for callers to pass explicit `infrastructureProviderId` and `tenantId` query parameters on Allocation and Instance Type endpoints by inferring identity from org membership. Dual-role users now receive a merged view from both provider and tenant perspectives in a single call.
-
Modify NSM and PSM Vault credential format to match NICo's standard pattern (#341) Aligns the credential format that NVSwitch Manager and PowerShelf Manager expect in Vault with the standard credential pattern used by NICo, ensuring consistency across component managers.
-
Revise PatchComponent, add DeleteRack, PurgeRack, PurgeComponent gRPC APIs in RLA (#320) Extends RLA gRPC APIs with soft-delete and permanent purge operations for racks and components. PatchComponent now supports BMC updates, improving rack lifecycle management capabilities.
-
Add Tenant Account search query and fix tsquery parsing (#315) Implements the `query` parameter for tenant account listing, enabling search by account number, tenant org, or display name. Also fixes a bug where multiple consecutive spaces in search queries would generate invalid PostgreSQL `to_tsquery` syntax, causing 500 errors.
-
Aggregate NVSwitch sub-component firmware statuses in NICo path in RLA (#355) Fixes a map-overwrite bug where only the last sub-component firmware status survived per switch when Core returned multiple statuses (BMC, CPLD, BIOS, NVOS). An aggregation function now correctly reports failure if any sub-component fails, and missing switches are reported as Unknown.
-
Prevent UpdateTaskStatus from overwriting started_at with NULL in RLA task executor (#354) The `started_at` column was unconditionally included in UPDATE statements, causing finished-status writes to overwrite the stored timestamp with NULL. The column list is now built dynamically so `started_at` is only set during the Running transition.
-
Align PMC Vault path with Core/NSM (#353) Corrects the Vault path used for storing PMC credentials in PowerShelf Manager to match the convention used by Core and NVSwitch Manager, resolving credential lookup failures.
-
Correct child workflow timeout/error and improve compute firmware update lifecycle in RLA (#348) Fixes timeout budget miscalculation where child workflows shared the same timeout as individual activities, leaving no room for retries. Also adds fail-fast firmware version validation against Core, idempotent scheduling for already-complete machines, and proper error attribution when steps are skipped.
-
Make PSM registration credentials-optional and idempotent (#347) PowerShelf Manager registration no longer requires credentials upfront, and re-registration of the same shelf is now safely idempotent rather than returning an error.
-
Only report inherited VPC propagation when interfaces are attached (#342) NSG inheritance from parent VPCs is now only reported when interfaces are actually attached to instances, preventing misleading propagation status on instances with no active interfaces.
-
Prevent modification of virtualization type for VPCs with Subnets or Instances (#343) The API now rejects virtualization type changes for tenant-owned VPCs that already have attached subnets or instances, preventing disruptive configuration changes on in-use VPCs.
-
Normalize MAC address case in NSM/PSM vault credential lookups (#345) Fixes a case mismatch between Go's lowercase MAC addresses and Core's uppercase Vault paths that caused credential lookups to silently fail in NVSwitch Manager and PowerShelf Manager.
-
Auto-migrate NSM database schema on startup (#340) NVSwitch Manager now applies database migrations automatically on startup, matching the existing behavior of PowerShelf Manager and eliminating manual migration steps.
-
Add health-report override and idempotent power-option handling in RLA (#335) RLA power control now properly marks machines with health-report overrides before operations and cleans them up afterwards. Power state transitions that are already in the desired state are treated as no-ops instead of failing the entire operation.
-
Add 10 MiB request body size limit to prevent OOM (#330) Adds a global 10 MiB request body size limit using Echo's BodyLimit middleware, preventing the audit body middleware from buffering arbitrarily large payloads and eliminating an OOM crash vector for authenticated mutating endpoints.
-
Infer Provider and Tenant ID from org association when creating Operating Systems (#344) Removes unnecessary `TenantID` and `InfrastructureProviderID` parameters from OS creation, since both are already derivable from the caller's org association.
-
Add HTTP read/write/idle timeout in Echo config (#339) Configures well-defined read, write, and idle timeouts for the HTTP server, mitigating potential Slowloris-style denial-of-service attacks.
-
Tune RLA power operation timer configuration (no PR) Adjusts timer settings for RLA power operations to better accommodate real-world operation durations.
-
Add detailed review config for CodeRabbit (#337) Updates CodeRabbit configuration to provide more contextual and well-informed automated code reviews.
-
Add codeowners file to repository (#295) Introduces a CODEOWNERS file so reviewers are automatically assigned to pull requests based on file ownership.
-
Optimize NVLink Logical Partition lookup in Instance update API handler (#332) Replaces inefficient one-at-a-time DB lookups of NVLink Logical Partitions with a batch query, and adds proper distinction between internal DB errors and non-existent partitions.
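
The tsquery fix in #315 can be pictured with the following minimal sketch, which is not the repository's implementation. The function name `buildTSQuery` and the choice of the `&` (AND) operator are assumptions; the point is only that splitting on runs of whitespace avoids the empty terms that consecutive spaces would otherwise produce.

```go
package main

import "strings"

// buildTSQuery is a hypothetical helper that turns free-text input into a
// string suitable for PostgreSQL's to_tsquery. strings.Fields splits on any
// run of whitespace, so "acme   corp" yields two terms and never an empty
// one (an empty term would give invalid syntax such as "acme &  & corp").
// Joining with "&" is an assumption; this sketch does not escape tsquery
// operator characters that may appear inside a term.
func buildTSQuery(input string) string {
	terms := strings.Fields(input)
	return strings.Join(terms, " & ")
}
```

An input consisting only of whitespace yields an empty string, which a caller would presumably treat as "no search filter" rather than pass to `to_tsquery`.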
v1.2.0 — 2026-03-31
-
Add full CRUD parity for TUI interactive mode (#305) Extends the CLI's interactive TUI with create, update, and delete commands for all major resource types including sites, VPCs, subnets, instances, allocations, and more. Instance creation scopes machine selection to the VPC's site for a more intuitive workflow.
-
Add label display, filtering, and sorting to TUI (#306) Adds a LABELS column to all label-bearing resources in the TUI, introduces persistent label scope filtering via `scope label key=value`, per-command `--label` filtering with AND logic, and `--sort-label` sorting. Also adds `instance-type list` and `instance-type get` commands.
-
Add interactive instance create/delete and keybinding help (#294) Adds guided interactive flows for creating and deleting instances in the TUI, with VPC and machine selection, name prompts, and optional OS selection. Includes escape-to-cancel support and updated keybinding documentation.
-
Add configurable API name support to SDK REST clients (#322) Introduces an `APIName` configuration hook in both the standard and simple SDK clients, allowing callers to target deployments that use a non-default API path segment without modifying generated code.
-
Update RLA inventory sync to use Core for switch and powershelf management (#321) RLA inventory sync now routes through Core when the component manager for switches and powershelves is configured accordingly, enabling unified inventory management.
-
Add DPU extension observability config options (#291) Adds support for Prometheus and logging observability configuration when creating or updating DPU extension services, enabling operators to instrument DPU workloads at deployment time.
-
Dual-write Expected Inventory REST APIs into Core and RLA (#303) Expected Inventory API calls now write to both Core and RLA, ensuring that the Launch Layer's inventory data reaches both the cloud orchestration layer and the on-site rack-level administration system.
-
Add pagination fallback in RLA gRPC (#318) RLA now falls back to default pagination parameters when upstream callers omit them, preventing panics from nil pagination structs. Also removes obsolete code.
-
Trigger power-off via task manager on leak detection (#308) When a coolant leak is detected in a tray, the system now automatically triggers a power-off operation through the RLA task manager, providing a safety response to prevent hardware damage.
-
Add NVLink Switch inventory sync with NSM registration and drift detection (#298) Implements automatic NVLink Switch inventory synchronization with NVSwitch Manager registration and configuration drift detection, ensuring switch state is kept current.
-
Add REST API models and endpoints for VPC peering (#257) Introduces API models and CRUD endpoints for VPC peering, enabling cross-VPC network connectivity between tenants or within a provider's infrastructure.
-
Initial implementation of tray leak detection (#297) Adds the foundational leak detection subsystem for monitoring coolant leaks in rack trays, providing the sensor data pipeline for automated safety responses.
-
Add mTLS support to RLA CLI and refactor cert packages (#299) Enables mutual TLS authentication for RLA CLI commands and consolidates certificate handling into a shared package, aligning with the security posture of other NICo services.
-
Add AGENTS.md info file for repo (#292) Adds an AGENTS.md file providing comprehensive guidance for AI coding agents working in the repository, covering project structure, build commands, coding conventions, and CI/CD workflows.
-
Support task conflict detection on component-level (#278) Extends the RLA task framework to detect conflicts at the individual component level, preventing overlapping operations on the same hardware component.
-
Add API model and endpoint for retrieving RLA Tasks (#252) Exposes RLA task status and history through the REST API, allowing users to track the progress of rack-level operations such as firmware upgrades and power control.
-
Support firmware upgrades in-memory mode for PSM (#277) Adds support for in-memory firmware upgrade mode in PowerShelf Manager, providing a faster firmware update path for supported hardware.
-
Implement NICo provider for NVLSwitch and PowerShelf component managers (#256) Introduces a NICo-backed provider for managing NVLink Switches and PowerShelves, enabling these component types to be managed through the standard NICo Core API path.
-
Refactor task workflow and component manager in RLA (#269) Restructures the RLA task workflow execution and component manager architecture for better maintainability and extensibility.
-
Add Site-Agent bootstrap hook support in helm chart (#263) Adds support for custom bootstrap hooks in the Site-Agent Helm chart, enabling site-specific initialization logic during deployment.
-
Allow explicit IP selection when creating/updating instances (#271) Enables callers to specify explicit IP addresses when creating or updating instances, rather than relying solely on automatic allocation from VPC prefixes.
-
Populate APIError source attribute from API name (#287) Replaces the hardcoded `nico` source in structured API errors with the configured API name, ensuring error responses accurately identify the originating service. This is a breaking change for clients that matched on `source == "nico"`.
-
Strip leading comma from Docker image tags in CI (#323) Fixes the CI Docker tag generation that produced malformed tags like `,nvcr.io/.../image:tag` due to empty string initialization with comma-prefixed appends.
-
Load RLA Temporal client certificates using cert module (#319) Generates certificate and key file paths using the Kubernetes workload standard convention, fixing TLS certificate loading for RLA's Temporal client connections.
-
Support clearing NVLink Logical Partition ID in VPC update (#284) Allows passing an empty string for NVLink Logical Partition ID in VPC update requests to explicitly clear the default partition assignment.
-
Fixed legacy RLA and PSM makefiles VERSIONS usage and removed obsolete dockerfiles (#286) Corrects how legacy RLA and PSM makefiles reference VERSION files and removes obsolete Dockerfiles that were no longer in use.
-
Reset IsUsableByTenant when machine goes missing on Site (#317) When a machine stops being reported by the Site Controller, `IsUsableByTenant` is now correctly reset to `false` alongside the status change, preventing stale tenant usability flags in the API response.
-
Return all trays when rackId/rackName not specified in GET /tray (#312) Fixes a 500 Internal Server Error when querying trays without specifying `rackId` or `rackName`; the API now correctly returns all trays across all racks in the site.
-
Defaulted native_networking and network_security_group to true (#309) Updates default site configuration to enable native networking and network security group support for newly created sites, as all current deployments require these features.
-
Delete NVLink interfaces when Site reports config as synced (#267) NVLink interfaces are now properly cleaned up in the database when the Site Controller reports configuration as fully synced, resolving stale interface records.
-
Prevent VPC Prefix deletion when Instance Interfaces are present (#285) Blocks deletion of VPC prefixes that have active Instance Interfaces using them, preventing orphaned network references.
-
Enable NVLink deviceType specification in Instance Type update (#283) Allows specifying NVLink `deviceType` when updating Instance Types, and permits NVLink deviceType on GPU capabilities, fixing validation gaps in the Instance Type configuration.
-
Permit NVLink deviceType on GPU capabilities for instance type (#280) Removes an incorrect validation that rejected NVLink deviceType when specified alongside GPU capabilities on Instance Types.
-
Empty array must be passed to underlying SDK when caller specifies one (#275) Fixes a bug where explicitly passing an empty array in API requests was silently ignored instead of being forwarded to the SDK, preventing callers from clearing list fields.
- Exclude ingestion from default bringup sequence rule (#311) Removes ingestion from the default bringup sequence in RLA rules, allowing ingestion to be handled separately from the standard rack bringup workflow.
- Fix routing type enum in IP Block create request schema (#313) Corrects an extraneous space in the routing type enum value within the OpenAPI schema for IP Block creation requests.
- Remove trivy scan job (#307) Removes the Trivy container vulnerability scanning job from the CI pipeline.
-
Remove email from OpenAPI specs and auto-generated files (#302) Strips personal email addresses from the OpenAPI specification and all associated auto-generated files.
-
Move common and cert-manager to subchart (#290) Restructures the Helm chart to package common utilities and cert-manager as subcharts, improving modularity and deployment flexibility.
-
Update issue templates to standardize and remove duplicates (#288) Consolidates and standardizes GitHub issue templates, removing duplicate templates and ensuring consistent contributor experience.
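
The empty-array fix in #275 rests on a distinction that Go makes between a nil slice and an empty one. The sketch below shows one way such a check could be written; the `updateRequest` type, its `labels` field, and the helper names are all hypothetical and are not taken from the repository.

```go
package main

import "encoding/json"

// updateRequest is a hypothetical request body used only for illustration.
type updateRequest struct {
	Labels []string `json:"labels"`
}

// parse decodes a JSON body. encoding/json leaves Labels nil when the key
// is absent, and sets it to a non-nil empty slice for an explicit [].
func parse(body string) (updateRequest, error) {
	var req updateRequest
	err := json.Unmarshal([]byte(body), &req)
	return req, err
}

// shouldForward reports whether the caller supplied the list at all.
// Testing len(v) > 0 instead would treat an explicit [] like an omitted
// field, so a request to clear the list would be silently dropped.
func shouldForward(v []string) bool {
	return v != nil
}
```

Under these assumptions, a body of `{"labels": []}` is forwarded as a request to clear the list, while `{}` leaves the list untouched.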
v1.1.0 — 2026-03-17
-
Allow different Logical Partitions for NVLink Interfaces on Instance creation/update (#225) NVLink Interfaces no longer need to share the same Logical Partition; each interface can now reference a different partition. Validation has been tightened to require unique GPU indices within machine bounds, and duplicate detection now uses partition + deviceInstance composite keys.
-
Add optional DB secret volume mount in Helm Chart (#260) Adds a secrets.dbCreds option to the Helm chart that allows reading the database password from a mounted Kubernetes Secret instead of plaintext in ConfigMap, enabling secure production deployments.
-
Add task conflict detection and queue framework (#233) Introduces a task conflict detection and queuing framework in RLA that prevents overlapping operations on the same resources, ensuring safe concurrent task execution.
-
Add total assigned Machines to Instance Type allocation stats (#245) Includes the total number of assigned machines in Instance Type allocation statistics, giving providers better visibility into resource utilization per instance type.
-
Standardize API enum formatting and extend Tray response fields (#241) Normalizes enum value formatting across API responses and adds additional fields to Tray endpoint responses for richer component metadata.
-
Allow filtering Machines by whether missing on site (#243) Adds a query parameter to filter machines by their isMissingOnSite status, enabling operators to quickly identify machines that have stopped reporting from the site.
-
Add REST endpoints for ExpectedPowerShelf and ExpectedSwitch (#220) Introduces CRUD REST API endpoints for managing expected power shelf and switch inventory, maintaining parity with the existing ExpectedMachine endpoints.
-
Add API endpoints for RLA rack bring up (#206) Exposes REST API endpoints for initiating rack bringup operations through RLA, enabling orchestrated rack-level provisioning workflows.
-
Port NV-Switch manager into the repository (#192) Integrates the NVSwitch Manager service directly into the nico-rest repository, consolidating switch management alongside other component managers.
-
Add stale issue or PR check workflow (#221) Introduces an automated GitHub workflow that identifies and labels stale issues and pull requests, helping maintain repository hygiene.
-
Add simple Go SDK focused on easier Instance creation (#177) Provides a streamlined Go SDK at sdk/simple/ with a simplified API surface focused on common Instance creation workflows, complementing the full-featured generated SDK.
-
Added API endpoints for listing InfiniBand and NVLink Interfaces across Instances (#218) Adds cross-instance listing endpoints for InfiniBand and NVLink Interfaces with filtering by Instance ID and partition ID, resolving gaps in network interface discoverability.
-
Invalidate all scope-filtered resource types on scope change in CLI (#249) Fixes stale data in the TUI by invalidating all cached scope-filtered resources when the user changes their site or VPC scope.
-
Include scope args in CLI interactive mode command printout (#248) Scope arguments are now included in the command printout during interactive mode, making it clear which site/VPC context is active.
-
Add mock methods for ExpectedPowerShelf and ExpectedSwitch to NICoTest (#239) Adds the missing mock implementations needed for testing ExpectedPowerShelf and ExpectedSwitch handlers.
-
Unwrap full Temporal error chain in UnwrapWorkflowError (#240) Ensures the complete Temporal error chain is unwrapped in workflow error handling, providing accurate error messages to API callers instead of generic wrapper errors.
-
Use proto field getters in Site workflow logs to prevent nil panics (#238) Replaces direct proto field access with getter methods in Site workflow logging, preventing nil pointer panics when optional fields are absent.
-
Omit terminating DPU Extension Service deployments in update request to Site (#219) Filters out DPU Extension Services that are in a terminating state from update requests sent to the Site Controller, preventing conflicts with in-progress deletions.
-
Handle nil gRPC client in Site Agent workflows (#235) Adds nil checks for the RLA gRPC client in Site Agent workflows, preventing panics when the RLA integration is disabled.
-
Correct NSM binary path in Dockerfile for CI extraction (#234) Fixes the binary path in the NVSwitch Manager Dockerfile to allow CI to correctly extract the built binary for artifact upload.
-
Rename Bare Metal Manager module/references to NCX Infra Controller (#262) Updates the Go module path and all documentation references from "Bare Metal Manager" to "NCX Infra Controller", reflecting the project's official naming.
-
Update NICo proto in RLA and refactor RLA inventory loop sync (#253) Refreshes proto definitions from bare-metal-manager-core and refactors the inventory sync loop to eliminate a redundant FindMachinesByIds call. Firmware version syncing is extracted into a dedicated function.
-
Rename ExpectedPowerShelf and ExpectedSwitch IDs (#258) Renames the id fields on ExpectedPowerShelf and ExpectedSwitch to more descriptive names aligned with NICo Core, and adds missing handler tests.
-
Pass a JSON-safe struct to executor instead of domain Rack (#236) Replaces the domain Rack object with a serialization-safe struct when passing data to task executors, preventing JSON marshaling issues in Temporal workflows.
-
Adopt the common.SetupHandler for more handlers (#231) Migrates additional API handlers to use the standardized common.SetupHandler pattern, reducing boilerplate and improving consistency across the codebase.
-
Consolidate duplicated database access and Temporal config code (#193) Extracts shared database connection and Temporal configuration code into common packages, eliminating duplication across services.
-
Create wrapped errors instead of formatted ones (#229) Replaces fmt.Errorf with errors.Wrap for proper error chaining, improving debuggability through preserved error cause chains.
-
Clean up duplicate package imports (#227) Removes redundant package import aliases and consolidates inconsistent import styles across the codebase.
-
Rename legacy cloud-api references to nico-rest-api (#226) Updates remaining references from the legacy "cloud-api" naming to "nico-rest-api" for consistency with the current project identity.
-
Add deployment guide based on latest kustomization logic (#228) Adds a comprehensive deployment installation guide reflecting the current Kustomize-based deployment structure and component dependencies.
-
Add site machineStats to schema (#244) Documents the machineStats field in the site response schema, making the machine statistics structure discoverable in the OpenAPI spec.
-
Update auth documentation to match latest deployment manifests (#224) Revises authentication documentation to accurately reflect the current Keycloak deployment configuration and auth flow.
-
Update CodeRabbit config to prevent PR description update, reduce noise (#255) Disables CodeRabbit's PR description updates that were causing the PR Title Checker workflow to stall on description change events.
-
Enable CodeRabbit PR review (#250) Enables CodeRabbit.ai for automated code reviews on submitted pull requests, replacing intermittent Copilot functionality.
-
Upgrade Trivy action version (#222) Updates the Trivy security scanning action to the latest version for improved vulnerability detection.
-
Fix Trivy update comment logic (#210) Corrects the logic for updating Trivy scan results in PR comments to prevent duplicate comment creation.
-
Support helm deploy on KinD in makefile (#232) Adds Makefile targets for deploying the Helm chart to a local KinD cluster, streamlining the local development workflow.
-
Improved handling of a few swallowed errors (#230) Surfaces several previously swallowed errors in handler and workflow code, improving observability of failure conditions.
-
Rename site-agent chart deployment to statefulset (#208) Changes the Site Agent Helm chart from a Deployment to a StatefulSet, better reflecting the agent's stateful operational requirements.
v1.0.6 — 2026-03-06
-
Disable Image-based OS for Instance creation/update (#176) Temporarily disables image-based OS selection for instance provisioning due to unresolved issues with SOL-based SSH access and URL accessibility from remote sites.
-
Add rla rack create CLI command and examples directory (#191) Introduces a rla rack create CLI command that accepts JSON file or raw JSON data input, along with an examples directory containing sample rack configurations for GB200 NVL72 racks.
-
Add RLA IngestRack API for injecting expected components (#189) Implements the rack ingestion feature that reads component data from RLA's database and routes it to the appropriate component manager (NICo for compute, PSM for power shelves), following the existing task framework pattern.
-
Add NVSwitch Manager plugin for RLA (#172) Adds NVSwitch Manager as a new backend for managing NVLink Switch components within the Rack Level Administration system.
-
Add Go sub-module for SDK at sdk/standard/ (#187) Creates a standalone Go sub-module for the generated SDK, allowing downstream providers to import the API client without pulling in the full module's heavy dependencies (Postgres, Redis, Temporal, etc.).
-
Added support for explicit VPC ID and VNI (#166) Enables API consumers to specify explicit VPC IDs and VNIs during VPC creation instead of relying solely on auto-allocation. Separates requested VNI from active VNI in the API, database, and workflow layers.
-
Add bmmcli TUI interactive mode with config selector (#165) Introduces an interactive REPL mode for the CLI with multi-environment config switching, inline autocomplete, scope filtering, org switching, and command history. Supports all 20+ resource types with zero external TUI dependencies.
-
UpdateInstance API: Preserve DPU extension fields if unset (#214) Fixes a bug where omitting DPU extension service fields in an Instance update request would remove existing extensions, since the REST API's partial-update semantics weren't properly converting to Core's whole-replace gRPC API. Omitted fields are now preserved, and explicitly empty arrays trigger removal.
-
Set NVLinkPartition status based on response from Site (#134) NVLink Logical Partition creation now correctly persists the status returned by the Site Controller (e.g., "ready") instead of relying on the initial creation state.
-
Updated stats pagination limit, maxAllocatable formula, and decommissioned machine handling (#195) Fixes several issues in instance type statistics: resolves a pagination bug that capped results at 20 items, corrects the maxAllocatable calculation, and excludes decommissioned machines from allocatable counts.
-
Allow longer prefix length for VPC prefixes (#179) Extends the maximum VPC prefix length to allow /31 prefixes, matching the per-instance /31 allocation size used in VPC virtualization.
-
Added missing entries for RLA and PSM in build-push-docker.yaml (#181) Adds the missing Docker build and push entries for RLA and PowerShelf Manager in the CI workflow configuration.
- Refactor Firmware and BringUp workflows to rule-based execution (#190) Migrates FirmwareControl and BringUp workflows from hardcoded sequential logic to the same rule-based execution pattern used by PowerControl, extracting shared stage iteration logic into a common helper.
- Correct schemas for Rack/Tray endpoint response examples in OpenAPI spec (#196) Fixes Location schema references and adds BMC data in component response examples. Also relaxes the CI check to not require SDK regeneration for example-only changes.
- Update style check workflow to error on go fmt failure (#194) Changes the formatting check from a silent pass to a proper error, ensuring CI fails when code doesn't conform to go fmt standards.
-
Move dynamic TLS handler to common package (#207) Relocates the dynamic TLS certificate loader from Cert Manager to the common package, making it reusable by the API, Workflow service, and the next-generation Site Agent.
-
Add pull request template (#205) Introduces a standardized pull request template to ensure consistent information is provided by contributors when opening PRs.
-
Remove DB references from Site Agent (#203) Removes residual database references from the Site Agent, which operates as a stateless service with no direct database access.
-
Rename bmmcli to cli (#188) Renames the CLI binary from bmmcli to cli, updating the package name, Makefile target, shell completion scripts, and documentation.
-
Add bare metal manager rest helm chart (#186) Introduces the initial Helm chart for deploying the full NICo REST stack to Kubernetes, including API, workflow, cert-manager, site-agent, and mock-core components.
-
Update publish chart jobs (#199) Updates CI jobs for Helm chart publishing with revised configurations.
-
Update helm jobs to workflow (#198) Migrates Helm deployment jobs to reusable GitHub Actions workflows for better maintainability.
-
Add back kustomization per-components (#148) Restores per-component Kustomize structure and kind cluster setup scripts to closely resemble production deployments, with proper installation ordering and mTLS certificate fetching for Site Agent.
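
To make the /31 limit from the VPC prefix entry (#179) concrete, here is a minimal Go sketch of a prefix-length check built on the standard net/netip package. The validateVPCPrefix helper, the maxVPCPrefixLen constant, and the IPv4-only restriction are assumptions made for this example; the project's actual validation may differ.

```go
package main

import (
	"fmt"
	"net/netip"
)

// maxVPCPrefixLen mirrors the /31 limit mentioned in the changelog entry.
const maxVPCPrefixLen = 31

// validateVPCPrefix is a hypothetical helper: it parses a CIDR string,
// requires IPv4 (an assumption of this sketch), and rejects prefixes
// longer than maxVPCPrefixLen.
func validateVPCPrefix(cidr string) error {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return fmt.Errorf("invalid prefix %q: %w", cidr, err)
	}
	if !p.Addr().Is4() {
		return fmt.Errorf("prefix %q is not IPv4", cidr)
	}
	if p.Bits() > maxVPCPrefixLen {
		return fmt.Errorf("prefix %q is longer than /%d", cidr, maxVPCPrefixLen)
	}
	return nil
}

func main() {
	for _, cidr := range []string{"10.0.0.0/24", "10.0.0.0/31", "10.0.0.0/32", "not-a-prefix"} {
		if err := validateVPCPrefix(cidr); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", cidr)
	}
}
```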
v1.0.5 — 2026-03-02
-
Allow filtering Machines by the presence of associated Instance (#180) Providers and privileged Tenants can now filter the machine list to show only machines with or without associated instances, aiding capacity management.
-
Handle duplicate key error on SSH Key Group creation by switching to sync workflow (#113) Switches SSH Key Group creation from asynchronous to synchronous workflow execution, enabling proper error handling for duplicate key conflicts and intelligent retry logic.
-
Allow providers to filter Machines by Tenant IDs (#168) Adds a Tenant ID filter to the Machine list endpoint, allowing Providers to quickly determine which machines are currently allocated to a specific Tenant.
-
Add RLA rule engine logic (#167) Introduces the rule engine for RLA task execution, enabling configurable operation sequences for rack-level operations with support for custom step ordering and conditional execution.
-
Add Powershelf Manager service as a component (#161) Integrates the PowerShelf Manager (PSM) service into the repository as a first-class component manager for power shelf hardware.
-
Add Rack Level Administration site config flag to gate RLA endpoints (#162) Adds a RackLevelAdministration boolean flag to site configuration that controls access to all rack and tray API endpoints, returning 412 Precondition Failed when not enabled.
-
Add API endpoints for RLA power control and firmware update (#157) Introduces eight new REST API endpoints for power control and firmware upgrade operations on racks and trays, supporting both single-resource and batch operations with configurable power states.
-
Add site-level stats endpoints for GPU, instance type, and tenant allocation (#137) Exposes aggregate statistics at the site level for GPU utilization, instance type allocation, and per-tenant machine assignment.
-
Add bmmcli — OpenAPI-driven CLI for Bare Metal Manager REST API (#160) Introduces a fully OpenAPI-driven CLI that dynamically builds commands from the embedded API spec at startup, covering all 124 operations. Features include OIDC login, NGC API key exchange, auto-pagination, and multiple output formats.
-
Add Tray validation REST API endpoints (#156) Adds tray validation endpoints that compare expected versus actual component state through RLA, following the same pattern as existing rack validation.
-
Add unknown query parameter validation for handlers (#149) Rejects requests containing unrecognized query parameters with a 400 Bad Request, preventing typos from being silently ignored and improving workflow deduplication accuracy.
-
Add API models and endpoints for Tray management (#128) Introduces GET endpoints for retrieving individual trays and listing trays with rich filtering support (by rack, type, component ID, and task ID), bridging the REST API with RLA's tray data.
-
Add rack validate REST API endpoints (#122) Adds rack validation endpoints that invoke RLA's ValidateRackComponents workflow to verify expected versus actual component state.
-
Add labels support in InstanceType (#102) Adds label (key-value metadata) support to Instance Types in both the API and database models, enabling richer categorization and filtering.
-
Relocate RLA code to bare-metal-manager-rest (#119) Transfers the Rack Level Administration codebase from its internal repository into the main nico-rest repo, consolidating all management services in one place.
-
Add generated Golang client for OpenAPI spec (#129) Generates and checks in a Go API client from the OpenAPI specification, providing the foundation for the CLI tool implementation.
-
Aligned InfiniBand Partition proto snapshot with Core (#164) Synchronizes the InfiniBand Partition protobuf definitions with NICo Core, fixing an invalid pkey value error caused by proto misalignment.
-
Unify machine status breakdown into single reusable type (#171) Consolidates multiple machine status breakdown representations into a single reusable type, eliminating inconsistencies across different API endpoints.
-
Replace nico with NICo in Expected Machine/Audit API route prefix (#158) Updates the legacy /nico/ route prefix to /nico/ for Expected Machine and Audit endpoints across the OpenAPI spec, SDK, and handler code.
-
Correct binary_name and binary_path for nico-rla build (#153) Fixes a copy-paste error where the RLA build job used the cert-manager binary path, causing CI to fail when extracting the built binary.
-
Verify if VPC Prefixes are present before allowing network Allocation deletion (#132) Adds a safety check to prevent deletion of network Allocations that still have associated VPC Prefixes, avoiding orphaned network resources.
-
Resolve data race in RLA workerpool metrics (#145) Fixes a data race condition in RLA's workerpool metrics by using consistent atomic synchronization, resolving intermittent test failures.
-
Fix link to Core repo in README (#146) Corrects a broken documentation link to the BMM Core repository in the README.
-
Fix invalid NVLink Interface order, DPU Extension Service delete code in schema (#150) Corrects ordering and response code errors in the OpenAPI schema for NVLink Interface and DPU Extension Service endpoints.
-
Fix NVLink Interfaces endpoint parameters (#147) Corrects parameter definitions for NVLink Interface endpoints in the OpenAPI schema.
-
Fix schema tag for NVLink Interfaces endpoint (#141) Fixes incorrect schema tags on the NVLink Interfaces endpoint in the OpenAPI specification.
-
Add check to ensure SDK and docs are regenerated when OpenAPI spec changes (#174) Adds a CI check that validates generated SDK files and documentation stay in sync with OpenAPI spec changes, with clear guidance when regeneration is needed.
-
Remove duplicate Slack notification trigger in new PR workflow (#159) Removes the redundant pull_request trigger that was duplicating Slack notifications already handled by pull_request_target.
-
Enable Slack notification for all PRs (#152) Switches to pull_request_target to safely support forked PR notifications, updates the Slack message format, and adds automatic cancellation of superseded CI pipelines.
-
Allow privileged Tenants to retrieve all Sites (#144) Enables Tenants with targeted Instance creation capability to retrieve all Sites owned by Providers they have Tenant Accounts with.
-
Consolidate kustomize objects into nico-rest namespace (#140) Consolidates all Kustomize objects into a single nico-rest namespace for cleaner deployment organization.
-
Update RLA Dockerfile ldflags to use bare-metal-manager-rest module path (#136) Updates Go linker flags in the RLA Dockerfile to use the correct module path after the repository migration.
-
Release SDK for version 1.0.4 (#142) Publishes the generated SDK matching the v1.0.4 API surface.
-
Fix role name and grammar in OpenAPI spec (#139) Corrects the Provider viewer role name and fixes grammatical errors in the OpenAPI specification.
-
Rename local deployment elements (#138) Renames internal deployment components (elektraserver to mock-core, cluster name to nico-rest-local) for clearer naming.
-
Remove vault references in setup (#135) Cleans up remaining Vault references from the local setup scripts after the migration to native Go PKI.
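
The unknown query parameter validation described in this release (#149) can be sketched in Go as a comparison between the parsed query string and an allow-list. The unknownQueryParams function and the siteId and pageSize parameter names are hypothetical, chosen only for illustration; the handlers' real parameter sets are not shown in this changelog.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// unknownQueryParams parses rawQuery and returns, in sorted order, every
// parameter name that is not present in the allowed set.
func unknownQueryParams(rawQuery string, allowed map[string]bool) ([]string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, err
	}
	var unknown []string
	for name := range values {
		if !allowed[name] {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown)
	return unknown, nil
}

func main() {
	// Hypothetical allow-list for a single list endpoint.
	allowed := map[string]bool{"siteId": true, "pageSize": true}

	// "pagesize" is a typo for "pageSize" and should be reported.
	unknown, err := unknownQueryParams("siteId=abc&pagesize=50", allowed)
	if err != nil {
		fmt.Println("400 Bad Request: malformed query:", err)
		return
	}
	if len(unknown) > 0 {
		fmt.Println("400 Bad Request: unknown query parameters:", unknown)
		return
	}
	fmt.Println("query accepted")
}
```

Matching is case-sensitive here, so a mistyped parameter is reported instead of being silently dropped.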
v1.0.4-rc1 — 2026-03-05
- Aligned InfiniBand Partition proto snapshot with Core (#164) Backport of the InfiniBand Partition proto alignment fix from v1.0.5, resolving pkey value parsing errors caused by proto definition mismatches with NICo Core.
v1.0.4 — 2026-02-17
- Fix version extraction from VERSION file with copyright notice (#127) Corrects the CI version extraction logic that broke after a copyright notice was added to the VERSION file, ensuring the build pipeline correctly parses the semantic version.
v1.0.3 — 2026-02-13
-
Add make command to publish rendered OpenAPI schema to docs/pages (#116) Adds a make publish-openapi command that renders the OpenAPI schema to GitHub Pages, providing a publicly accessible API reference.
-
Add support to filter sites by NVLink Partition (#101) Enables filtering the Sites list by NVLink Partition, helping users locate sites with specific NVLink configurations.
-
Add API models and endpoints for Rack management (#79) Introduces REST API endpoints for reading rack inventory via RLA, including single-rack retrieval and list operations with Temporal-based site routing.
-
Use NVLink domain ID instead of rack for batch instance topology optimization (#90) Switches batch instance creation topology optimization from rack-based placement to NVLink domain ID-based placement, improving GPU interconnect locality.
-
Replace embedded Vault in Cert Manager with native Go PKI (#59) Eliminates the embedded Vault sidecar in the cert-manager pod by implementing native Go PKI using crypto/x509. Reduces the pod from 4 containers to 1, saving ~150MB memory and removing Vault initialization overhead while maintaining the same security model.
-
Check NVLink, multi-DPU or InfiniBand capability of Machine in Instance create/update (#84) Validates that the selected machine's hardware capabilities match the requested NVLink, multi-DPU, or InfiniBand interfaces, falling back to machine-level capability checks when the Instance Type alone doesn't satisfy the requirements.
-
InfiniBand partition metadata support (#69) Adds labels/metadata support to InfiniBand Partition resources in the API, database, and workflow schema, enabling custom key-value annotation of IB partitions.
-
Allow filtering NVLink Interface API endpoint by Instance ID/Logical Partition ID (#89) Extends the NVLink Interface list endpoint with filters for Instance ID, Logical Partition ID, and Domain ID, including instance and partition summaries in responses.
-
Add RLA gRPC client in Site Agent and Site Workflow (#67) Introduces the RLA gRPC client in the Site Agent and Site Workflow services with TLS and certificate hot-reload support. RLA is disabled by default via RLA_ENABLED=false for safe incremental rollout.
-
Remove ElasticSearch Temporal dependency from local kind deployment (#95) Removes the ElasticSearch dependency from the local KinD Temporal deployment, reducing resource requirements for local development environments.
-
Prevent Instance update when missing on Site (#124) Blocks Instance update requests when the instance is marked as missing on Site, since the Site Controller cannot apply updates to unreachable instances. Deletion is still permitted.
-
Switch license for Rack management files from NVIDIA to Apache 2.0 (#118) Corrects the license headers on rack management files from NVIDIA proprietary to Apache 2.0, aligning with the repository's open-source license.
-
Limit name uniqueness to per Site, per Tenant when creating InfiniBand Partition (#107) Relaxes the global uniqueness constraint on InfiniBand Partition names to be scoped per Site per Tenant, allowing different tenants to use the same partition name on the same or different sites.
-
Revise DPU Extension Service deletion and version deletion logic (#110) Fixes multiple issues in DPU Extension Service lifecycle: deletion now immediately updates the cache instead of waiting for inventory sync, and deleting the latest version correctly falls back to the most recent older version.
-
Give execution permission to the setup-local.sh script (#114) Adds execute permission to the local setup script, fixing make kind-reset failures.
-
Increase Keycloak memory limit (#112) Increases the Keycloak container memory limit from 1 GiB to 3 GiB to prevent OOM kills during startup on systems with certain memory configurations.
-
Use 127.0.0.1 in the output of make preview-openapi (#111) Replaces localhost with 127.0.0.1 in the preview URL output to avoid connection failures on systems that resolve localhost to IPv6 ::1.
-
Fixed OpenAPI schema issues, added validation scripts (#105) Fixes numerous issues in the OpenAPI schema and adds make lint-openapi and make preview-openapi commands for ongoing schema validation.
-
Send description along with create and update NVLink partition request (#87) Includes the description field in NVLink Logical Partition create and update requests to the Site Controller, fixing an "organization ID should not be updated" error.
-
Return Machine board serial, product name and vendor in DMI data (#97) Exposes board serial number, product name, and vendor from DMI data in Machine responses, with backward-compatible partial deserialization for older machine records.
-
Check if NVLink Logical Partition is already connected on Instance NVLink Interface update (#83) Adds validation to verify whether a provided NVLink Logical Partition ID matches an already-configured partition before attempting an update, preventing conflicting partition assignments.
- Enforce PR title format (#100) Adds a CI check that validates PR titles follow the conventional commit format with a lowercase category prefix and a capitalized description of at least 20 characters.
-
Update API name config, README for OpenAPI schema development (#117) Adds the missing API name configuration and comprehensive OpenAPI schema development instructions.
-
Update core proto generation command in Makefile (#109) Updates the Makefile command for regenerating Core protobuf definitions.
-
Update Github Go module path to bare-metal-manager-rest (#108) Migrates the Go module path to the new GitHub repository location.
-
Start renaming NICo to NVIDIA Metal Manager (#99) Begins the rebranding effort from NICo to NVIDIA Metal Manager across documentation and configuration.
-
Regenerate third party license file (#103) Regenerates the third-party license file to reflect current dependency state.
-
Update license file to Apache 2.0, remove NVIDIA license file headers (#93) Converts the project license to Apache 2.0 and removes proprietary NVIDIA license headers from source files.
-
Update Temporal and Postgres URLs to use local DNS (#98) Updates service URLs to use Kubernetes local DNS names for improved reliability in local development environments.
-
Add GitHub interaction templates and code of conduct (#96) Adds GitHub issue templates, discussion templates, and a code of conduct to standardize community interactions.
-
Upgrade otelecho to v0.65.0 to fix security vulnerability (#91) Updates the OpenTelemetry Echo middleware to address a known security vulnerability.
-
Remove Kubernetes files (#94) Removes legacy Kubernetes manifest files that have been superseded by Helm charts and Kustomize.
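
The PR title rule from this release (#100) can be approximated with a regular expression. The Go sketch below is only an approximation under stated assumptions: the exact pattern used by the CI check is not given in this changelog, so titlePattern, the optional parenthesized scope, and validTitle are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// titlePattern is an assumed encoding of the rule: a lowercase category
// prefix, an optional scope in parentheses, a colon and a space, then a
// description that starts with a capital letter and is at least 20
// characters long in total.
var titlePattern = regexp.MustCompile(`^[a-z]+(\([a-z0-9-]+\))?: [A-Z].{19,}$`)

// validTitle reports whether title satisfies titlePattern.
func validTitle(title string) bool {
	return titlePattern.MatchString(title)
}

func main() {
	titles := []string{
		"fix: Handle nil gRPC client in Site Agent workflows",
		"Fix: Handle nil gRPC client in Site Agent workflows",
		"fix: handle nil gRPC client in Site Agent workflows",
		"fix: Too short",
	}
	for _, title := range titles {
		fmt.Printf("%-5v %q\n", validTitle(title), title)
	}
}
```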
v1.0.2 — 2026-01-29
- Use revised matching regex for release tags (#81) Corrects the regex pattern used for matching release version tags in CI workflows, ensuring proper triggering of release pipelines.
v1.0.1 — 2026-01-28
-
Add proto and protobuf for RLA gRPC API (#66) Introduces the protobuf definitions and generated code for the RLA gRPC API, establishing a clear separation between RLA and NICo Core proto structures.
-
Support custom claims for JWT issuers (#41) Adds comprehensive custom JWT issuer support with multiple claim mapping strategies: static org with static roles, service accounts, dynamic roles from token attributes, and fully dynamic org/role extraction. Includes parallel JWKS fetching, configurable timeouts, and stricter validation rules.
-
Add schema for Instance batch API (#61) Adds OpenAPI schema documentation for the instance batch creation API endpoint.
-
Enable license scanning (#64) Adds automated license scanning to the CI pipeline for dependency compliance verification.
-
Revise NVLink Logical Partition inventory metadata update logic (#74) Fixes a nil pointer exception caused by incorrect placement of the nil evaluation in NVLink Logical Partition metadata update code.
-
Missing body close on HTTP responses (#71) Adds deferred Body.Close() calls on HTTP responses in Slack notification code, preventing resource leaks.
-
Verify provided version in DPU Extension Service active version list for deletion (#63) Validates that the specified DPU Extension Service version exists in the active version list before allowing deletion, preventing attempts to delete non-existent versions.
-
Fix docker image artifact path in upload workflow (#78) Aligns the Docker image artifact path in the upload workflow with the actual location of built images.
-
Fix docker image tag for artifact extraction/upload in workflows (#77) Corrects Docker image tags used during artifact extraction to match the updated image naming convention.
-
Add pre-commit for secrets scanning (#68) Integrates TruffleHog as a pre-commit hook for automated secrets scanning, preventing accidental credential commits.
-
Update scanner actions with improved PR comment handling (#75) Security scanners (TruffleHog, Trivy, CodeQL) now update existing PR comments instead of creating duplicates, and only post when scan status changes.
-
Refactor and optimize Instance API handlers (#60) Converts N+1 individual GetByID calls to efficient batch GetAll queries for subnets, VPC prefixes, partitions, and DPU extensions in the instance update handler. Adds bulk NVLink Interface status update support.
-
Use non-deprecated UserData field for OS create/update (no PR) Migrates site-agent from the deprecated UserData field to the canonical one that applies across all OS types.
-
Revise docker image push policy (#62) Implements a structured Docker tagging policy: VERSION-SHA for main branch, latest for most recent main build, version tags for releases, and branch-name tags for feature branches with opt-in push-container keyword.
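
The handler refactor in this release (#60) replaces per-item lookups with batch queries. The Go sketch below shows the general shape of that change using an in-memory stand-in that counts queries. The subnet and fakeStore types and the method signatures are hypothetical and only borrow the GetByID and GetAll names from the entry; they are not the project's data-access API.

```go
package main

import "fmt"

// subnet is a hypothetical row type.
type subnet struct {
	ID   string
	Name string
}

// fakeStore is an in-memory stand-in for a database that counts how many
// queries were issued against it.
type fakeStore struct {
	rows    map[string]subnet
	queries int
}

func newFakeStore() *fakeStore {
	return &fakeStore{rows: map[string]subnet{
		"s1": {ID: "s1", Name: "alpha"},
		"s2": {ID: "s2", Name: "beta"},
		"s3": {ID: "s3", Name: "gamma"},
	}}
}

// GetByID fetches one row and costs one query per call.
func (s *fakeStore) GetByID(id string) (subnet, bool) {
	s.queries++
	row, ok := s.rows[id]
	return row, ok
}

// GetAll fetches every requested row and costs one query in total.
func (s *fakeStore) GetAll(ids []string) map[string]subnet {
	s.queries++
	found := make(map[string]subnet, len(ids))
	for _, id := range ids {
		if row, ok := s.rows[id]; ok {
			found[id] = row
		}
	}
	return found
}

func main() {
	ids := []string{"s1", "s2", "s3"}

	perItem := newFakeStore()
	for _, id := range ids {
		perItem.GetByID(id)
	}

	batched := newFakeStore()
	found := batched.GetAll(ids)

	fmt.Println("per-item queries:", perItem.queries)
	fmt.Println("batched queries:", batched.queries, "rows:", len(found))
}
```

With the batched form, the query count stays constant as the number of referenced resources grows, and the returned map still allows per-ID lookups when validating the request.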
v1.0.0 — 2026-01-20
-
Add support for bulk creation/update of Expected Machines (#35) Introduces batch API endpoints for creating and updating Expected Machine entries, enabling efficient onboarding of large hardware inventories in a single request.
-
Add unique DB index for Expected Machine BMC MAC address per Site (#36) Adds a unique database index on BMC MAC addresses scoped per Site, preventing duplicate machine registration and ensuring data integrity during bulk imports.
-
Embed Vault in cert-manager pod (#50) Moves Vault from a standalone deployment to a sidecar container in the cert-manager pod, and deploys cert-manager.io with a Vault ClusterIssuer. Site-manager TLS certificates are now issued through cert-manager.io instead of init containers.
-
Retry local site creation during setup (#45) Adds retry logic for local site creation during make kind-reset, handling the race condition where site-manager may not be fully ready when the setup script runs.
-
Add name as orderBy field in DpuExtensionService (#53) Adds name as a valid ordering field for DPU Extension Service list queries.
-
Update OpenAPI spec with NVLink schema and DPU Extension Service modifications (#52) Updates the OpenAPI schema to include NVLink-related schemas and corrects DPU Extension Service endpoint definitions.
-
Make sure NVLink interface is marked as Pending when sending update request (#40) Ensures newly added NVLink interfaces are set to Pending status when included in an update request, correctly reflecting their provisioning state.
-
Fix inconsistent test data for MachineCapability GetAll tests (#34) Corrects test fixtures that were producing inconsistent results in MachineCapability GetAll test cases.
-
Wait for mock NICo server to be up before running site-agent tests (#37) Fixes flaky site-agent tests by ensuring the mock NICo gRPC server is fully started before test execution begins, resolving ~50% test failure rates on some machines.
-
Only run promotion workflow on main branch (#42) Restricts the release promotion job to the main branch, preventing unmerged PR branches from showing a perpetually pending "Promote to Release Candidate" check.
-
Updated Go version to 1.25.4 and fixed unit tests (#31) Upgrades the Go toolchain to 1.25.4 across all modules and fixes unit tests affected by the version change.
-
Add golangci-lint and revive config files (#30) Adds the previously missing golangci-lint and revive configuration files required for consistent linting across the project.
-
Revise Slack notification payload for new PR (#57) Fixes the Slack notification message format for new pull request events.
-
Update API version in schema, add local preview script (#44) Updates the API version in the OpenAPI schema and adds a local Redoc preview script for browsing the rendered API documentation.
-
Enable binary/docker build for all branches, add security scans (#54) Extends CI to build binaries and Docker images on all branches (push-only for main/release/tags) and adds TruffleHog, Trivy, and CodeQL security scanning.
-
Do not accept all whitespace strings in Expected Machine attributes and labels (#49) Rejects all-whitespace strings for Expected Machine attributes and label keys, preventing accidental entry of blank values.
-
Improve SiteID validation for APIExpectedMachineUpdateRequest (#46) Strengthens SiteID validation in Expected Machine update requests and updates the OpenAPI schema with batch operation documentation.
-
Set up binaries build action and build condition (#43) Adds a dedicated GitHub Actions workflow for cross-platform Go binary builds (linux/amd64, linux/arm64, darwin/arm64) with selective triggering on main, release branches, and tags.
v0.1.0 — 2026-01-09
- Initial NICo REST release for GitHub (no PR) The foundational release of NCX Infra Controller REST, establishing the multi-tenant REST API for bare-metal lifecycle management. Includes the core API server, Temporal workflow engine integration, site management, certificate management, authentication via Keycloak/JWT, database layer with PostgreSQL, and the initial OpenAPI specification.