refactor: Migrate Machine API handlers to WithTx helper#475
refactor: Migrate Machine API handlers to WithTx helper#475chet wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughRefactors machine update and delete handlers to run DB mutations inside ChangesTransaction Refactor for Machine Handlers
Sequence Diagram(s)sequenceDiagram
participant Client
participant API_Handler
participant DB as Database
participant Temporal
Client->>API_Handler: Update/Delete Machine request
API_Handler->>DB: cdb.WithTx { GetByID(forUpdate), advisory locks, validations }
DB-->>API_Handler: locked machine + relations
API_Handler->>DB: apply DB mutations (assoc add/remove, status, labels, cleanup)
API_Handler->>Temporal: run synchronous site workflow
Temporal-->>API_Handler: workflow success / timeout/failure
alt workflow success
API_Handler->>DB: commit transaction (WithTx returns)
DB-->>API_Handler: commit
API_Handler->>Client: success response
else workflow timeout/failure
API_Handler->>Temporal: attempt TerminateWorkflowOnTimeOut (deferred post-rollback)
API_Handler->>DB: WithTx returns error -> rollback
API_Handler->>Client: error response
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-04 04:28:07 UTC | Commit: cda7168 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
api/pkg/api/handler/machine.go (1)
1243-1261: ⚡ Quick winUnify timeout termination handling with the shared helper to reduce drift.
Both maintenance-mode and labels branches duplicate timeout/terminate logic. Consider reusing the same deferred
common.TerminateWorkflowOnTimeOutpattern already used in the InstanceType branch for consistency and easier maintenance.♻️ Suggested consolidation pattern
+var timeoutResp func() error err = cdb.WithTx(ctx, umh.dbSession, func(mnTx *cdb.Tx) error { ... wferr = we.Get(wfCtx, nil) if wferr != nil { var timeoutErr *tp.TimeoutError if errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded || wfCtx.Err() != nil { - logger.Error().Err(wferr).Msg("failed to set/remove Machine maintenance mode, timeout occurred executing workflow on Site.") - newctx, newcancel := context.WithTimeout(context.Background(), cutil.WorkflowContextNewAfterTimeout) - defer newcancel() - serr := stc.TerminateWorkflow(newctx, wid, "", "timeout occurred executing set/remove maintenance mode Machine workflow") - if serr != nil { - return cutil.NewAPIError(http.StatusInternalServerError, fmt.Sprintf("Failed to terminate synchronous set/remove Machine maintenance mode workflow after timeout, Cloud and Site data may be de-synced: %s", serr), nil) - } - return cutil.NewAPIError(http.StatusInternalServerError, fmt.Sprintf("Failed to set/remove Machine maintenance mode, timeout occurred executing workflow on Site: %s", wferr), nil) + capturedErr := wferr + timeoutResp = func() error { + return common.TerminateWorkflowOnTimeOut(c, logger, stc, wid, capturedErr, "MachineMaintenance", "SetMachineMaintenance") + } + return cutil.NewAPIError(http.StatusInternalServerError, "workflow timeout", nil) } ... } return nil }) +if timeoutResp != nil { + return timeoutResp() +} if err != nil { return common.HandleTxError(c, logger, err, "Failed to update Machine, DB transaction error") }Also applies to: 1340-1358
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 1243 - 1261, The timeout/terminate logic in machine.go duplicates the InstanceType branch; replace the inline timeout handling (the errors.As check on wferr, context.DeadlineExceeded/wfCtx.Err() and the context.WithTimeout + stc.TerminateWorkflow call that returns API errors) with the shared helper common.TerminateWorkflowOnTimeOut so both maintenance-mode and labels branches reuse the same deferred termination pattern; specifically, remove the inline block that references wferr, wfCtx, wid and stc.TerminateWorkflow and invoke/defer common.TerminateWorkflowOnTimeOut with the same parameters/context as used by the InstanceType branch, and make the same change at the other duplicate site (the similar block around the 1340–1358 area) to ensure consistent behavior and logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1635-1638: The handler currently logs the DB error but passes the
raw error (derr) into cutil.NewAPIError which can leak internal DB details to
clients via HandleTxError; change the NewAPIError call in the machine deletion
flow (the block that logs "error while retrieving StatusDetail for Machine" and
calls cutil.NewAPIError) to pass nil (or a sanitized message) as the error
payload while keeping the detailed derr only in
logger.Error().Err(derr).Msg(...), so clients receive a generic "Failed to
delete Machine" API error without internal DB details.
---
Nitpick comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1243-1261: The timeout/terminate logic in machine.go duplicates
the InstanceType branch; replace the inline timeout handling (the errors.As
check on wferr, context.DeadlineExceeded/wfCtx.Err() and the context.WithTimeout
+ stc.TerminateWorkflow call that returns API errors) with the shared helper
common.TerminateWorkflowOnTimeOut so both maintenance-mode and labels branches
reuse the same deferred termination pattern; specifically, remove the inline
block that references wferr, wfCtx, wid and stc.TerminateWorkflow and
invoke/defer common.TerminateWorkflowOnTimeOut with the same parameters/context
as used by the InstanceType branch, and make the same change at the other
duplicate site (the similar block around the 1340–1358 area) to ensure
consistent behavior and logging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d079968-c2de-4e56-8d2a-7299e158c92c
📒 Files selected for processing (1)
api/pkg/api/handler/machine.go
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
api/pkg/api/handler/machine.go (1)
1689-1692:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTypographical error in API error message.
The word "assocations" should be "associations" in the error message returned to clients.
📝 Suggested fix
if derr != nil { logger.Error().Err(derr).Msg("error pulling instance details for Machine in DB") - return cutil.NewAPIError(http.StatusInternalServerError, "Failed to query Instance assocations for Machine", nil) + return cutil.NewAPIError(http.StatusInternalServerError, "Failed to query Instance associations for Machine", nil) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/pkg/api/handler/machine.go` around lines 1689 - 1692, Fix the typographical error in the API error message returned by the cutil.NewAPIError call: change the string "Failed to query Instance assocations for Machine" to "Failed to query Instance associations for Machine" where the error handling uses derr, logger.Error().Err(derr).Msg(...) and returns cutil.NewAPIError(...) so clients receive the corrected spelling.
🧹 Nitpick comments (1)
api/pkg/api/handler/machine.go (1)
1034-1039: 💤 Low valueWorkflow ID format inconsistency.
The workflow ID at line 1035 lacks the dash separator before the UUID suffix. Other workflow IDs in this file consistently use a trailing dash (e.g.,
"associate-machines-with-instance-type-","site-set-maintenance-","site-update-machine-metadata-").🔧 Suggested fix
workflowOptions := temporalClient.StartWorkflowOptions{ - ID: "remove-machine-instance-type-association" + machine.InstanceTypeID.String(), + ID: "remove-machine-instance-type-association-" + machine.InstanceTypeID.String(), TaskQueue: queue.SiteTaskQueue, WorkflowExecutionTimeout: cutil.WorkflowExecutionTimeout, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/pkg/api/handler/machine.go` around lines 1034 - 1039, The workflow ID string in the temporalClient.StartWorkflowOptions (variable workflowOptions) is missing the trailing dash before the machine.InstanceTypeID UUID; update the ID construction for "remove-machine-instance-type-association" to include a "-" separator before machine.InstanceTypeID.String() so it matches the other IDs (e.g., "associate-machines-with-instance-type-"), ensuring the ID becomes "remove-machine-instance-type-association-" + machine.InstanceTypeID.String().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1689-1692: Fix the typographical error in the API error message
returned by the cutil.NewAPIError call: change the string "Failed to query
Instance assocations for Machine" to "Failed to query Instance associations for
Machine" where the error handling uses derr, logger.Error().Err(derr).Msg(...)
and returns cutil.NewAPIError(...) so clients receive the corrected spelling.
---
Nitpick comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1034-1039: The workflow ID string in the
temporalClient.StartWorkflowOptions (variable workflowOptions) is missing the
trailing dash before the machine.InstanceTypeID UUID; update the ID construction
for "remove-machine-instance-type-association" to include a "-" separator before
machine.InstanceTypeID.String() so it matches the other IDs (e.g.,
"associate-machines-with-instance-type-"), ensuring the ID becomes
"remove-machine-instance-type-association-" + machine.InstanceTypeID.String().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f7bff02e-ad2e-4f5b-a43b-3d8b488be206
📒 Files selected for processing (1)
api/pkg/api/handler/machine.go
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1253-1255: The timeout helper is being called with the wrong
workflow name; update the calls to common.TerminateWorkflowOnTimeOut so the
passed workflowName matches the actual starter functions (use
"SetMachineMaintenance" where you currently pass "SetMaintenanceMode", and
"UpdateMachineMetadata" where you currently pass "UpdateMetadata"). Locate the
timeoutResp closures that call common.TerminateWorkflowOnTimeOut (variables:
timeoutResp, c, logger, stc, wid, wferr) and replace the incorrect literal
workflow name argument with the correct starter name string so logs, termination
reasons, and responses reference the actual workflow started.
- Around line 1642-1649: The code currently blocks deletion by comparing
lastStatus.Message to the literal "Machine is missing on Site"; instead, change
the check to use a structured status field (e.g., a status code/enum or boolean)
rather than free-form text: locate the branch that checks IsMissingOnSite and
replace the string comparison of lastStatus.Message with a check against a
structured field on the status (for example lastStatus.Code == "MISSING_ON_SITE"
or lastStatus.IsMissingOnSite == true), update any status creation logic to set
that structured field if it doesn't exist, and log the structured field (not the
free text) when rejecting the delete; keep references to lastStatus.Message,
IsMissingOnSite and StatusDetail.Message when updating conditions so other
callers remain consistent.
- Around line 941-1016: The transaction currently only acquires an advisory lock
on the old InstanceType (see TryAcquireAdvisoryLock using emit.InstanceTypeID)
but not on the Machine, allowing concurrent PATCHes to race; fix by acquiring an
advisory lock for the Machine at the start of the cdb.WithTx callback (use
machine.ID as lock key via cdb.GetAdvisoryLockIDFromString(machine.ID.String())
or equivalent) before calling mitDAO.GetAll/mitDAO.DeleteByID/CreateFromParams
and mDAO.Update/Clear so the read-delete-create/update sequence on machine.ID is
serialized per machine; ensure the lock acquisition error is handled similarly
to the existing lock error handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a071236b-c0dd-4b61-b9ac-5027e1014247
📒 Files selected for processing (1)
api/pkg/api/handler/machine.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/pkg/api/handler/machine.go (1)
1582-1582:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPreserve contextual logger from
SetupHandler.The reassignment at line 1582 uses the global
log.With()instead of extending the contextualloggerreturned bycommon.SetupHandler. This discards request-scoped context (trace IDs, org, user, etc.) that is valuable for correlating logs during incident investigation.Suggested fix
- logger = log.With().Str("Machine", mID).Logger() + logger = logger.With().Str("Machine", mID).Logger()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/pkg/api/handler/machine.go` at line 1582, The current reassignment uses the global log.With() and overwrites the request-scoped logger from common.SetupHandler, dropping trace/org/user context; change the code that sets logger (currently "logger = log.With().Str(\"Machine\", mID).Logger()") to extend the existing request-scoped logger returned by SetupHandler (e.g., use logger = logger.With().Str("Machine", mID).Logger()) so the new logger preserves the original context like trace IDs, org, and user.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@api/pkg/api/handler/machine.go`:
- Line 1582: The current reassignment uses the global log.With() and overwrites
the request-scoped logger from common.SetupHandler, dropping trace/org/user
context; change the code that sets logger (currently "logger =
log.With().Str(\"Machine\", mID).Logger()") to extend the existing
request-scoped logger returned by SetupHandler (e.g., use logger =
logger.With().Str("Machine", mID).Logger()) so the new logger preserves the
original context like trace IDs, org, and user.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a94a4ce4-7fc9-43a1-b8ca-62b665564cad
📒 Files selected for processing (1)
api/pkg/api/handler/machine.go
d15f8e4 to
375ecda
Compare
Apply `WithTx` (from NVIDIA#462) to all four API handlers in `machine`, including using the `TerminateWorkflowOnTimeOut` helper. This also introduces a `MachineStatusMissing` enum to improve how we handle errors for missing machines (which came as a recommendation from @coderabbitai). @coderabbitai also gave some feedback, which has also been addressed, where this: - Makes sure to use the contextual `logger` instead of the global `log` in the `maintenance-mode` workflow error path. - Passes `tx` to `common.GetInfrastructureProviderForOrg` so the read participates in the same transaction as the surrounding writes. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Description
This applies the new
WithTxpattern from #462 to themachineshandlers, including using theTerminateWorkflowOnTimeOuthelper.Keeping these PRs smaller and tightly scoped so they're:
@coderabbitai also gave some feedback, which has also been addressed, where this:
loggerinstead of the globallogin themaintenance-modeworkflow error path.txtocommon.GetInfrastructureProviderForOrgso the read participates in the same transaction as the surrounding writes.Signed-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes