Skip to content

Refactor old special steps#245

Merged
maxdml merged 67 commits intomainfrom
refactor-old-special-steps
Feb 4, 2026
Merged

Refactor old special steps#245
maxdml merged 67 commits intomainfrom
refactor-old-special-steps

Conversation

@maxdml
Copy link
Collaborator

@maxdml maxdml commented Jan 31, 2026

Move setEvent and send to be run throughout runAsTxn.

Ideally we'd run getEvent, recv, and sleep using runAsTxn, but this is currently very challenging because:

  • getEvent/recv update the stepID on their own, to generate a step ID for sleep. This complicates the whole logic of checkpointing
  • sleep has its own logic to check whether it executed or not, to become durable

Also:

  • Allow setEvent to be sent within a step
  • Allow retrying transactions during runAsTxn for CRDB
  • Fix > 1 msg consumption in Recv
  • Allow setting isolation level for runAsTxn (default read committed)

@maxdml maxdml force-pushed the refactor-old-special-steps branch from f1bde3e to 5699b9f Compare February 3, 2026 03:41
@maxdml maxdml changed the base branch from main to fix-txn-retries February 3, 2026 21:54
@maxdml maxdml marked this pull request as ready for review February 4, 2026 00:35
Comment on lines -1683 to -1686
if wfState.isWithinStep {
return 0, newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Sleep within a step"))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lifted to pre-sysdb invocation

Comment on lines -2150 to -2152
if wfState.isWithinStep {
return nil, newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Recv within a step"))
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lifted to pre-sysdb invocation

if err != nil {
c.logger.Error("failed to insert workflow status", "error", err, "workflow_id", workflowID)
return err
return newWorkflowExecutionError(workflowID, fmt.Errorf("failed to insert workflow status: %w", err))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually is a small breaking change -- wrap the sysdb error into a yet-more-explanatory workflow error. (Hence the changes to use error.Is to look for a wrapped error in the entire error tree.)

dbos/workflow.go Outdated
err = retry(uncancellableCtx, func() error {
return c.systemDB.recordChildWorkflow(uncancellableCtx, childInput)
}, withRetrierLogger(c.logger))
err = c.systemDB.recordChildWorkflow(uncancellableCtx, childInput)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undeeded retry within a larger retry + was buggy because retry cancel was uncancellable.

Comment on lines -2060 to -2062
if wfState.isWithinStep {
return newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Send within a step"))
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lifted to pre-sysdb invocation

Comment on lines -2265 to -2276
WITH oldest_entry AS (
SELECT destination_uuid, topic, message, created_at_epoch_ms
FROM %s.notifications
WHERE destination_uuid = $1 AND topic = $2
ORDER BY created_at_epoch_ms ASC
LIMIT 1
)
DELETE FROM %s.notifications
WHERE destination_uuid = (SELECT destination_uuid FROM oldest_entry)
AND topic = (SELECT topic FROM oldest_entry)
AND created_at_epoch_ms = (SELECT created_at_epoch_ms FROM oldest_entry)
RETURNING message`, pgx.Identifier{s.schema}.Sanitize(), pgx.Identifier{s.schema}.Sanitize())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query can delete more than 1 message if created_at_epoch_ms is within the same millisecond. This has been surfaced by 1) not running send() inside a transaction outside of a workflow and 2) a recent change to the Golang migration where we now have:

created_at_epoch_ms BIGINT NOT NULL DEFAULT (EXTRACT(epoch FROM now())::numeric * 1000)::bigint,

instead of

created_at_epoch_ms BIGINT NOT NULL DEFAULT (EXTRACT(epoch FROM now()) * 1000.0)::bigint,

The first line converts the return value of now() (double precision, float8) to a numeric, which result in the *1000 multiplication being exact numeric and quite stable.

The second line (what we have in Python) does the multiplication on double precision, then converts to bigint, which can do a truncation. Because multiplication on floating points can (often) have errors, this meant more volatility in the truncation, which, I think could have contributed to obfuscate this bug.

Base automatically changed from fix-txn-retries to main February 4, 2026 01:30
@maxdml maxdml merged commit 61adb38 into main Feb 4, 2026
4 of 5 checks passed
@maxdml maxdml deleted the refactor-old-special-steps branch February 4, 2026 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants