Refactor old special steps by maxdml · Pull Request #245 · dbos-inc/dbos-transact-golang

maxdml · 2026-01-31T00:42:57Z

Move setEvent and send to run through runAsTxn.

Ideally we'd run getEvent, recv, and sleep using runAsTxn, but this is currently very challenging because:

getEvent/recv update the stepID on their own to generate a step ID for sleep, which complicates the whole checkpointing logic.
sleep has its own logic to check whether it has already executed, which is how it becomes durable.

Also:

Allow setEvent to be called within a step
Allow retrying transactions during runAsTxn for CRDB
Fix Recv consuming more than one message
Allow setting isolation level for runAsTxn (default read committed)
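The retry and isolation-level items above can be pictured with a minimal sketch. It is not the PR's implementation: `runAsTxnSketch`, `txnRunner`, `errSerializationFailure`, and `demoRetry` are hypothetical names, and the database is replaced by a fake that rejects the first two attempts. Only `database/sql`'s isolation-level constants are real API. The sketch assumes that an unset level falls back to read committed and that only SQLSTATE 40001 is retried.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// errSerializationFailure is a hypothetical stand-in for a driver error
// carrying SQLSTATE 40001 (serialization_failure).
var errSerializationFailure = errors.New("SQLSTATE 40001: serialization failure")

// txnRunner is a hypothetical seam: it begins a transaction at the given
// isolation level, runs body, then commits (or rolls back on error).
type txnRunner func(level sql.IsolationLevel, body func() error) error

// runAsTxnSketch reruns the whole transaction when it fails with a
// serialization failure. An unset level is mapped to read committed.
func runAsTxnSketch(run txnRunner, level sql.IsolationLevel, maxAttempts int, body func() error) error {
	if level == sql.LevelDefault {
		level = sql.LevelReadCommitted
	}
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = run(level, body)
		if err == nil || !errors.Is(err, errSerializationFailure) {
			return err // success, or an error that is not retried
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// demoRetry simulates a database that rejects the first two attempts.
func demoRetry() (attempts int, level string, err error) {
	fake := func(l sql.IsolationLevel, body func() error) error {
		attempts++
		level = l.String()
		if attempts < 3 {
			return errSerializationFailure
		}
		return body()
	}
	err = runAsTxnSketch(fake, sql.LevelDefault, 5, func() error { return nil })
	return attempts, level, err
}

func main() {
	attempts, level, err := demoRetry()
	fmt.Println(attempts, level, err)
}
```

Retrying the whole transaction body, rather than a single statement, matters because a serialization failure invalidates everything the transaction has read so far.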

This reverts commit 3891a07.

maxdml · 2026-02-02T18:13:26Z

dbos/system_database.go

-	if wfState.isWithinStep {
-		return 0, newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Sleep within a step"))
-	}
-


Lifted to pre-sysdb invocation
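A minimal sketch of what lifting the check could look like, assuming a hypothetical `sleepSketch` wrapper, a hypothetical `systemDB` interface, and a fake database that counts calls. The point is only that the guard runs before the system database is touched. None of these names come from the PR except `isWithinStep` and `workflowID`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// workflowState mirrors only the two fields visible in the diff.
type workflowState struct {
	workflowID   string
	isWithinStep bool
}

// systemDB is a hypothetical, narrowed-down interface for this sketch.
type systemDB interface {
	sleep(d time.Duration) (time.Duration, error)
}

// fakeSysDB counts calls so we can see whether the database was reached.
type fakeSysDB struct{ calls int }

func (f *fakeSysDB) sleep(d time.Duration) (time.Duration, error) {
	f.calls++
	return d, nil
}

var errWithinStep = errors.New("cannot call Sleep within a step")

// sleepSketch rejects the call up front, before invoking the system database.
func sleepSketch(db systemDB, wf workflowState, d time.Duration) (time.Duration, error) {
	if wf.isWithinStep {
		return 0, fmt.Errorf("workflow %s: %w", wf.workflowID, errWithinStep)
	}
	return db.sleep(d)
}

func main() {
	db := &fakeSysDB{}
	_, err := sleepSketch(db, workflowState{workflowID: "wf-1", isWithinStep: true}, time.Second)
	fmt.Println(err, db.calls)
}
```

With the guard in the caller, the system database layer no longer needs to know about step nesting. The same pattern would apply to the Recv and Send checks removed below.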

maxdml · 2026-02-02T18:13:55Z

dbos/system_database.go

-	if wfState.isWithinStep {
-		return nil, newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Recv within a step"))
-	}


Lifted to pre-sysdb invocation

maxdml · 2026-02-02T18:14:49Z

dbos/workflow.go

 		if err != nil {
 			c.logger.Error("failed to insert workflow status", "error", err, "workflow_id", workflowID)
-			return err
+			return newWorkflowExecutionError(workflowID, fmt.Errorf("failed to insert workflow status: %w", err))


This is actually a small breaking change: the sysdb error is now wrapped in a more explanatory workflow error. (Hence the changes to use errors.Is to look for a wrapped error in the entire error tree.)
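To illustrate why callers are affected, here is a small sketch using only the standard library. `workflowExecutionError`, `errSysDB`, `insertWorkflowStatus`, and `matchesSysDB` are hypothetical stand-ins, not the library's types. Once the original error is wrapped, a direct `==` comparison stops matching, while `errors.Is` still finds it by walking the `Unwrap` chain.

```go
package main

import (
	"errors"
	"fmt"
)

// errSysDB is a hypothetical sentinel for a system database failure.
var errSysDB = errors.New("connection refused")

// workflowExecutionError is a hypothetical wrapper type for this sketch.
type workflowExecutionError struct {
	workflowID string
	err        error
}

func (e *workflowExecutionError) Error() string {
	return fmt.Sprintf("workflow %s: %v", e.workflowID, e.err)
}

// Unwrap exposes the inner error so errors.Is and errors.As can traverse it.
func (e *workflowExecutionError) Unwrap() error { return e.err }

// insertWorkflowStatus pretends the database call failed and wraps the error twice.
func insertWorkflowStatus(workflowID string) error {
	err := errSysDB
	return &workflowExecutionError{
		workflowID: workflowID,
		err:        fmt.Errorf("failed to insert workflow status: %w", err),
	}
}

// matchesSysDB reports whether errSysDB appears anywhere in the error tree.
func matchesSysDB(err error) bool { return errors.Is(err, errSysDB) }

func main() {
	err := insertWorkflowStatus("wf-1")
	fmt.Println(err == errSysDB)   // direct comparison no longer matches
	fmt.Println(matchesSysDB(err)) // errors.Is still finds the wrapped error
}
```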

maxdml · 2026-02-02T18:15:00Z

dbos/workflow.go

-			err = retry(uncancellableCtx, func() error {
-				return c.systemDB.recordChildWorkflow(uncancellableCtx, childInput)
-			}, withRetrierLogger(c.logger))
+			err = c.systemDB.recordChildWorkflow(uncancellableCtx, childInput)


Unneeded retry nested within a larger retry. It was also buggy because the retry ran on an uncancellable context, so it could not be cancelled.
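One reading of the cancellation problem, sketched with the standard library only: a retry loop that checks its context between attempts never sees a cancellation if it is handed a context detached from its parent. `retrySketch` and `demoCancellation` are hypothetical and much simpler than the library's retry helper. `context.WithoutCancel` requires Go 1.21 or later and is used here only as a stand-in for however `uncancellableCtx` is built.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// retrySketch calls fn up to maxAttempts times and stops early if ctx is done.
// It returns the number of attempts actually made.
func retrySketch(ctx context.Context, maxAttempts int, fn func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if ctxErr := ctx.Err(); ctxErr != nil {
			return attempt - 1, ctxErr
		}
		if err = fn(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

// demoCancellation runs the same failing operation under an already-cancelled
// context and under a copy of it that ignores cancellation.
func demoCancellation() (cancellable, uncancellable int) {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // the caller has already given up

	failing := func() error { return errors.New("transient failure") }

	cancellable, _ = retrySketch(parent, 5, failing)
	uncancellable, _ = retrySketch(context.WithoutCancel(parent), 5, failing)
	return cancellable, uncancellable
}

func main() {
	cancellable, uncancellable := demoCancellation()
	fmt.Println(cancellable, uncancellable)
}
```

In the sketch, the cancellable loop makes no attempts while the detached one exhausts all five, which is why nesting such a loop inside an outer retry compounds the delay.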

maxdml · 2026-02-02T18:16:27Z

dbos/system_database.go

-		if wfState.isWithinStep {
-			return newStepExecutionError(wfState.workflowID, functionName, fmt.Errorf("cannot call Send within a step"))
-		}


Lifted to pre-sysdb invocation

dbos/debouncer.go

maxdml · 2026-02-04T00:34:18Z

dbos/system_database.go

-        WITH oldest_entry AS (
-            SELECT destination_uuid, topic, message, created_at_epoch_ms
-            FROM %s.notifications
-            WHERE destination_uuid = $1 AND topic = $2
-            ORDER BY created_at_epoch_ms ASC
-            LIMIT 1
-        )
-        DELETE FROM %s.notifications
-        WHERE destination_uuid = (SELECT destination_uuid FROM oldest_entry)
-          AND topic = (SELECT topic FROM oldest_entry)
-          AND created_at_epoch_ms = (SELECT created_at_epoch_ms FROM oldest_entry)
-        RETURNING message`, pgx.Identifier{s.schema}.Sanitize(), pgx.Identifier{s.schema}.Sanitize())


This query can delete more than one message if several rows for the same destination and topic share the same created_at_epoch_ms, i.e. were created within the same millisecond. This was surfaced by 1) not running send() inside a transaction when it is called outside of a workflow and 2) a recent change to the Golang migration, where we now have:

created_at_epoch_ms BIGINT NOT NULL DEFAULT (EXTRACT(epoch FROM now())::numeric * 1000)::bigint,

instead of

created_at_epoch_ms BIGINT NOT NULL DEFAULT (EXTRACT(epoch FROM now()) * 1000.0)::bigint,

The first line converts the result of EXTRACT(epoch FROM now()) (double precision, float8) to numeric, which results in the * 1000 multiplication being exact numeric arithmetic and quite stable.

The second line (what we have in Python) does the multiplication in double precision, then converts to bigint, which can truncate. Because floating-point multiplication can (often) introduce errors, this meant more volatility in the truncation, which I think could have helped hide this bug.
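The failure mode can be reproduced without a database. The sketch below is an in-memory illustration, not the query that replaced the one above: `consumeByTimestamp` mimics deleting by (destination, topic, timestamp), and `consumeByID` shows one possible remedy that assumes each row has a unique key. The `id` field and all function names are hypothetical.

```go
package main

import "fmt"

// notification models one row; id is a hypothetical unique key.
type notification struct {
	id               int
	destination      string
	topic            string
	message          string
	createdAtEpochMs int64
}

// oldestIndex returns the index of the oldest matching row, or -1 if none match.
func oldestIndex(rows []notification, dest, topic string) int {
	oldest := -1
	for i, r := range rows {
		if r.destination != dest || r.topic != topic {
			continue
		}
		if oldest == -1 || r.createdAtEpochMs < rows[oldest].createdAtEpochMs {
			oldest = i
		}
	}
	return oldest
}

// consumeByTimestamp deletes every row sharing the oldest row's timestamp.
func consumeByTimestamp(rows []notification, dest, topic string) (kept []notification, consumed []string) {
	i := oldestIndex(rows, dest, topic)
	if i == -1 {
		return rows, nil
	}
	ts := rows[i].createdAtEpochMs
	for _, r := range rows {
		if r.destination == dest && r.topic == topic && r.createdAtEpochMs == ts {
			consumed = append(consumed, r.message)
		} else {
			kept = append(kept, r)
		}
	}
	return kept, consumed
}

// consumeByID deletes exactly one row, identified by its unique key.
func consumeByID(rows []notification, dest, topic string) (kept []notification, consumed []string) {
	i := oldestIndex(rows, dest, topic)
	if i == -1 {
		return rows, nil
	}
	id := rows[i].id
	for _, r := range rows {
		if r.id == id {
			consumed = append(consumed, r.message)
		} else {
			kept = append(kept, r)
		}
	}
	return kept, consumed
}

// sampleRows has two messages created within the same millisecond.
func sampleRows() []notification {
	return []notification{
		{id: 1, destination: "wf-1", topic: "t", message: "first", createdAtEpochMs: 1000},
		{id: 2, destination: "wf-1", topic: "t", message: "second", createdAtEpochMs: 1000},
		{id: 3, destination: "wf-1", topic: "t", message: "third", createdAtEpochMs: 1001},
	}
}

func main() {
	_, byTimestamp := consumeByTimestamp(sampleRows(), "wf-1", "t")
	_, byID := consumeByID(sampleRows(), "wf-1", "t")
	fmt.Println(len(byTimestamp), len(byID))
}
```

A single Recv should consume one message, so the timestamp-based match silently drops the second message whenever two sends land in the same millisecond.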

maxdml added 20 commits January 30, 2026 16:41

refuse sleep/recv execution within a step before calling sysdb

f40c0db

missing retries

a71f0bd

return proper error

7723cbf

remove superfluous retry

a435f58

nit

ea6a2ec

propagate stepID down a workflow context tree

23d93d8

run setEvent from within runAsTxn

ef4ce90

run send with runAsTxn if within a workflow

1f7bf00

revert

f48cccb

adjust error parsing

4239a22

update test

4901aaa

handle 40001 when sending during debounce

726015c

debug

cc0f286

revert

8ba8b88

tests nits

a350c45

retry transaction management in runAsTxn

3af6da0

debug

27fdb77

should not timeout so use small timeout

c90b34a

try always using a txn for send

3891a07

more lenient timeout

5699b9f

maxdml force-pushed the refactor-old-special-steps branch from f1bde3e to 5699b9f Compare February 3, 2026 03:41

maxdml added 9 commits February 2, 2026 19:43

Revert "try always using a txn for send"

12bab3a

This reverts commit 3891a07.

remove nested retries

0102e5c

cleanup

9543d91

fix

1844aee

nit

dadd479

debug

c847897

walk the full error tree

7707b1c

remove nested retry

0b7f444

wrap runAsTxn in retries + add missing retries

2d22b04

maxdml added 3 commits February 3, 2026 13:38

fix

adb9cc0

fix

3329d59

cleanup

595b1de

maxdml changed the base branch from main to fix-txn-retries February 3, 2026 21:54

maxdml added 16 commits February 3, 2026 13:54

Merge branch 'fix-txn-retries' into refactor-old-special-steps

9d6c307

fix

7268e9f

fix post-merge

15698bb

reinstate post-merge loss

4763419

simply retry

054a0d4

debug

313f34e

deug

a85d175

debug

b3f300c

debug

24e3d80

try always running ina txn

c9045cc

debug

61670d7

no txn

2190665

debug

6356b12

less debug

9f3876f

prevent multiple deletion colision due to timestamp

2177d6b

cleanup

8451566

maxdml marked this pull request as ready for review February 4, 2026 00:35

maxdml commented Feb 4, 2026

Base automatically changed from fix-txn-retries to main February 4, 2026 01:30

maxdml added 4 commits February 3, 2026 17:30

Merge branch 'main' into refactor-old-special-steps

aed8664

allow chain of retry conditions + add one for 40001

a28ab94

only use repeatable read for resumeWorkflow

46be9b5

set isolation level when beginning tx

45d19a2

kraftp approved these changes Feb 4, 2026

maxdml merged commit 61adb38 into main Feb 4, 2026
4 of 5 checks passed

maxdml deleted the refactor-old-special-steps branch February 4, 2026 18:12

maxdml merged 67 commits into main from refactor-old-special-steps
