Skip to content

Investigate: some transactions complete without emitting TransactionCompleted/TransactionTimedOut events #3099

Description

@iduartgomez

Context

Discovered during investigation of #2981 (pending_op_results memory leak). The fix adds cleanup on TransactionCompleted/TransactionTimedOut events plus a periodic safety-net sweep for entries where the sender is closed.

The reviewer asked: why do some transactions complete without emitting these events?

Root Cause Analysis

Several code paths in handle_op_result() (crates/core/src/operations/mod.rs) reach terminal states without emitting TransactionCompleted:

1. Local state queueing (UpdateBroadcasting)

When OperationResult has return_msg: None and state: Some(updated_state), the state is pushed to OpManager but no completion event is emitted (line ~258-266).

2. StatePushed error path

Err(OpError::StatePushed) returns Ok(None) — the operation continues locally without any event (line ~113-117).

3. Duplicate message for completed op

Err(OpError::OpNotPresent(tx)) returns Ok(None) — the original transaction's callback was already inserted but if the original completion path didn't emit an event, the entry leaks (line ~118-127).

4. Root operation waiting for sub-operations

When a finalized operation has sub-operations that haven't completed, it's stored in root_ops_awaiting_sub_ops and returns Ok(None) — no event emitted until all children complete (line ~180-213).

5. Callback invocation guard

In process_message_decoupled() (crates/core/src/node/mod.rs), the callback sender is only invoked when is_operation_completed(&op_result) returns true. The pending_op_results entry is inserted regardless, but the callback may never fire for the paths above.

Impact

The periodic safety-net cleanup in #2981 handles all these cases by checking sender.is_closed(). The event-driven cleanup handles the common case. Together they prevent the leak.

Recommendation

Consider auditing whether TransactionCompleted should be emitted in more of these terminal paths, particularly:

  • When StatePushed operations actually finish their local processing
  • When root operations' sub-operations all complete (verify the parent event fires)

This would make the system more consistent, though the safety-net cleanup makes it non-critical.

Related

Metadata

Metadata

Assignees

Labels

E-mediumExperience needed to fix/implement: Medium / intermediateP-mediumMedium priorityT-bugType: Something is broken

Type

No type
No fields configured for issues without a type.

Projects

Status
Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions