Avoid deadlocks when dispatching event callbacks #973

Open
zhufangda-vita wants to merge 6 commits into ros2:humble from zhufangda-vita:fix/humble-event-callback-deadlock
@zhufangda-vita

Description

Avoid deadlocks caused by invoking event callbacks while holding internal mutexes.

This change moves callback execution out of event_mutex_, events_mutex_, and
graph_mutex_ critical sections. Graph event callbacks are collected while the
graph is updated, enqueued in graph mutation order, and dispatched after internal
locks are released.
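
The collect, enqueue, and drain flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GraphCacheSketch`, `add_listener`, `update_graph`, `dispatch_pending`, and `queue_mutex_` are hypothetical names, and the real graph cache tracks far more state. Only `graph_mutex_` and `dispatching_` are taken from the description.

```cpp
#include <cassert>
#include <deque>
#include <exception>
#include <functional>
#include <iostream>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the graph cache; only the locking pattern matters.
class GraphCacheSketch
{
public:
  using Callback = std::function<void()>;

  void add_listener(Callback cb)
  {
    std::lock_guard<std::mutex> lock(graph_mutex_);
    listeners_.push_back(std::move(cb));
  }

  // Mutate the graph and enqueue callbacks while graph_mutex_ is held, so the
  // queue order matches the mutation order. Dispatch happens after release.
  void update_graph()
  {
    {
      std::lock_guard<std::mutex> graph_lock(graph_mutex_);
      ++graph_version_;
      std::lock_guard<std::mutex> queue_lock(queue_mutex_);
      for (const auto & cb : listeners_) {
        pending_.push_back(cb);
      }
    }
    dispatch_pending();
  }

private:
  void dispatch_pending()
  {
    {
      std::lock_guard<std::mutex> lock(queue_mutex_);
      if (dispatching_) {
        return;  // A caller further up the stack or another thread is draining.
      }
      dispatching_ = true;
    }
    for (;; ) {
      Callback cb;
      {
        std::lock_guard<std::mutex> lock(queue_mutex_);
        if (pending_.empty()) {
          dispatching_ = false;
          return;
        }
        cb = std::move(pending_.front());
        pending_.pop_front();
      }
      try {
        cb();  // No internal lock is held here, so re-entry cannot deadlock.
      } catch (const std::exception & e) {
        std::cerr << "graph callback threw: " << e.what() << '\n';
      } catch (...) {
        std::cerr << "graph callback threw an unknown exception\n";
      }
    }
  }

  std::mutex graph_mutex_;
  std::mutex queue_mutex_;  // Always acquired after graph_mutex_, never before.
  std::vector<Callback> listeners_;
  std::deque<Callback> pending_;
  bool dispatching_{false};
  int graph_version_{0};
};
```

In this sketch a callback that calls `update_graph()` again only appends to the queue; the outer drain loop picks up the new entries, so ordering stays FIFO and a throwing callback cannot leave `dispatching_` set.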

Key changes:

  • Copy event callbacks and user data under event_mutex_, then invoke them after
    releasing the lock.
  • Collect graph event callbacks under the existing lock order, enqueue them in a
    FIFO dispatch queue while graph_mutex_ is still held, and drain the queue
    after releasing internal locks.
  • Prevent recursive dispatch with a dispatching_ flag.
  • Catch and log exceptions from queued graph callbacks so one failing callback
    does not leave dispatch stuck or block later events.
  • Use weak subscription data captures for graph event callbacks.
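
The first and last items in this list can be illustrated with a small sketch. It assumes a simplified callback signature and invented types (`EventSketch`, `SubscriptionDataSketch`, `make_graph_callback`); none of these are the actual rmw_zenoh_cpp classes, and only `event_mutex_` is a name from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical event holder showing "copy under lock, invoke after release".
class EventSketch
{
public:
  using Callback = std::function<void (const void * user_data, std::size_t count)>;

  void set_callback(Callback cb, const void * user_data)
  {
    std::lock_guard<std::mutex> lock(event_mutex_);
    callback_ = std::move(cb);
    user_data_ = user_data;
  }

  void trigger()
  {
    Callback cb;
    const void * user_data = nullptr;
    {
      // Copy the callback and its user data while the lock is held.
      std::lock_guard<std::mutex> lock(event_mutex_);
      cb = callback_;
      user_data = user_data_;
    }
    // The lock is released; the callback may call set_callback() safely.
    if (cb) {
      cb(user_data, 1);
    }
  }

private:
  std::mutex event_mutex_;
  Callback callback_;
  const void * user_data_{nullptr};
};

// Hypothetical subscription state that a graph event callback updates.
struct SubscriptionDataSketch
{
  int matched_events{0};
};

// Capture a weak_ptr so a queued callback neither keeps the subscription
// alive nor touches it after it has been destroyed.
std::function<void()> make_graph_callback(
  const std::shared_ptr<SubscriptionDataSketch> & sub)
{
  std::weak_ptr<SubscriptionDataSketch> weak = sub;
  return [weak]() {
           if (auto locked = weak.lock()) {
             ++locked->matched_events;
           }
         };
}
```

With a non-recursive `std::mutex`, calling `set_callback()` from inside the callback would deadlock if `trigger()` still held `event_mutex_`; copying first removes that hazard at the cost of possibly running a callback that was replaced a moment earlier.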

This follows the same general direction as #937 and #955: avoid holding internal
locks while invoking callbacks that may re-enter rmw_zenoh.

Fixes # (issue)

Is this user-facing behavior change?

This should not change normal user-facing behavior. It is intended to prevent
deadlocks/hangs when graph or event callbacks are triggered concurrently.

Did you use Generative AI?

Yes. ChatGPT/Codex/Claude/Copilot was used to help analyze the lock-order issue and draft parts
of the implementation.

Additional Information

Validation performed:

  • git diff --check
  • ament_uncrustify rmw_zenoh_cpp/src/detail/event.cpp rmw_zenoh_cpp/src/detail/graph_cache.cpp rmw_zenoh_cpp/src/detail/graph_cache.hpp rmw_zenoh_cpp/src/rmw_event.cpp

A full colcon build could not be completed locally because the environment is
missing cargo; the build stops in zenoh_cpp_vendor before compiling
rmw_zenoh_cpp.

@zhufangda-vita force-pushed the fix/humble-event-callback-deadlock branch from 52e61d2 to 01aed2a on May 12, 2026 12:07
Queue graph event callbacks while holding graph_mutex_, then dispatch them
after releasing internal locks. This preserves graph event ordering across
concurrent graph updates while avoiding callback execution under graph_mutex_
or events_mutex_.

Also move event callbacks out of event_mutex_ critical sections and use weak
subscription data captures for graph event callbacks. Catch and log exceptions
from queued graph callbacks so one failing callback does not leave dispatching
stuck or block later events.

Signed-off-by: zhufangda <fangda.zhu@vbot.cn>
@zhufangda-vita force-pushed the fix/humble-event-callback-deadlock branch from 01aed2a to b97ad4a on May 13, 2026 02:32
Avoid holding entity mutexes while acquiring the wait set condition mutex in
guard condition, client, service, and subscription notification paths. Snapshot
wait_set_data_ under the internal lock, then trigger callbacks and notify the
wait set after releasing it to prevent ABBA deadlocks with rmw_wait.
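
A rough sketch of the snapshot-then-notify pattern follows. The types and members (`WaitSetDataSketch`, `GuardConditionSketch`, `internal_mutex_`, `has_triggered_`) are hypothetical, and holding the wait set data through a `std::shared_ptr` is an assumption made here to keep the example lifetime-safe; only the name `wait_set_data_` comes from the commit message.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical wait set state that a waiting thread would block on.
struct WaitSetDataSketch
{
  std::mutex condition_mutex;
  std::condition_variable condition_variable;
  bool triggered{false};
};

// Hypothetical entity (guard condition, client, service, or subscription).
class GuardConditionSketch
{
public:
  void attach(std::shared_ptr<WaitSetDataSketch> data)
  {
    std::lock_guard<std::mutex> lock(internal_mutex_);
    wait_set_data_ = std::move(data);
  }

  void trigger()
  {
    std::shared_ptr<WaitSetDataSketch> snapshot;
    {
      // Hold only the entity mutex while updating state and taking a snapshot.
      std::lock_guard<std::mutex> lock(internal_mutex_);
      has_triggered_ = true;
      snapshot = wait_set_data_;
    }
    // The entity mutex is released before the condition mutex is taken, so the
    // two locks are never held together in this path.
    if (snapshot) {
      {
        std::lock_guard<std::mutex> lock(snapshot->condition_mutex);
        snapshot->triggered = true;
      }
      snapshot->condition_variable.notify_one();
    }
  }

  bool has_triggered()
  {
    std::lock_guard<std::mutex> lock(internal_mutex_);
    return has_triggered_;
  }

private:
  std::mutex internal_mutex_;
  bool has_triggered_{false};
  std::shared_ptr<WaitSetDataSketch> wait_set_data_;
};
```

Because a waiter typically holds the condition mutex and then inspects each entity (condition mutex, then entity mutex), a notifier that took the entity mutex first and the condition mutex second would form the ABBA cycle. Releasing the entity mutex before notifying breaks that cycle.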

Also defer subscription message-lost event updates until after releasing the
subscription mutex.
@JEnoch

JEnoch commented Jun 1, 2026

Contributor

Your PR is based on forked and modified versions of zenoh-c and zenoh-cpp:

If you think changes are required in Zenoh, please also submit a PR there and link it to this PR.
We'll need to review and merge the changes in Zenoh first.

4 participants