Avoid deadlocks when dispatching event callbacks#973
Open
zhufangda-vita wants to merge 6 commits into
Open
Conversation
52e61d2 to
01aed2a
Compare
Queue graph event callbacks while holding graph_mutex_, then dispatch them after releasing internal locks. This preserves graph event ordering across concurrent graph updates while avoiding callback execution under graph_mutex_ or events_mutex_. Also move event callbacks out of event_mutex_ critical sections and use weak subscription data captures for graph event callbacks. Catch and log exceptions from queued graph callbacks so one failing callback does not leave dispatching stuck or block later events. Signed-off-by: zhufangda fangda.zhu@vbot.cn
01aed2a to
b97ad4a
Compare
Avoid holding entity mutexes while acquiring the wait set condition mutex in guard condition, client, service, and subscription notification paths. Snapshot wait_set_data_ under the internal lock, then trigger callbacks and notify the wait set after releasing it to prevent ABBA deadlocks with rmw_wait. Also defer subscription message-lost event updates until after releasing the subscription mutex.
Contributor
|
Your PR is based on forked and modified versions of zenoh-c and zenoh-cpp: If you think changes are required in Zenoh, please also submit a PR there and link it to this PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Avoid deadlocks caused by invoking event callbacks while holding internal mutexes.
This change moves callback execution out of
event_mutex_,events_mutex_, andgraph_mutex_critical sections. Graph event callbacks are collected while thegraph is updated, enqueued in graph mutation order, and dispatched after internal
locks are released.
Key changes:
event_mutex_, then invoke them afterreleasing the lock.
FIFO dispatch queue while
graph_mutex_is still held, and drain the queueafter releasing internal locks.
dispatching_flag.does not leave dispatch stuck or block later events.
This follows the same general direction as #937 and #955: avoid holding internal
locks while invoking callbacks that may re-enter
rmw_zenoh.Fixes # (issue)
Is this user-facing behavior change?
This should not change normal user-facing behavior. It is intended to prevent
deadlocks/hangs when graph or event callbacks are triggered concurrently.
Did you use Generative AI?
Yes. ChatGPT/Codex/Claude/Copilot was used to help analyze the lock-order issue and draft parts
of the implementation.
Additional Information
Validation performed:
git diff --checkament_uncrustify rmw_zenoh_cpp/src/detail/event.cpp rmw_zenoh_cpp/src/detail/graph_cache.cpp rmw_zenoh_cpp/src/detail/graph_cache.hpp rmw_zenoh_cpp/src/rmw_event.cppA full
colcon buildcould not be completed locally because the environment ismissing
cargo; the build stops inzenoh_cpp_vendorbefore compilingrmw_zenoh_cpp.