fix: keep pub_ valid for simple publishers (debug nightly crash) #990

Open
YuanYuYuan wants to merge 1 commit into ros2:rolling from ZettaScaleLabs:fix/advanced-pub-debug-crash

@YuanYuYuan YuanYuYuan commented Jun 9, 2026

Contributor

Closes #989.

Summary

Fix a null advanced-publisher dereference that makes every rmw_zenoh_cpp pub/sub test fail in the nightly_linux_debug job (e.g. nightly_linux_debug #3865 — 265 failures, all the same panic).

In debug builds every publish aborts with:

thread '<unnamed>' panicked at zenoh_c_vendor/src/advanced_publisher.rs:297:10:
unsafe precondition(s) violated: hint::unreachable_unchecked must never be reached
thread caused non-unwinding panic. aborting.

In release builds the same path is undefined behaviour (the check is compiled out), which is why only the debug nightly surfaces it.

Root cause

The per-endpoint refactor (#930) added a base endpoint for simple (non-buffer-aware) publishers. The PublisherData constructor initialises pub_ from its argument and then, for simple publishers, moves pub_ into base_endpoint->pub:

pub_(std::move(pub)),
...
base_endpoint->pub = std::optional<zenoh::ext::AdvancedPublisher>(std::move(pub_));

After the move, pub_ is in a moved-from (null) state. But publish(), publish_serialized_message() and shutdown() all still operate on pub_, so the first pub_.put() calls ze_advanced_publisher_loan_mut() on a null owned publisher, hitting unwrap_unchecked() on None.
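
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual rmw_zenoh_cpp code: `FakePublisher`, `Endpoint` and `BuggyPublisherData` are hypothetical stand-ins, and `FakePublisher` only assumes that the real owning handle becomes empty when moved from. Unlike the real publisher, its `put()` checks for the empty state and returns `false` instead of invoking undefined behaviour, so the broken state can be observed safely.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an owning handle: moving it leaves the source empty.
class FakePublisher {
public:
  explicit FakePublisher(std::string key)
  : impl_(std::make_unique<std::string>(std::move(key))) {}

  // True while this object still owns the underlying resource.
  bool valid() const { return impl_ != nullptr; }

  // Reports failure on an empty handle rather than dereferencing it.
  bool put(const std::string & payload) {
    if (!impl_) {
      return false;
    }
    last_payload_ = payload;
    return true;
  }

private:
  std::unique_ptr<std::string> impl_;
  std::string last_payload_;
};

// Hypothetical stand-in for the endpoint that receives the publisher.
struct Endpoint {
  std::optional<FakePublisher> pub;
};

// Mirrors the shape of the bug: pub_ is initialised, then moved away,
// but publish() keeps using pub_.
struct BuggyPublisherData {
  explicit BuggyPublisherData(FakePublisher pub)
  : pub_(std::move(pub)) {
    endpoint.pub = std::optional<FakePublisher>(std::move(pub_));
  }

  bool publish(const std::string & msg) { return pub_.put(msg); }

  FakePublisher pub_;
  Endpoint endpoint;
};
```

Constructing a `BuggyPublisherData` and calling `publish()` returns `false` in this sketch, because the only valid handle now lives in `endpoint.pub`. With an unchecked handle, the same call is where the debug-build abort would occur.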

Approach

Rather than just remove the stray move, this unifies how a publisher is represented. The deeper problem is that ownership lived in two divergent places — PublisherData::pub_ and endpoints_->pub — with parallel creation and publish/shutdown paths that can drift out of sync (which is exactly how the bug arose). Patching the move fixes this instance but leaves the same trap for the next change. Collapsing to a single representation makes the inconsistent state unrepresentable, which is the better long-term-maintenance outcome (per review discussion).

Key Changes

  • Remove the PublisherData::pub_ member. The standard/base publisher now lives in base_endpoint_ (a PublisherEndpoint), like every other endpoint.
  • create_publisher_endpoint(full_key, bool buffer_aware) is now the single place that declares an AdvancedPublisher: buffer_aware=false builds the base publisher (keeping the sample-miss-detection heartbeat), buffer_aware=true builds per-subscriber buffer-aware endpoints.
  • make() creates the base publisher through create_publisher_endpoint() after construction instead of declaring it inline and passing it to the constructor.
  • publish() / publish_serialized_message() publish through base_endpoint_; shutdown() undeclares base_endpoint_ and every buffer-aware endpoint uniformly.
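
The shape of the unified design listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `FakePublisher` is a hypothetical stand-in for `zenoh::ext::AdvancedPublisher`, the `endpoints_` map keyed by string is an assumption, and the bodies of `make()`, `publish()` and `shutdown()` are simplified to show only where ownership lives.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the real advanced publisher.
struct FakePublisher {
  std::string key;
  bool buffer_aware;
  std::vector<std::string> sent;
};

struct PublisherEndpoint {
  std::string key;
  std::optional<FakePublisher> pub;
};

class PublisherData {
public:
  // The base publisher is created after construction, through the same
  // factory as every other endpoint.
  static std::shared_ptr<PublisherData> make(const std::string & topic_key) {
    auto data = std::make_shared<PublisherData>();
    data->base_endpoint_ = data->create_publisher_endpoint(topic_key, false);
    return data;
  }

  // Single place where a publisher is declared.
  std::shared_ptr<PublisherEndpoint> create_publisher_endpoint(
    const std::string & full_key, bool buffer_aware)
  {
    auto endpoint = std::make_shared<PublisherEndpoint>();
    endpoint->key = full_key;
    endpoint->pub = FakePublisher{full_key, buffer_aware, {}};
    if (buffer_aware) {
      endpoints_[full_key] = endpoint;  // assumed container for per-subscriber endpoints
    }
    return endpoint;
  }

  // Publishing always goes through base_endpoint_; there is no second owner.
  bool publish(const std::string & msg) {
    if (!base_endpoint_ || !base_endpoint_->pub) {
      return false;
    }
    base_endpoint_->pub->sent.push_back(msg);
    return true;
  }

  // Base and buffer-aware endpoints are torn down the same way.
  void shutdown() {
    if (base_endpoint_) {
      base_endpoint_->pub.reset();
    }
    for (auto & entry : endpoints_) {
      entry.second->pub.reset();
    }
  }

private:
  std::shared_ptr<PublisherEndpoint> base_endpoint_;
  std::map<std::string, std::shared_ptr<PublisherEndpoint>> endpoints_;
};
```

Because no separate `pub_` member exists in this layout, there is no second handle that can be left in a moved-from state while `publish()` still refers to it.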

Verification

Built the rolling stack in Debug (so zenoh-c carries debug_assertions) and ran the previously-failing tests against the patched RMW:

  • Minimal repro (create one publisher and publish a message): no panic, clean exit.
  • test_communication: 0 failures (17/17 message-type pub/sub, plus the launch-based requester/replier/action tests).
  • demo_nodes_cpp / logging_demo / image_tools / pendulum_control: 16/16 pass, covering all the demo/tutorial/pendulum failures from #3865.

Breaking Changes

None.

Did you use Generative AI?

Yes. Claude (claude-opus-4-8) via Claude Code was used to assist with creating an initial prototype version of the changes contained in this PR.

ahcorde commented Jun 9, 2026

Contributor

Pulls: #990
Gist: https://gist.githubusercontent.com/ahcorde/c0e8b53809a6362e1295c3dfe955b638/raw/dc3af366bd2984f1fbe71795cce23560164475a6/ros2.repos
BUILD args: --packages-above-and-dependencies rmw_zenoh_cpp
TEST args: --packages-above rmw_zenoh_cpp
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/19507

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Yadunund Yadunund left a comment

Member

Lmk if the feedback makes sense! I'm still catching up on the buffer aware changes

if (!is_buffer_aware_) {
  auto base_endpoint = std::make_shared<PublisherEndpoint>();
  base_endpoint->key = entity_->topic_info()->topic_keyexpr_;
  base_endpoint->pub = std::optional<zenoh::ext::AdvancedPublisher>(std::move(pub_));

Member

The fix is fine but leaving a suggestion to improve readability / internal APIs.

We have a PublisherEndpoint::pub but it looks like we only want to assign it a value if the publisher is buffer-aware.

We also have zenoh::ext::AdvancedPublisher pub_ in PublisherData and we have diverging codepaths to reference pub_ or endpoints_->pub depending on whether we are buffer-aware.

Here we directly instantiate std::make_shared<PublisherEndpoint>() and in other places we call PublisherData::create_publisher_endpoint.

To make things more consistent, how about

  1. Moving PublisherData::pub_ into PublisherEndpoint as well. We can either have an std::variant to get an AdvancedPublisher that is simple or buffer-aware, or just have two std::optional<AdvancedPublisher> objects, one for each type.
  2. Update create_publisher_endpoint() to create both types of endpoints by accepting a new argument, e.g., bool buffer_aware=false.
  3. Always rely on create_publisher_endpoint() throughout the codebase.
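
The two alternatives in item 1 could look roughly like this. It is a sketch only: `SimplePublisher` and `BufferAwarePublisher` are hypothetical tag types introduced for illustration. In the codebase both kinds are the same `zenoh::ext::AdvancedPublisher` type, so a real `std::variant` would need distinct wrapper types like these or index-based access.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>

// Hypothetical wrapper types so the two publisher kinds are distinguishable.
struct SimplePublisher { std::string key; };
struct BufferAwarePublisher { std::string key; };

// Option A: one variant member. Exactly one alternative is active at a time,
// and std::monostate models "no publisher declared".
struct EndpointWithVariant {
  std::string key;
  std::variant<std::monostate, SimplePublisher, BufferAwarePublisher> pub;

  bool has_publisher() const {
    return !std::holds_alternative<std::monostate>(pub);
  }
  bool is_buffer_aware() const {
    return std::holds_alternative<BufferAwarePublisher>(pub);
  }
};

// Option B: two optionals. Simpler to read, but nothing in the type stops
// both from being set, so the invariant has to be checked by hand.
struct EndpointWithOptionals {
  std::string key;
  std::optional<SimplePublisher> simple_pub;
  std::optional<BufferAwarePublisher> buffer_aware_pub;

  bool is_consistent() const {
    return !(simple_pub && buffer_aware_pub);
  }
};
```

The variant encodes "one kind or the other" in the type itself, whereas the two-optional layout relies on callers keeping the members mutually exclusive.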

nvcyc commented Jun 9, 2026

Contributor

The fix direction is right. The base publisher should not be moved in the first place.
I'm suggesting a more complete removal here: 0d653ae

Every rmw_zenoh_cpp pub/sub test aborted in nightly_linux_debug: the
PublisherData constructor moved pub_ into a base PublisherEndpoint for simple
publishers, leaving pub_ moved-from (null). publish(), publish_serialized_message()
and shutdown() still operated on pub_, so the first put() dereferenced a null
advanced publisher -- debug builds abort in ze_advanced_publisher_loan_mut
(hint::unreachable_unchecked), release builds hit undefined behaviour.

Rather than patch the move, unify the publisher representation so the divergent
pub_ vs endpoints_ codepaths can no longer disagree:

- Remove the PublisherData::pub_ member. The standard/base publisher now lives in
  base_endpoint_ (a PublisherEndpoint), like every other endpoint.
- create_publisher_endpoint() gains a buffer_aware flag and is the single place
  that declares an AdvancedPublisher. buffer_aware=false builds the base publisher
  (with sample-miss-detection heartbeat); buffer_aware=true builds per-subscriber
  buffer-aware endpoints.
- make() creates the base publisher through create_publisher_endpoint() after
  construction instead of declaring it inline and passing it to the constructor.
- publish()/publish_serialized_message() publish through base_endpoint_;
  shutdown() undeclares base_endpoint_ and every buffer-aware endpoint uniformly.

Verified on a Debug rolling build (zenoh-c with debug_assertions): the minimal
publisher, test_communication (17/17 plus launch-based requester/replier/action),
and demo_nodes_cpp/logging_demo/image_tools/pendulum_control (16/16) all pass; the
advanced_publisher.rs panic no longer occurs.
@YuanYuYuan YuanYuYuan force-pushed the fix/advanced-pub-debug-crash branch from 392c30e to e23cb00 Compare June 10, 2026 01:24
@YuanYuYuan

Contributor Author

Thanks @Yadunund @nvcyc. I went with the full unification rather than just dropping the move.

The reason is long-term maintainability: the root issue isn't only the stray std::move(pub_) — it's that publisher ownership had two divergent representations (PublisherData::pub_ vs endpoints_->pub) with parallel creation and publish/shutdown paths that could (and did) drift out of sync. Patching the move fixes this instance but leaves the same trap for the next change. Collapsing to a single representation makes the inconsistent state unrepresentable.

What changed:

  • Removed PublisherData::pub_. The standard/base publisher now lives in base_endpoint_ (a PublisherEndpoint), like every other endpoint.
  • create_publisher_endpoint(full_key, bool buffer_aware) is now the single place that declares an AdvancedPublisher (buffer_aware=false builds the base publisher and keeps the sample-miss-detection heartbeat; buffer_aware=true builds per-subscriber buffer-aware endpoints).
  • make() creates the base publisher through that path after construction instead of declaring it inline and passing it to the constructor.
  • publish()/publish_serialized_message() publish through base_endpoint_; shutdown() undeclares base_endpoint_ and every buffer-aware endpoint uniformly.

Verified on a Debug rolling build (zenoh-c with debug_assertions): the minimal publisher, test_communication (17/17, including the launch-based requester/replier/action tests), and demo_nodes_cpp/logging_demo/image_tools/pendulum_control (16/16) all pass — the advanced_publisher.rs panic is gone.

Development

Successfully merging this pull request may close these issues.

Consistent failures in rmw_zenoh with debug in nightlies

4 participants