fix: keep pub_ valid for simple publishers (debug nightly crash)#990
fix: keep pub_ valid for simple publishers (debug nightly crash)#990YuanYuYuan wants to merge 1 commit into
Conversation
|
Pulls: #990 |
Yadunund
left a comment
There was a problem hiding this comment.
Lmk if the feedback makes sense! I'm still catching up on the buffer aware changes
| if (!is_buffer_aware_) { | ||
| auto base_endpoint = std::make_shared<PublisherEndpoint>(); | ||
| base_endpoint->key = entity_->topic_info()->topic_keyexpr_; | ||
| base_endpoint->pub = std::optional<zenoh::ext::AdvancedPublisher>(std::move(pub_)); |
There was a problem hiding this comment.
The fix is fine but leaving a suggestion to improve readability / internal APIs.
We have a PublisherEndpoint::pub but it looks like we only want to assign it a value if the publisher is buffer-aware.
We also have zenoh::ext::AdvancedPublisher pub_ in PublisherData and we have diverging codepaths to reference pub_ or endpoints_->pub depending on whether we are buffer-aware.
Here we directly instantiate std::make_shared<PublisherEndpoint>() and in other places we call PublisherData::create_publisher_endpoint.
To make things more consistent, how about
- Moving
PublisherData::pub_intoPublisherEndpointas well. We can either have anstd::variantto get a anAdvancedPublisherthat is simple or buffer-aware or just have twostd::optonal<AdvancedPublisherobjects for each type resp. - Update
create_publisher_endpoint()to create both types of endpoints by accepting a new argument, eg.,bool buffer_aware=false. - Always rely on
create_publisher_endpoint()throughout the codebase.
|
The fix direction is right. The base publisher should not be moved in the first place. |
Every rmw_zenoh_cpp pub/sub test aborted in nightly_linux_debug: the PublisherData constructor moved pub_ into a base PublisherEndpoint for simple publishers, leaving pub_ moved-from (null). publish(), publish_serialized_message() and shutdown() still operated on pub_, so the first put() dereferenced a null advanced publisher -- debug builds abort in ze_advanced_publisher_loan_mut (hint::unreachable_unchecked), release builds hit undefined behaviour. Rather than patch the move, unify the publisher representation so the divergent pub_ vs endpoints_ codepaths can no longer disagree: - Remove the PublisherData::pub_ member. The standard/base publisher now lives in base_endpoint_ (a PublisherEndpoint), like every other endpoint. - create_publisher_endpoint() gains a buffer_aware flag and is the single place that declares an AdvancedPublisher. buffer_aware=false builds the base publisher (with sample-miss-detection heartbeat); buffer_aware=true builds per-subscriber buffer-aware endpoints. - make() creates the base publisher through create_publisher_endpoint() after construction instead of declaring it inline and passing it to the constructor. - publish()/publish_serialized_message() publish through base_endpoint_; shutdown() undeclares base_endpoint_ and every buffer-aware endpoint uniformly. Verified on a Debug rolling build (zenoh-c with debug_assertions): the minimal publisher, test_communication (17/17 plus launch-based requester/replier/action), and demo_nodes_cpp/logging_demo/image_tools/pendulum_control (16/16) all pass; the advanced_publisher.rs panic no longer occurs.
392c30e to
e23cb00
Compare
|
Thanks @Yadunund @nvcyc. I went with the full unification rather than just dropping the move. The reason is long-term maintainability: the root issue isn't only the stray What changed:
Verified on a Debug rolling build (zenoh-c with |
Closes #989.
Summary
Fix a null advanced-publisher dereference that makes every
rmw_zenoh_cpppub/sub test fail in thenightly_linux_debugjob (e.g. nightly_linux_debug #3865 — 265 failures, all the same panic).In debug builds every publish aborts with:
In release builds the same path is undefined behaviour (the check is compiled out), which is why only the debug nightly surfaces it.
Root cause
The per-endpoint refactor (#930) added a base endpoint for simple (non-buffer-aware) publishers. The
PublisherDataconstructor initialisespub_from its argument and then, for simple publishers, movespub_intobase_endpoint->pub:pub_(std::move(pub)), ... base_endpoint->pub = std::optional<zenoh::ext::AdvancedPublisher>(std::move(pub_));After the move,
pub_is in a moved-from (null) state. Butpublish(),publish_serialized_message()andshutdown()all still operate onpub_, so the firstpub_.put()callsze_advanced_publisher_loan_mut()on a null owned publisher, hittingunwrap_unchecked()onNone.Approach
Rather than just remove the stray move, this unifies how a publisher is represented. The deeper problem is that ownership lived in two divergent places —
PublisherData::pub_andendpoints_->pub— with parallel creation and publish/shutdown paths that can drift out of sync (which is exactly how the bug arose). Patching the move fixes this instance but leaves the same trap for the next change. Collapsing to a single representation makes the inconsistent state unrepresentable, which is the better long-term-maintenance outcome (per review discussion).Key Changes
PublisherData::pub_member. The standard/base publisher now lives inbase_endpoint_(aPublisherEndpoint), like every other endpoint.create_publisher_endpoint(full_key, bool buffer_aware)is now the single place that declares anAdvancedPublisher:buffer_aware=falsebuilds the base publisher (keeping the sample-miss-detection heartbeat),buffer_aware=truebuilds per-subscriber buffer-aware endpoints.make()creates the base publisher throughcreate_publisher_endpoint()after construction instead of declaring it inline and passing it to the constructor.publish()/publish_serialized_message()publish throughbase_endpoint_;shutdown()undeclaresbase_endpoint_and every buffer-aware endpoint uniformly.Verification
Built the rolling stack in Debug (so zenoh-c carries
debug_assertions) and ran the previously-failing tests against the patched RMW:test_communication: 0 failures (17/17 message-type pub/sub, plus the launch-based requester/replier/action tests).demo_nodes_cpp/logging_demo/image_tools/pendulum_control: 16/16 pass, covering all the demo/tutorial/pendulum failures from #3865.Breaking Changes
None.
Did you use Generative AI?
Yes. Claude (claude-opus-4-8) via Claude Code was used to assist with creating an initial prototype version of the changes contained in this PR.