Parallel configure/activate in lifecycle manager #5541
Please fix some issues introduced by generative AI: there are variables that are defined but never used, and there is no error logging. Please review your code a bit more carefully before opening PRs with gen AI 😉
A design question: rather than a large set of duplicated code for the parallel and sequential paths, we could add an if (!parallel_state_transitions_) {future.get();} type action in the loop where we create the futures. That way we can still run sequentially if we want, with little if any code duplication to support both modes. Then the 'for each future' loop only runs if parallel_state_transitions_ is set.
Spinning up the threads might take a bit of extra time, but as long as it's not much, I think the design simplicity is worth an extra couple hundred milliseconds.
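A minimal sketch of the single-loop design suggested above, written against the standard library only so it compiles without ROS 2. The names transition_all, TransitionFn and the parallel flag are hypothetical stand-ins, not the lifecycle manager's actual API; the real code would call its own per-node state-change routine and read parallel_state_transitions_.

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one node's state transition; returns true on success.
using TransitionFn = std::function<bool(const std::string &)>;

// Launches one task per node. When `parallel` is false, each future is resolved
// inside the creation loop, so the nodes transition strictly one after another.
bool transition_all(
  const std::vector<std::string> & node_names,
  const TransitionFn & change_state,
  bool parallel)
{
  std::vector<std::future<bool>> futures;
  bool all_ok = true;

  for (const auto & name : node_names) {
    auto future = std::async(std::launch::async, change_state, name);
    if (!parallel) {
      // Sequential mode: block here before launching the next transition.
      all_ok = future.get() && all_ok;
    } else {
      futures.push_back(std::move(future));
    }
  }

  // Parallel mode only: in sequential mode `futures` is empty and this is a no-op.
  for (auto & future : futures) {
    all_ok = future.get() && all_ok;
  }
  return all_ok;
}
```

Both modes share the same launch code, at the cost of one thread start per node even when running sequentially. Whether the sequential path should stop at the first failure is left open here; this sketch always visits every node.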
More of an FYI: the ros2 launch API now has an autostart field I added, so you can autostart lifecycle nodes and components without a manager if you choose.
CI is failing, I believe due to another PR merged recently. Can you rebase / pull in main?
If that doesn't fix it, change all the v39 to v40 in this file https://github.com/ros-navigation/navigation2/blob/main/.circleci/config.yml#L41 (there are 3 of them).
bond_respawn_max_duration_ = rclcpp::Duration::from_seconds(respawn_timeout_s);

get_parameter("attempt_respawn_reconnection", attempt_respawn_reconnection_);
get_parameter("parallel_state_transitions", parallel_state_transitions_);
docs.nav2.org needs to be updated with the configuration guide for the new parameter. Also a migration guide entry talking about this feature and some metrics would be nice so other users are aware.
if (!success && !hard_change) {
  uint8_t state = node_map_[node_name]->get_state();
  if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
We have the transition_state_map_, which we should probably use to look up the state corresponding to the transition being completed.
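To illustrate the point: the quoted strcmp call reinterprets the address of a one-byte numeric state ID as a C string, so it cannot be relied on to match "Inactive" and reads past that byte. A rough sketch of the map-based check follows. The numeric IDs are copied from lifecycle_msgs/msg/State and lifecycle_msgs/msg/Transition so the example builds without ROS 2; the map contents and the helper reached_expected_state are assumptions about what transition_state_map_ holds, not the manager's actual code.

```cpp
#include <cstdint>
#include <map>

// Primary state IDs as defined in lifecycle_msgs/msg/State.
namespace state
{
constexpr std::uint8_t UNCONFIGURED = 1;
constexpr std::uint8_t INACTIVE = 2;
constexpr std::uint8_t ACTIVE = 3;
}  // namespace state

// Transition IDs as defined in lifecycle_msgs/msg/Transition.
namespace transition
{
constexpr std::uint8_t CONFIGURE = 1;
constexpr std::uint8_t CLEANUP = 2;
constexpr std::uint8_t ACTIVATE = 3;
constexpr std::uint8_t DEACTIVATE = 4;
}  // namespace transition

// Assumed shape of the map: transition ID -> primary state expected afterwards.
const std::map<std::uint8_t, std::uint8_t> transition_state_map = {
  {transition::CONFIGURE, state::INACTIVE},
  {transition::CLEANUP, state::UNCONFIGURED},
  {transition::ACTIVATE, state::ACTIVE},
  {transition::DEACTIVATE, state::INACTIVE},
};

// Hypothetical helper: true if the node ended up in the state that the
// completed transition is supposed to produce. Unknown transitions return false.
bool reached_expected_state(std::uint8_t transition_id, std::uint8_t current_state)
{
  const auto it = transition_state_map.find(transition_id);
  return it != transition_state_map.end() && it->second == current_state;
}
```

Comparing integer IDs this way avoids any string handling and keeps the expected outcome of each transition in one place.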
if (!success && !hard_change) {
  uint8_t state = node_map_[node_name]->get_state();
  if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
    inactive_nodes += node_name + delimiter;
This is unused.
if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
  inactive_nodes += node_name + delimiter;
} else {
  unconfigured_nodes += node_name + delimiter;
This is unused.
return false;
/* Function partially created using claude */
size_t active_nodes_count = 0;
std::string nodes_in_error_state = "";
This is unused.
std::string nodes_in_error_state = "";
std::string unconfigured_nodes = "";
std::string inactive_nodes = "";
std::string delimiter(", ");
This shouldn't be necessary; the code already has the information needed to check the state returns.
@jplapp any update?
Signed-off-by: Johannes Plapp <[email protected]>
8f25ad5 to 9efd24d
Signed-off-by: Johannes Plapp <[email protected]>
9efd24d to bebf1f8
Thanks a lot for the quick feedback and sorry for the long wait! The code issues (sadly) come down to my manual mistakes while porting this feature from our branch to main. I hope it is better now. Some timings on my laptop with our stack, for the current implementation and for sequential transitions: [timing values not preserved]. So the impact of launching the threads seems negligible.
So most of the "slow launch" impact was caused by having used a bond_heartbeat_period of 1.0. There is some remaining speedup, but I'm not sure it's worth the additional complexity. Let me know, and I'll update the docs if needed.
Basic Info
Description of contribution in a few bullet points
We noticed that our lifecycle nodes don't depend on being configured/activated in sequence, so we can speed up robot launch by activating everything at once. For us, on the robot, it reduces launch time (until "all managed nodes are active") from 51 seconds to 35 seconds.
I tried it with the simulation from this repo by running a system test:
colcon test --packages-select nav2_system_tests --event-handler=console_direct+ --ctest-args --output-on-failure -R _error_msg$
After I removed some arbitrarily long sleeps from the tester node, I got:
with this PR: 6 seconds for configure+activate; overall test 50 seconds
without this PR: 7 seconds for configure+activate; overall test 52 seconds (as deactivate is also faster)
So, not that much, but the real-world benefit, at least for us, is significant. Let me know if this is something you want to add; then we can polish this PR.
Description of documentation updates required from your changes
Description of how this change was tested
Tested on our robot and in the nav2 simulation.
It has been running on production robots for a couple of weeks, but as this only concerns launching, that doesn't mean much.
Future work that may be required in bullet points