@jplapp jplapp commented Sep 19, 2025


Basic Info

| Info | Please fill out this column |
| --- | --- |
| Ticket(s) this addresses | none |
| Primary OS tested on | Ubuntu |
| Robotic platform tested on | Pixel PT |
| Does this PR contain AI generated software? | yes |
| Was this PR description generated by AI software? | no |

Description of contribution in a few bullet points

We noticed that our lifecycle nodes don't depend on being configured/activated in sequence, so we can speed up robot launch by activating everything at once. For us, on the robot, it reduces launch time (until "all managed nodes are active") from 51 seconds to 35 seconds.

I tried it with the simulation from this repo by running a system test:
colcon test --packages-select nav2_system_tests --event-handler=console_direct+ --ctest-args --output-on-failure -R _error_msg$

After I removed some arbitrarily long sleeps from the tester node, I got:
with this PR: 6 seconds for configure+activate; overall test 50 seconds
without this PR: 7 seconds for configure+activate; overall test 52 seconds (the overall difference is larger because deactivate is also faster with this PR)

So, not that much, but the real-world benefit, at least for us, is significant. Let me know if this is something you want to add; then we can polish this PR.

Description of documentation updates required from your changes

Description of how this change was tested

Tested on our robot and in the nav2 simulation.

It has been running on production robots for a couple of weeks, but as this only concerns launching, that doesn't mean much.

Future work that may be required in bullet points

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists
  • Should this be backported to current distributions? If so, tag with backport-*.

Member

@SteveMacenski SteveMacenski left a comment


Please fix some issues due to generative AI. There are unused defined variables and no error logging. Please review your code a bit more carefully before opening PRs with gen AI 😉


A design question: rather than a large block of duplicated code for parallel versus sequential processing, we could add an if (!parallel_state_transitions_) {future.get();} type check in the loop where we create the futures. That way we can still run sequentially if we want, with little, if any, code duplication to support both modes. The 'for each future' loop then only runs if parallel_state_transitions_ is set.

Spinning up the threads might take a bit of extra time, but as long as it's not much, I think that design simplicity is worth an extra couple hundred milliseconds.
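As a rough illustration of the single-loop design suggested above, here is a minimal, self-contained sketch. It is not the lifecycle manager's actual code: `change_node_state` and `change_all_states` are hypothetical stand-ins for the per-node service call and the manager's transition loop, and the plain `bool` argument stands in for the `parallel_state_transitions_` member.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-node lifecycle service call.
// It sleeps briefly to mimic a transition and fails for an empty name.
bool change_node_state(const std::string & node_name)
{
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return !node_name.empty();
}

// Transition every node through one code path, sequentially or in parallel.
// The flag stands in for the parallel_state_transitions_ member (assumption).
bool change_all_states(
  const std::vector<std::string> & node_names, bool parallel_state_transitions)
{
  std::vector<std::future<bool>> futures;
  bool success = true;

  for (const auto & name : node_names) {
    auto future = std::async(std::launch::async, change_node_state, name);
    if (!parallel_state_transitions) {
      // Sequential mode: wait for this node before starting the next one.
      success = future.get() && success;
    } else {
      // Parallel mode: keep the future and collect the result later.
      futures.push_back(std::move(future));
    }
  }

  // Only populated in parallel mode; a no-op otherwise.
  for (auto & future : futures) {
    success = future.get() && success;
  }
  return success;
}
```

Calling `change_all_states({"a", "b", "c"}, false)` waits on each node in turn, while passing `true` starts all three before waiting on any of them. Whether the sequential branch should stop at the first failure is a separate decision that this sketch does not address.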


More of an FYI: the ros2 launch API now has an autostart field I added, so you can autostart lifecycle nodes and components without a manager if you choose.


CI is failing I believe due to another PR merged recently. Can you rebase / pull in main?

If that doesn't fix it, change all the v39 to v40 in this file https://github.com/ros-navigation/navigation2/blob/main/.circleci/config.yml#L41 (there are 3 of them).

bond_respawn_max_duration_ = rclcpp::Duration::from_seconds(respawn_timeout_s);

get_parameter("attempt_respawn_reconnection", attempt_respawn_reconnection_);
get_parameter("parallel_state_transitions", parallel_state_transitions_);
Member

docs.nav2.org needs to be updated with the configuration guide for the new parameter. Also a migration guide entry talking about this feature and some metrics would be nice so other users are aware.


if (!success && !hard_change) {
uint8_t state = node_map_[node_name]->get_state();
if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
Member

We have the transition_state_map_, which we should probably use to look up the state corresponding to the transition being completed.
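To make the suggestion concrete, here is a small sketch of checking a node's numeric state against the state expected for a transition, instead of comparing the state byte against a string. The constants and the free-standing `transition_state_map` below are illustrative stand-ins, not the manager's actual members; in ROS 2 the IDs would come from lifecycle_msgs, and the numeric values used here are assumptions.

```cpp
#include <cstdint>
#include <map>

// Stand-in transition and state IDs (assumed values, for illustration only).
constexpr std::uint8_t kTransitionConfigure = 1;
constexpr std::uint8_t kTransitionCleanup = 2;
constexpr std::uint8_t kTransitionActivate = 3;
constexpr std::uint8_t kTransitionDeactivate = 4;

constexpr std::uint8_t kStateUnconfigured = 1;
constexpr std::uint8_t kStateInactive = 2;
constexpr std::uint8_t kStateActive = 3;

// Maps each transition to the primary state a node should end up in.
const std::map<std::uint8_t, std::uint8_t> transition_state_map = {
  {kTransitionConfigure, kStateInactive},
  {kTransitionCleanup, kStateUnconfigured},
  {kTransitionActivate, kStateActive},
  {kTransitionDeactivate, kStateInactive}};

// Returns true if the observed state matches what the transition should produce.
bool reached_expected_state(std::uint8_t transition, std::uint8_t observed_state)
{
  const auto it = transition_state_map.find(transition);
  return it != transition_state_map.end() && it->second == observed_state;
}
```

With a lookup like this, the failure path can compare the value returned by get_state() directly against the expected ID, so no string comparison or reinterpret_cast is needed.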

if (!success && !hard_change) {
uint8_t state = node_map_[node_name]->get_state();
if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
inactive_nodes += node_name + delimiter;
Member

This is unused.

if (!strcmp(reinterpret_cast<char *>(&state), "Inactive")) {
inactive_nodes += node_name + delimiter;
} else {
unconfigured_nodes += node_name + delimiter;
Member

This is unused.

return false;
/* Function partially created using claude */
size_t active_nodes_count = 0;
std::string nodes_in_error_state = "";
Member

This is unused.

std::string nodes_in_error_state = "";
std::string unconfigured_nodes = "";
std::string inactive_nodes = "";
std::string delimiter(", ");
Member

This shouldn't be necessary; the code already has the information needed to check the state returns.

@SteveMacenski
Member

@jplapp any update?

@jplapp jplapp force-pushed the lifecycleparallel branch from 8f25ad5 to 9efd24d Compare October 16, 2025 20:58
Signed-off-by: Johannes Plapp <[email protected]>
@jplapp jplapp force-pushed the lifecycleparallel branch from 9efd24d to bebf1f8 Compare October 16, 2025 20:59
@jplapp
Contributor Author

jplapp commented Oct 16, 2025

Thanks a lot for the quick feedback and sorry for the long wait! The code issues are (sadly) due to my manual mistakes while porting this feature from our branch to main. I hope it is better now.

Some timings on my laptop with our stack:

current implementation, sequential:
configuring: 2900 ms
activating: 3400 ms
this implementation, sequential:
configuring: 2900 ms
activating: 3400 ms
this implementation, parallel:
configuring: 2100 ms
activating: 2300 ms

So the impact of launching the threads seems negligible.
I also found that a significant slowdown during sequential activation comes from waiting for the first bond heartbeat, which in default settings is 1 second for each node (and the waiting takes exactly 1 second). However, I could not find a simple way to speed this up. I think most of the speedup could be achieved by speeding up this process, which might even be better than adding this setting and logic.
If you have any pointers on where to look here, I'd be happy to dig further.
(update: default value is 0.1, so the impact of this is small, updated timing)


codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
nav2_lifecycle_manager/src/lifecycle_manager.cpp 66.66% 8 Missing ⚠️
Files with missing lines Coverage Δ
nav2_lifecycle_manager/src/lifecycle_manager.cpp 88.57% <66.66%> (-0.71%) ⬇️

... and 3 files with indirect coverage changes


@jplapp
Contributor Author

jplapp commented Oct 17, 2025

So most of the "slow launch" impact was caused by having used a bond_heartbeat_period of 1.0. There is some remaining speedup but I'm not sure if it's worth the additional complexity. Let me know, then I'll update the docs if needed.
