
Recycle serialization buffers on transmission #342

Open

wants to merge 16 commits into rolling from buffer-pool

Conversation

fuzzypixelz
Contributor

Adds a LIFO buffer pool in the context to reuse buffers allocated during serialization. The aim is not only to avoid the overhead of dynamic allocation but also to improve the cache locality of serialization buffers.
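
For readers unfamiliar with the pattern, a minimal sketch of such a bounded LIFO pool is shown below. The class and member names are illustrative assumptions, not the actual rmw_zenoh_cpp implementation; the real code may manage raw byte buffers and integrate with the RMW context differently.

```cpp
// Minimal sketch of a bounded LIFO buffer pool (illustrative only; names and
// details are assumptions, not the actual rmw_zenoh_cpp implementation).
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

class BufferPool
{
public:
  explicit BufferPool(std::size_t max_buffers)
  : max_buffers_(max_buffers) {}

  // Borrow a buffer of at least `size` bytes, preferring the most recently
  // returned one (LIFO) so its memory is more likely to still be in cache.
  std::vector<std::uint8_t> acquire(std::size_t size)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_list_.empty()) {
        std::vector<std::uint8_t> buffer = std::move(free_list_.back());
        free_list_.pop_back();
        buffer.resize(size);
        return buffer;
      }
    }
    // Pool is empty: fall back to a fresh allocation.
    return std::vector<std::uint8_t>(size);
  }

  // Return a buffer to the pool; because the pool is bounded, excess buffers
  // are simply dropped and freed.
  void release(std::vector<std::uint8_t> buffer)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_list_.size() < max_buffers_) {
      free_list_.push_back(std::move(buffer));
    }
  }

private:
  std::mutex mutex_;
  std::vector<std::vector<std::uint8_t>> free_list_;  // used as a LIFO stack
  std::size_t max_buffers_;
};
```

The LIFO discipline is the point: the most recently returned buffer is handed out first, so its memory is more likely to still be resident in cache when the next message is serialized.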

@fuzzypixelz

This comment was marked as outdated.

@fuzzypixelz fuzzypixelz changed the title from "Recycle serialization buffers on transmission." to "Recycle serialization buffers on transmission" on Dec 16, 2024
@clalancette
Collaborator

All right, now that we've merged in #327, we can consider this one. Please rebase this onto the latest, then we can do a full review of it. Until then, I'll mark it as a draft.

@clalancette clalancette marked this pull request as draft December 17, 2024 21:41
@YuanYuYuan YuanYuYuan mentioned this pull request Dec 18, 2024
@fuzzypixelz fuzzypixelz force-pushed the buffer-pool branch 2 times, most recently from 068cf50 to 21006d0 on December 19, 2024 at 11:32
Contributor

@ahcorde ahcorde left a comment


There are also many changes unrelated to the goal of this PR.

@fuzzypixelz fuzzypixelz force-pushed the buffer-pool branch 2 times, most recently from 7ca544b to bb6fd88 on December 19, 2024 at 14:46
@fuzzypixelz
Contributor Author

There are also many changes unrelated to the goal of this PR.

There was a formatting error from my IDE. I've restored the files and manually re-applied the patches.

@fuzzypixelz fuzzypixelz force-pushed the buffer-pool branch 5 times, most recently from 8dd9bf5 to bcc36a1 on December 20, 2024 at 16:00
Collaborator

@clalancette clalancette left a comment


Besides the comments inline, do you have any updated performance numbers here?

@fuzzypixelz

This comment was marked as outdated.

@fuzzypixelz fuzzypixelz marked this pull request as ready for review January 3, 2025 15:20
@Yadunund Yadunund self-assigned this Jan 3, 2025
@fuzzypixelz fuzzypixelz force-pushed the buffer-pool branch 2 times, most recently from 6c48aa8 to 6143959 Compare January 17, 2025 09:58
Member

@Yadunund Yadunund left a comment


Overall this looks good. I've left some feedback to re-structure the code a bit.

@fuzzypixelz
Contributor Author

@clalancette @Yadunund The numbers I've shown are for single-process (session-local) communication. I would like to investigate the impact of this pull request on multi-process communication on the same host as well, to make sure we're not degrading performance in that case.

Once that's done, I will post numbers and address your remaining review comments.

fuzzypixelz and others added 9 commits February 6, 2025 16:22
Adds a bounded LIFO buffer pool in the context to reuse buffers
allocated on serialization. The aim is not (only) to avoid the
overhead of dynamic allocation but rather to enhance the cache
locality of serialization buffers.
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Mahmoud Mazouz <[email protected]>
@Yadunund
Member

Yadunund commented Feb 7, 2025

@clalancette @Yadunund The numbers I've shown are for single-process (session-local) communication. I would like to investigate the impact of this pull request on multi-process communication on the same host as well, to make sure we're not degrading performance in that case.

Once that's done, I will post numbers and address your remaining review comments.

@fuzzypixelz any update on performance metrics after the recent set of changes? Is this ready for another review?

@fuzzypixelz
Contributor Author

fuzzypixelz commented Feb 7, 2025

@clalancette @Yadunund Here are the benchmarking results. I used the iRobot benchmark in single-process and multi-process modes, with the Mont Blanc topology and IPC disabled (IPC is only relevant in single-process mode). I ran the benchmark on four machines with varying specs to make sure we perform well on both low-end and high-end devices.

Host 1

System information

  • CPU: AMD EPYC 7502 32-Core Processor
  • MEM: 512G

Single-process

[Plots: Q2, Q3, and mean latency (µs)]

Multi-process

[Plots: Q2, Q3, and mean latency (µs)]

Host 2

System information

  • CPU: 12th Gen Intel(R) Core(TM) i5-1240P
  • MEM: 16G

Single-process

[Plots: Q2, Q3, and mean latency (µs)]

Multi-process

[Plots: Q2, Q3, and mean latency (µs)]

Host 3

System information

  • CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  • MEM: 8G

Single-process

[Plots: Q2, Q3, and mean latency (µs)]

Multi-process

[Plots: Q2, Q3, and mean latency (µs)]

Host 4

System information

  • CPU: ARM Cortex-A76
  • MEM: 8G

Single-process

[Plots: Q2, Q3, and mean latency (µs)]

Multi-process

[Plots: Q2, Q3, and mean latency (µs)]

@fuzzypixelz
Contributor Author

fuzzypixelz commented Feb 7, 2025

The above comment turned out too long; I should probably use collapsed sections.

@Yadunund My conclusion is that this pull request consistently improves latency in intra-process communication for relatively large topics, namely columbia (250 KB) and tagus (250 KB). However, there are data points where latency is worse, especially for some topics in intra-process communication.

My attempts to ascertain the root cause of these apparent regressions have not been successful. However, I believe this pull request solves a real problem: I consistently observe lower latency with it on large topics (hundreds of kilobytes).

I'm not very confident in the reliability of the iRobot benchmark; there are issues with the accuracy of its measurements. I ran all tests for 60 seconds on idle machines, as results would otherwise vary significantly from run to run.

Of course, I'm not saying that these problematic numbers are meaningless; they could be signs of a real problem. But my confidence in them is low, especially when the difference is on the order of tens of microseconds.

I still think that this change is a necessary first step that "just makes sense" and brings rmw_zenoh in line with other RMWs. But there are clearly opportunities for refinement, which can be the subject of future work.
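
To make the intended access pattern concrete, here is a hedged usage sketch built on the hypothetical BufferPool outlined under the PR description above; the serialization and transmission steps are simulated stand-ins, not the real rmw_zenoh_cpp publish path.

```cpp
// Usage sketch (relies on the hypothetical BufferPool above): borrow a buffer,
// "serialize" into it, pretend to transmit, then return it for reuse.
#include <cstring>
#include <iostream>

int main()
{
  BufferPool pool(4);  // keep at most 4 idle buffers around

  for (int i = 0; i < 3; ++i) {
    // Borrow a buffer roughly the size of a large Mont Blanc topic (~250 KB).
    std::vector<std::uint8_t> buffer = pool.acquire(250 * 1024);
    std::memset(buffer.data(), 0x42, buffer.size());              // stand-in for CDR serialization
    std::cout << "transmitting " << buffer.size() << " bytes\n";  // stand-in for the Zenoh put
    pool.release(std::move(buffer));  // the same allocation becomes available for the next iteration
  }
  return 0;
}
```

After the first iteration, acquire() returns the buffer released in the previous iteration rather than allocating a fresh 250 KB block, which is exactly the reuse this pull request aims for.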

@Yadunund
Member

@fuzzypixelz thanks a lot for the detailed study! I'll take a closer look later this week.

@fuzzypixelz
Contributor Author

@Yadunund This should be ready to merge.

@fuzzypixelz fuzzypixelz requested review from Yadunund and ahcorde March 11, 2025 15:15
Contributor

@ahcorde ahcorde left a comment


I merged this branch with rolling, and if you run

colcon test --merge-install --event-handlers console_direct+ --packages-select rc

you will see some new failures:

	107 - test_publisher__rmw_zenoh_cpp (Failed)
	108 - test_publisher_wait_all_ack__rmw_zenoh_cpp (Failed)
	110 - test_subscription__rmw_zenoh_cpp (Failed)
	113 - test_logging_rosout__rmw_zenoh_cpp (Failed)
	117 - test_service_event_publisher__rmw_zenoh_cpp (Failed)

@Yadunund
Member

@fuzzypixelz let's revisit this after the kilted freeze.
