@akkart-aws
Collaborator


What?

Implemented notification message fragmentation in the NIXL libfabric backend. Messages are now split into 256-byte chunks, which removes the previous 1,024-byte cap on notification size.

Why?

TensorRT-LLM's disaggregated serving architecture requires exchanging connection metadata between nodes. On P5 instances with 32 EFA devices, this connection data is approximately 1,792 bytes (32 devices × 56 bytes per endpoint), which exceeds the previous 1,024-byte notification limit, causing connection establishment to fail.

This change enables TensorRT-LLM to work on P5 instances and scales to 256+ rails per node (~14 KB of connection data).
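
The sizes quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes each 256-byte chunk carries payload only, ignoring any per-fragment header overhead the real implementation may add.

```cpp
#include <cstddef>

// Bytes of connection metadata per EFA endpoint, as quoted in the description.
constexpr std::size_t kBytesPerEndpoint = 56;
// Chunk size quoted in the description; header overhead is ignored (assumption).
constexpr std::size_t kChunkBytes = 256;
// Notification limit before this change, as quoted in the description.
constexpr std::size_t kOldLimitBytes = 1024;

// Connection data grows linearly with the number of rails (EFA devices).
constexpr std::size_t connectionDataBytes(std::size_t rails) {
    return rails * kBytesPerEndpoint;
}

// Ceiling division: number of chunks needed to carry message_bytes.
constexpr std::size_t chunksNeeded(std::size_t message_bytes) {
    return (message_bytes + kChunkBytes - 1) / kChunkBytes;
}

static_assert(connectionDataBytes(32) == 1792, "P5: 32 devices x 56 bytes");
static_assert(connectionDataBytes(32) > kOldLimitBytes, "exceeds the old limit");
static_assert(connectionDataBytes(256) == 14336, "256 rails is about 14 KB");
static_assert(chunksNeeded(1792) == 7, "P5 metadata fits in 7 chunks");
static_assert(chunksNeeded(14336) == 56, "256 rails need 56 chunks");
```

Under these assumptions the P5 case needs 7 chunks and the 256-rail case needs 56.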

How?

Sender-Side:

  • Fragmentation happens once in prepXfer(), splitting messages into 256-byte chunks
  • Pre-fragmented notifications stored in backend handle
  • notifSendPriv() sends all fragments with sequence metadata
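
A minimal sketch of the sender-side split described above. The `Fragment` struct, its fields, and the `fragmentMessage` name are assumptions made for illustration; the actual `BinaryNotification` layout and the signature of `fragmentNotificationMessage()` are not shown in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical fragment layout, not the real BinaryNotification.
struct Fragment {
    std::uint32_t xfer_id;  // transfer this fragment belongs to
    std::uint16_t seq;      // index of this fragment within the message
    std::uint16_t total;    // total number of fragments in the message
    std::string payload;    // at most max_payload bytes of the message
};

// Split msg into fragments of at most max_payload bytes each.
// The message is taken by const reference to avoid copying it.
std::vector<Fragment> fragmentMessage(const std::string &msg,
                                      std::uint32_t xfer_id,
                                      std::size_t max_payload = 256) {
    assert(max_payload > 0);
    // An empty message still produces one (empty) fragment so the
    // receiver sees a notification.
    const std::size_t total = std::max<std::size_t>(
        1, (msg.size() + max_payload - 1) / max_payload);
    assert(total <= UINT16_MAX);

    std::vector<Fragment> out;
    out.reserve(total);
    for (std::size_t i = 0; i < total; ++i) {
        out.push_back(Fragment{xfer_id,
                               static_cast<std::uint16_t>(i),
                               static_cast<std::uint16_t>(total),
                               msg.substr(i * max_payload, max_payload)});
    }
    return out;
}
```

Doing the split once at preparation time, as the description says `prepXfer()` does, means the send path only has to iterate over an already-built vector.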

Receiver-Side:

  • Fragments stored in pending map indexed by xfer_id
  • Dual-level tracking: waits for both fragment completion AND write completion
  • checkPendingNotifications() reassembles complete messages
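
The receiver-side behavior above could be sketched roughly as follows. This is not the backend's code: `IncomingFragment`, `Pending`, and `Reassembler` are hypothetical names, and carrying `expected_completions` inside each fragment is an assumption (the commit list only mentions that such a value exists). The point is the dual condition: a message is delivered only when all fragments and all expected write completions have arrived, in any order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical wire-level fragment as seen by the receiver.
struct IncomingFragment {
    std::uint32_t xfer_id;
    std::uint16_t seq;
    std::uint16_t total;
    std::uint32_t expected_completions;  // write completions to wait for
    std::string payload;
};

// Hypothetical per-transfer state kept until the message is complete.
struct Pending {
    std::vector<std::string> parts;
    std::vector<bool> have;
    std::size_t fragments_seen = 0;
    std::size_t writes_seen = 0;
    std::size_t writes_expected = 0;
    bool sized = false;  // true once the first fragment has arrived
};

class Reassembler {
public:
    void onFragment(const IncomingFragment &f) {
        if (f.total == 0) return;  // malformed fragment
        Pending &p = pending_[f.xfer_id];
        if (!p.sized) {
            p.parts.resize(f.total);
            p.have.assign(f.total, false);
            p.writes_expected = f.expected_completions;
            p.sized = true;
        }
        if (f.seq >= p.parts.size() || p.have[f.seq]) return;  // bad or duplicate
        p.parts[f.seq] = f.payload;
        p.have[f.seq] = true;
        ++p.fragments_seen;
    }

    // Write completions may arrive before any fragment does.
    void onWriteCompletion(std::uint32_t xfer_id) { ++pending_[xfer_id].writes_seen; }

    // Returns true and fills out only when both conditions hold.
    bool tryTake(std::uint32_t xfer_id, std::string &out) {
        auto it = pending_.find(xfer_id);
        if (it == pending_.end()) return false;
        const Pending &p = it->second;
        if (!p.sized || p.fragments_seen != p.parts.size() ||
            p.writes_seen < p.writes_expected) {
            return false;
        }
        out.clear();
        for (const std::string &part : p.parts) out += part;
        pending_.erase(it);
        return true;
    }

private:
    std::unordered_map<std::uint32_t, Pending> pending_;
};
```

Keying the map by `xfer_id` lets fragments from different transfers interleave on the wire without mixing their payloads.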

Additional Improvements:

  • Added recv pool pre-posting (128 requests per rail) for improved throughput
  • Added __attribute__((packed)) to BinaryNotification for cross-platform compatibility
  • Extracted fragmentNotificationMessage() helper function for code reusability
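
To illustrate why packing matters for a struct that is copied directly onto the wire, here is a small sketch. The header fields are hypothetical, not the real `BinaryNotification` members, and `__attribute__((packed))` is a GCC/Clang extension. Packing removes compiler-inserted padding so both sides agree on field offsets; it does not address byte order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Without packing, the compiler may insert padding after `flags`
// so that `xfer_id` is naturally aligned.
struct LooseHeader {
    std::uint8_t flags;
    std::uint32_t xfer_id;
    std::uint16_t seq;
};

// With packing, fields are laid out back to back: 1 + 4 + 2 bytes.
struct __attribute__((packed)) PackedHeader {
    std::uint8_t flags;
    std::uint32_t xfer_id;
    std::uint16_t seq;
};

static_assert(sizeof(PackedHeader) == 7, "no padding in the packed layout");
static_assert(sizeof(LooseHeader) >= sizeof(PackedHeader),
              "the unpacked layout is never smaller");

// Copy the packed header into a byte buffer exactly as it sits in memory.
std::array<unsigned char, sizeof(PackedHeader)> toWire(const PackedHeader &h) {
    std::array<unsigned char, sizeof(PackedHeader)> buf{};
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}
```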

Modified Files:

  • src/utils/libfabric/libfabric_common.h
  • src/plugins/libfabric/libfabric_backend.h
  • src/plugins/libfabric/libfabric_backend.cpp
  • src/utils/libfabric/libfabric_rail.cpp

@akkart-aws akkart-aws changed the title from "Increase notif size" to "feat(libfabric): Implement notification fragmentation for large messages" on Dec 18, 2025
@akkart-aws akkart-aws marked this pull request as draft December 18, 2025 08:43
@akkart-aws akkart-aws marked this pull request as ready for review December 18, 2025 10:42
@akkart-aws akkart-aws force-pushed the increase_notif_size branch 2 times, most recently from b4ae63a to dd24c6d, on December 19, 2025 08:28
Collaborator

@fengjica fengjica left a comment

Hi, I'm just sending out a partial review for now, and will keep reviewing it later.

Collaborator

@amitrad-aws amitrad-aws left a comment

I think some of the intermediate commits here are misleading in that they change the behavior to something we wouldn't want to do (even as an intermediate change).
I would squash "add notification fragmentation...", "add attribute((packed))...", and "auto-calculate fragment size..." into one commit.

@akkart-aws akkart-aws force-pushed the increase_notif_size branch 4 times, most recently from 08dc93b to 87ba2c6, on December 29, 2025 10:27
fengjica previously approved these changes Jan 2, 2026
@akkart-aws akkart-aws force-pushed the increase_notif_size branch 2 times, most recently from f8f6e9c to ccde7f9, on January 5, 2026 18:43
- Update comments to accurately reflect request tracking initialization
  and notification handling
- Change `pending_notifications_` from std::map to std::unordered_map
  for O(1) lookup performance
- Improve inline documentation for PendingNotification struct fields

These changes improve code readability and maintainability without
altering functionality.
- Use emplace for map operations to avoid additional copies.
- Use pass-by-reference in fragmentNotificationMessage to prevent
a vector copy.

Signed-off-by: Arun Karthik <[email protected]>
…ders

Some of the notification fragment header fields are common across
all the notification fragments. This commit takes such fields and puts
them into a separate header for the first notification fragment.

This commit also moves the agent name from the header and combines it
with a common notification buffer.
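
One way to picture the header split described in this commit message is sketched below. The field names and sizes are hypothetical and chosen only for illustration; the agent-name handling mentioned above is not modeled. The first fragment carries the fields shared by the whole message, while later fragments carry only what is needed to route and order them, which leaves more room for payload.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical header for fragment 0: per-fragment fields plus the
// fields that are common to the whole message.
struct __attribute__((packed)) FirstFragmentHeader {
    std::uint32_t xfer_id;
    std::uint16_t seq;
    std::uint16_t total_fragments;
    std::uint32_t total_length;
};

// Hypothetical header for every later fragment: routing and ordering only.
struct __attribute__((packed)) ContinuationHeader {
    std::uint32_t xfer_id;
    std::uint16_t seq;
};

static_assert(sizeof(FirstFragmentHeader) == 12, "4 + 2 + 2 + 4 bytes");
static_assert(sizeof(ContinuationHeader) == 6, "4 + 2 bytes");

constexpr std::size_t kFragmentBytes = 256;  // chunk size from the PR description
constexpr std::size_t kFirstCapacity = kFragmentBytes - sizeof(FirstFragmentHeader);
constexpr std::size_t kNextCapacity = kFragmentBytes - sizeof(ContinuationHeader);

// Number of fragments needed for a payload of `bytes` under this layout.
constexpr std::size_t fragmentsFor(std::size_t bytes) {
    return bytes <= kFirstCapacity
               ? 1
               : 1 + (bytes - kFirstCapacity + kNextCapacity - 1) / kNextCapacity;
}

static_assert(kFirstCapacity == 244, "payload room in the first fragment");
static_assert(kNextCapacity == 250, "payload room in later fragments");
```

With these made-up header sizes, a 1,792-byte message would take 8 fragments rather than the 7 a header-free calculation suggests.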
@akkart-aws akkart-aws force-pushed the increase_notif_size branch from 2f9f3ea to 561c0a5, on January 6, 2026 09:20
@ovidiusm
Contributor

ovidiusm commented Jan 6, 2026

/build

@ovidiusm
Contributor

ovidiusm commented Jan 6, 2026

/ok to test b2b6849

@akkart-aws akkart-aws merged commit 0f4fcae into ai-dynamo:main Jan 7, 2026
17 of 18 checks passed
ovidiusm pushed a commit to ovidiusm/nixl that referenced this pull request Jan 8, 2026
…ges (ai-dynamo#1142)

* fix(libfabric): Improve hexDump

* refactor(libfabric): Move xfer_id allocation from request pool to caller

* refactor(libfabric): move xfer_id from BinaryNotification to backend handle

* refactor(libfabric): simplify pending notification map logic and fix constructor

* refactor(Libfabric): Modify sendNotifPriv to use vector of BinaryNotif

* refactor(libfabric): unify notif processing for all expected_completions values

* feat(libfabric): add notification fragmentation on sender and reassembly on receiver

* refactor(libfabric): rename method names and some code clarity

- Update comments to accurately reflect request tracking initialization
and notification handling
- Change `pending_notifications_` from std::map to std::unordered_map
  for O(1) lookup performance
- Improve inline documentation for PendingNotification struct fields

These changes improve code readability and maintainability without
altering functionality.

* rev(libfabric): Fix comments on the notification refactor

- Use emplace for map operations to avoid additional copies.
- Use pass-by-reference in fragmentNotificationMessage to prevent
a vector copy.

Signed-off-by: Arun Karthik <[email protected]>

* rev(libfabric): Split notification fragments header into separate headers

Some of the notification fragment header fields are common across
all the notification fragments. This commit takes such fields and puts
them into a separate header for the first notification fragment.

This commit also moves the agent name from the header and combines it
with a common notification buffer.

* rev(Libfabric): Remove Submitted requests as function callback

Signed-off-by: Arun Karthik <[email protected]>

* fix(libfabric): Update copyright year

---------

Signed-off-by: Arun Karthik <[email protected]>
@akkart-aws akkart-aws deleted the increase_notif_size branch January 9, 2026 19:05
nv-nmailhot pushed a commit that referenced this pull request Jan 12, 2026
…ges (#1142) (#1182)

Signed-off-by: Arun Karthik <[email protected]>
Co-authored-by: Arun Karthik <[email protected]>
Co-authored-by: Mikhail Brinskiy <[email protected]>