-
Notifications
You must be signed in to change notification settings - Fork 224
feat(libfabric): Implement notification fragmentation for large messages #1142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi akkart-aws! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
1fab141 to
a1c185a
Compare
b4ae63a to
dd24c6d
Compare
fengjica
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I'm just sending out a partial review for now, and will keep reviewing it later.
amitrad-aws
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the intermidiate commits here are misleading in that they change the behavior to something we wouldn't want to do (even as an intermidiate change).
I would squash "add notification fragmentation...", "add attribute((packed))...", "auto-calculate fragment size..." into one commit.
08dc93b to
87ba2c6
Compare
87ba2c6 to
8f7edaf
Compare
8f7edaf to
629b15a
Compare
f8f6e9c to
ccde7f9
Compare
- Update comments to accurately reflect request tracking initialization and notification handling - Change `pending_notifications_` from std::map to std::unordered_map for O(1) lookup performance - Improve inline documentation for PendingNotification struct fields These changes improve code readability and maintainability without altering functionality.
- Use Emplace for map operations to avoid additional copies. - Use passbyreference in fragmentNotificatioMessage to prevent vector copy. Signed-off-by: Arun Karthik <[email protected]>
…ders Some of the notification fragment header fields are common across all the notification fragments. This commit takes such fields and puts them into a spearate header for first notification fragment. This commit also moves the agent name from the header and combines it with a common notification buffer.
Signed-off-by: Arun Karthik <[email protected]>
2f9f3ea to
561c0a5
Compare
561c0a5 to
b2b6849
Compare
|
/build |
|
/ok to test b2b6849 |
…ges (ai-dynamo#1142) * fix(libfabric): Improve hexDump * refactor(libfabric): Move xfer_id allocation from request pool to caller * refactor(libfabric): move xfer_id from BinaryNotification to backend handle * refactor(libfabric): simplify pending notification map logic and fix constructor * refactor(Libfabric): Modify sendNotifPriv to use vector of BinaryNotif * refac(libfaric): unify notif processing for all expected_completions values * feat(libfabric): add notification fragmentation on sender and reassembly on receiver * refactor(libfabric): rename method names and some code clarity - Update comments to accurately reflect request tracking initialization and notification handling - Change `pending_notifications_` from std::map to std::unordered_map for O(1) lookup performance - Improve inline documentation for PendingNotification struct fields These changes improve code readability and maintainability without altering functionality. * rev(libfabric): Fix comments on the notification refactor - Use Emplace for map operations to avoid additional copies. - Use passbyreference in fragmentNotificatioMessage to prevent vector copy. Signed-off-by: Arun Karthik <[email protected]> * rev(libfabric): Split notification fragments header into separate headers Some of the notification fragment header fields are common across all the notification fragments. This commit takes such fields and puts them into a spearate header for first notification fragment. This commit also moves the agent name from the header and combines it with a common notification buffer. * rev(Libfabric): Remove Submitted requests as function callback Signed-off-by: Arun Karthik <[email protected]> * fix(libfabric): Update copyright year --------- Signed-off-by: Arun Karthik <[email protected]>
…ges (#1142) (#1182) * fix(libfabric): Improve hexDump * refactor(libfabric): Move xfer_id allocation from request pool to caller * refactor(libfabric): move xfer_id from BinaryNotification to backend handle * refactor(libfabric): simplify pending notification map logic and fix constructor * refactor(Libfabric): Modify sendNotifPriv to use vector of BinaryNotif * refac(libfaric): unify notif processing for all expected_completions values * feat(libfabric): add notification fragmentation on sender and reassembly on receiver * refactor(libfabric): rename method names and some code clarity - Update comments to accurately reflect request tracking initialization and notification handling - Change `pending_notifications_` from std::map to std::unordered_map for O(1) lookup performance - Improve inline documentation for PendingNotification struct fields These changes improve code readability and maintainability without altering functionality. * rev(libfabric): Fix comments on the notification refactor - Use Emplace for map operations to avoid additional copies. - Use passbyreference in fragmentNotificatioMessage to prevent vector copy. * rev(libfabric): Split notification fragments header into separate headers Some of the notification fragment header fields are common across all the notification fragments. This commit takes such fields and puts them into a spearate header for first notification fragment. This commit also moves the agent name from the header and combines it with a common notification buffer. * rev(Libfabric): Remove Submitted requests as function callback * fix(libfabric): Update copyright year --------- Signed-off-by: Arun Karthik <[email protected]> Co-authored-by: Arun Karthik <[email protected]> Co-authored-by: Mikhail Brinskiy <[email protected]>
What?
Implemented notification message fragmentation in the NIXL libfabric backend to support unlimited notification sizes by splitting messages into 256-byte chunks.
Why?
TensorRT-LLM's disaggregated serving architecture requires exchanging connection metadata between nodes. On P5 instances with 32 EFA devices, this connection data is approximately 1,792 bytes (32 devices × 56 bytes per endpoint), which exceeds the previous 1,024-byte notification limit, causing connection establishment to fail.
This change enables TensorRT-LLM to work on P5 instances and scales to support 256+ rails per node (~14KB connection data).
How?
Sender-Side:
prepXfer(), splitting messages into 256-byte chunksnotifSendPriv()sends all fragments with sequence metadataReceiver-Side:
checkPendingNotifications()reassembles complete messagesAdditional Improvements:
__attribute__((packed))to BinaryNotification for cross-platform compatibilityfragmentNotificationMessage()helper function for code reusabilityModified Files:
src/utils/libfabric/libfabric_common.hsrc/plugins/libfabric/libfabric_backend.hsrc/plugins/libfabric/libfabric_backend.cppsrc/utils/libfabric/libfabric_rail.cpp