posix: Implement batched descriptor submission with 16-descriptor limit #1113
Conversation
👋 Hi benlwalker! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Limit POSIX queue submissions to 16 descriptors outstanding at a time. The batching is transparent to callers - they continue polling checkXfer() until the batch completes - but this does mean that users need to actually call checkXfer to make forward progress. We expect to later introduce a progress thread to the POSIX backend so that progress will be made regardless of whether the user calls checkXfer.

Some network attached filesystems are exhibiting strange performance behavior when we send too many requests to a single file. This solves the issue, but keeps enough requests outstanding to still efficiently reach line rate.

Signed-off-by: Ben Walker <[email protected]>
vvenkates27 left a comment
Thanks for this PR. The overall batching approach looks good, but the state management between xfers needs to be fixed for correctness before merging.
    return NIXL_SUCCESS;
}

nixl_status_t aioQueue::checkCompleted() {
Completed is not reset. I think this may be a bug. Without this reset, subsequent transfers will skip checking IOs because completed[i] remains true from the previous transfer, causing an infinite loop in checkCompleted().
Probably need something like:

    num_completed = 0;
    completed.assign(num_entries, false);
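As a self-contained illustration of the reset being suggested (DemoQueue, complete(), and main() are hypothetical stand-ins, not nixl code; only num_completed, completed, and num_entries come from the snippet above):

    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for one of the POSIX queues; shows the state
    // reset the review asks for, so a reused queue object does not see
    // stale completed[i] == true entries from the previous transfer.
    class DemoQueue {
    public:
        explicit DemoQueue(int entries)
            : num_entries(entries), completed(entries, false) {}

        // Stand-in for harvesting one real completion event.
        void complete(int i) {
            if (!completed[i]) {
                completed[i] = true;
                ++num_completed;
            }
        }

        // Returns true once every IO in the batch has completed, and
        // resets the per-transfer state so the next transfer starts clean.
        bool checkCompleted() {
            if (num_completed < num_entries)
                return false;
            num_completed = 0;
            completed.assign(num_entries, false); // the reset suggested above
            return true;
        }

    private:
        int num_entries;
        int num_completed = 0;
        std::vector<bool> completed;
    };

    int main() {
        DemoQueue q(2);
        q.complete(0);
        q.complete(1);
        std::cout << std::boolalpha << q.checkCompleted() << "\n"; // true
        std::cout << q.checkCompleted() << "\n"; // false: state was reset
    }

The second checkCompleted() call returns false instead of spinning on stale state, which is exactly the reuse case described above.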
}
return NIXL_ERR_BACKEND;
// If we hit the kernel limit, stop submitting and return partial success
NIXL_ERROR << absl::StrFormat(
Consider using NIXL_INFO? Since EAGAIN is an expected condition that the batching logic handles gracefully (returning NIXL_SUCCESS with a partial count), logging it as an error may be misleading?
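As a sketch of that suggestion, assuming the submission call is liburing's io_uring_submit() (which returns the number of SQEs accepted, or -errno) and reusing the NIXL logging and status identifiers from the diff; the surrounding shape is an assumption, not the actual function:

    // -EAGAIN means the kernel is momentarily out of resources; the
    // batching logic retries on the next poll, so this is expected
    // backpressure rather than an error.
    int ret = io_uring_submit(&uring);
    if (ret == -EAGAIN) {
        NIXL_INFO << "io_uring submit: EAGAIN, will retry on next checkXfer";
        return NIXL_SUCCESS;   // partial success; remaining IOs stay queued
    } else if (ret < 0) {
        NIXL_ERROR << absl::StrFormat("io_uring submit failed: %d", ret);
        return NIXL_ERR_BACKEND;
    }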
        // More IOs to submit, signal to caller to submit more
        return NIXL_IN_PROG;
    }
    // All IOs submitted and completed

num_ios_submitted_total needs to be zeroed here to enable the next sending attempt. The same applies to the other queues.

+1
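Concretely, the requested reset might sit in that fully-drained branch, something like this sketch (field names are from the diffs in this PR; the placement and the return value are assumptions):

    // All IOs submitted and completed: zero the per-transfer counters so
    // the next transfer on this queue can submit again (placement assumed).
    num_ios_submitted_total = 0;
    num_ios_to_submit = 0;
    return NIXL_SUCCESS;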
    return NIXL_SUCCESS;
}

nixl_status_t UringQueue::checkCompleted() {
Same issue as the other two APIs.
// io_submit can return partial submissions
submitted_count = ret;
if (ret != count) {
    NIXL_ERROR << absl::StrFormat("linux_aio submit partial: %d/%d", ret, count);
Same question here: NIXL_ERROR is misleading for an expected partial submission.
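For reference, the partial-submission path with the suggested level might read like this sketch (ret, count, and submitted_count are from the diff; ctx and iocbs are assumed locals; io_submit() semantics are from libaio, which returns how many control blocks were actually queued):

    int ret = io_submit(ctx, count, iocbs);   // may queue fewer than count
    if (ret < 0) {
        NIXL_ERROR << absl::StrFormat("linux_aio submit failed: %d", ret);
        return NIXL_ERR_BACKEND;
    }
    submitted_count = ret;
    if (ret != count) {
        // Expected when the kernel queue is momentarily full; the rest of
        // the batch is resubmitted on a later poll, so info level suffices.
        NIXL_INFO << absl::StrFormat("linux_aio submit partial: %d/%d", ret, count);
    }
    return NIXL_SUCCESS;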
nixl_status_t UringQueue::checkCompleted() {
    if (num_completed == num_entries) {
    // Check if all IOs are submitted and completed
    if (num_ios_submitted_total >= num_ios_to_submit && num_completed == num_ios_to_submit) {
We use if (num_ios_submitted_total >= num_ios_to_submit && num_ios_outstanding == 0) in aio_queue.cpp; it probably makes sense to keep them uniform, since num_ios_outstanding is managed by the base class and is more reliable.
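That would make the uring-side check read roughly (a sketch; the branch body is assumed):

    // Same completion test as aio_queue.cpp: rely on the base-class-managed
    // outstanding counter rather than comparing num_completed directly.
    if (num_ios_submitted_total >= num_ios_to_submit && num_ios_outstanding == 0) {
        // All IOs submitted and completed
        return NIXL_SUCCESS;
    }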
int num_ios_to_submit = 0;
int num_ios_submitted_total = 0;
int num_ios_outstanding = 0;
static constexpr int MAX_IO_OUTSTANDING = 16;
Add some comments on why we need this; it's probably a good idea to add a README for the POSIX plugin as the knobs and usage increase.
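For instance, the declaration could carry the rationale from this PR's description (the comment wording is a suggestion drawn from the Why section below):

    // Cap on descriptors handed to the kernel at any one time. Some
    // network-attached filesystems degrade when too many requests target
    // a single file; 16 is small enough to avoid that while keeping
    // enough IOs in flight to reach line rate (see PR #1113).
    static constexpr int MAX_IO_OUTSTANDING = 16;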
What?
Limit POSIX queue submissions to 16 descriptors outstanding at a time.
Why?
Some network-attached filesystems exhibit strange performance behavior when we send too many requests to a single file rather than spreading the work evenly across all the files. This change solves the issue while keeping enough requests outstanding to still efficiently reach line rate with a single file. On a local file in /tmp, I also see greatly improved performance, which was unexpected:
Linux AIO with batch size 32 and I/O size 1MB - 3.47x faster than current main branch
URING with batch size 32 and I/O size 1MB - 22x faster than current main branch
How?
It only submits up to 16 elements out of a batch at any given time. The batching is transparent to callers - they continue polling checkXfer() until the batch completes - but this does mean that users need to actually call checkXfer to make forward progress. We expect to later introduce a progress thread to the POSIX backend so that progress will be made regardless of whether the user calls checkXfer.
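To make the mechanism concrete, here is a minimal, self-contained model of that sliding window (illustrative only: WindowedQueue, post(), and poll() are hypothetical stand-ins rather than the nixl API; only the 16-descriptor cap, the counter names, and the poll-until-done contract come from this PR):

    #include <algorithm>
    #include <iostream>

    // Hypothetical model of the sliding-window batching: at most
    // MAX_IO_OUTSTANDING of num_ios_to_submit descriptors are in flight
    // at once; each poll reaps completions and tops the window back up.
    struct WindowedQueue {
        static constexpr int MAX_IO_OUTSTANDING = 16;
        int num_ios_to_submit = 0;
        int num_ios_submitted_total = 0;
        int num_ios_outstanding = 0;

        void post(int count) { num_ios_to_submit = count; }

        // One poll step, analogous to one checkXfer() call. Returns true
        // once the whole batch has been submitted and completed.
        bool poll() {
            // Pretend every outstanding IO completed since the last poll.
            num_ios_outstanding = 0;
            // Top the window back up, respecting the outstanding cap.
            int remaining = num_ios_to_submit - num_ios_submitted_total;
            int to_submit = std::min(remaining, MAX_IO_OUTSTANDING);
            num_ios_submitted_total += to_submit;
            num_ios_outstanding += to_submit;
            if (num_ios_submitted_total >= num_ios_to_submit &&
                num_ios_outstanding == 0) {
                num_ios_submitted_total = 0; // reset for the next transfer
                return true;
            }
            return false;
        }
    };

    int main() {
        WindowedQueue q;
        q.post(40); // 40 descriptors, submitted in windows of at most 16
        int polls = 1;
        while (!q.poll())
            ++polls;
        std::cout << "batch completed after " << polls << " polls\n";
    }

With 40 descriptors the model submits 16, 16, then 8, and needs one final poll to observe the last completions - the same reason real callers must keep calling checkXfer until the batch drains.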
I ran data verification tests for all impacted use cases with nixlbench and they pass.