
Conversation

@benlwalker
Contributor

What?

Limit POSIX queue submissions to 16 descriptors outstanding at a time.

Why?

Some network-attached filesystems exhibit strange performance behavior when we send too many requests to a single file rather than spreading the work more evenly across all the files. This change solves the issue while keeping enough requests outstanding to still efficiently reach line rate with a single file. On a local file in /tmp, I get greatly improved performance as well, which was unexpected:

- Linux AIO with batch size 32 and I/O size 1MB: 3.47x faster than current main branch
- URING with batch size 32 and I/O size 1MB: 22x faster than current main branch

How?

The backend submits at most 16 elements of a batch at any given time. The batching is transparent to callers - they continue polling checkXfer() until the batch completes - but this does mean that users must actually call checkXfer() to make forward progress. We expect to later introduce a progress thread to the POSIX backend so that progress is made regardless of whether the user calls checkXfer().
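To make the mechanism concrete, here is a minimal sketch of the throttling pattern, not the PR's actual code: the member names (num_ios_to_submit, num_ios_submitted_total, num_ios_outstanding, MAX_IO_OUTSTANDING) match those discussed later in this review, while QueueSketch, submitSome(), and reapSome() are hypothetical stand-ins.

```cpp
#include <algorithm>

// Stand-ins for the real NIXL status codes.
enum nixl_status_t { NIXL_SUCCESS, NIXL_IN_PROG };

struct QueueSketch {
    static constexpr int MAX_IO_OUTSTANDING = 16; // cap on in-flight IOs

    int num_ios_to_submit = 0;       // total IOs in the batch
    int num_ios_submitted_total = 0; // IOs handed to the kernel so far
    int num_ios_outstanding = 0;     // submitted but not yet completed

    int submitSome(int n); // hypothetical: submit up to n IOs, returns count accepted
    int reapSome();        // hypothetical: returns count of IOs that just completed

    // One polling step; checkXfer() would call this until it stops
    // returning NIXL_IN_PROG. No progress happens without these calls.
    nixl_status_t poll() {
        num_ios_outstanding -= reapSome();

        // Top up to the cap, but never past the end of the batch.
        int remaining = num_ios_to_submit - num_ios_submitted_total;
        int room = MAX_IO_OUTSTANDING - num_ios_outstanding;
        int to_submit = std::min(remaining, room);
        if (to_submit > 0) {
            int submitted = submitSome(to_submit); // may be partial (e.g. EAGAIN)
            num_ios_submitted_total += submitted;
            num_ios_outstanding += submitted;
        }

        if (num_ios_submitted_total >= num_ios_to_submit && num_ios_outstanding == 0)
            return NIXL_SUCCESS; // whole batch done
        return NIXL_IN_PROG;     // caller must keep polling
    }
};
```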

I ran data verification tests for all impacted use cases with nixlbench and they pass.


@anton-nayshtut
Contributor

+1

// More IOs to submit, signal to caller to submit more
return NIXL_IN_PROG;
}
// All IOs submitted and completed
Contributor

num_ios_submitted_total needs to be zeroed here to enable the next sending attempt. The same applies to the other queues.
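For illustration, a hedged sketch of the suggested zeroing in the all-completed branch quoted above (member names are from this PR; the surrounding code is assumed):

```cpp
// All IOs submitted and completed: zero the running total so the next
// transfer on this queue can start submitting again.
num_ios_submitted_total = 0;
return NIXL_SUCCESS;
```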

Contributor

+1

Limit POSIX queue submissions to 16 descriptors outstanding at a
time. The batching is transparent to callers - they continue polling
checkXfer() until the batch completes - but this does mean that users
need to actually call checkXfer to make forward progress. We expect
to later introduce a progress thread to the POSIX backend so that
progress will be made regardless of whether the user calls checkXfer.

Some network attached filesystems are exhibiting strange performance
behavior when we send too many requests to a single file. This solves
the issue, but keeps enough requests outstanding to still efficiently
reach line rate.

Signed-off-by: Ben Walker <[email protected]>
@vvenkates27
Contributor

Thanks for this PR. The overall batching approach looks good, but state management between transfers needs to be fixed for correctness before merging.

return NIXL_SUCCESS;
}

nixl_status_t aioQueue::checkCompleted() {
Contributor

completed is not reset. I think this may be a bug: without the reset, subsequent transfers will skip checking IOs because completed[i] remains true from the previous transfer, causing an infinite loop in checkCompleted(). Probably need something like:

num_completed = 0;
completed.assign(num_entries, false);
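A hedged sketch of where such a reset could live, assuming a per-transfer preparation step (the function name resetCompletionState is hypothetical):

```cpp
// Hypothetical hook, run when a new transfer is prepared on this queue.
// Without it, completed[i] == true left over from the previous transfer
// makes checkCompleted() skip those IOs and spin forever.
void aioQueue::resetCompletionState() {
    num_completed = 0;
    completed.assign(num_entries, false);
}
```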

}
return NIXL_ERR_BACKEND;
// If we hit the kernel limit, stop submitting and return partial success
NIXL_ERROR << absl::StrFormat(
Contributor

Consider using NIXL_INFO? Since EAGAIN is an expected condition that the batching logic handles gracefully (returning NIXL_SUCCESS with a partial count), logging it as an error may be misleading.
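A hedged sketch of the suggested downgrade, assuming the EAGAIN branch looks roughly like the snippet above (the message text is illustrative):

```cpp
if (ret == -EAGAIN) {
    // Hitting the kernel's queue limit is expected under throttled batching:
    // report the partial submission at info level, not as an error.
    NIXL_INFO << absl::StrFormat("linux_aio submit hit kernel limit: %d/%d submitted",
                                 submitted_count, count);
    return NIXL_SUCCESS; // partial success; the remainder is retried on the next poll
}
```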

// More IOs to submit, signal to caller to submit more
return NIXL_IN_PROG;
}
// All IOs submitted and completed
Contributor

+1

return NIXL_SUCCESS;
}

nixl_status_t UringQueue::checkCompleted() {
Contributor

Same issue as the other two APIs.

// io_submit can return partial submissions
submitted_count = ret;
if (ret != count) {
NIXL_ERROR << absl::StrFormat("linux_aio submit partial: %d/%d", ret, count);
Contributor

Same question here: NIXL_ERROR is misleading.

nixl_status_t UringQueue::checkCompleted() {
if (num_completed == num_entries) {
// Check if all IOs are submitted and completed
if (num_ios_submitted_total >= num_ios_to_submit && num_completed == num_ios_to_submit) {
Contributor

We use if (num_ios_submitted_total >= num_ios_to_submit && num_ios_outstanding == 0) in aio_queue.cpp; it probably makes sense to keep the two uniform, since num_ios_outstanding is managed by the base class and is more reliable.

int num_ios_to_submit = 0;
int num_ios_submitted_total = 0;
int num_ios_outstanding = 0;
static constexpr int MAX_IO_OUTSTANDING = 16;
Contributor

Add some comments on why we need this; it's probably a good idea to add a README for the POSIX plugin as the knobs and usage increase.
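On that request, a hedged sketch of how these members could be documented in place (the wording is an assumption drawn from the PR description, not the author's):

```cpp
int num_ios_to_submit = 0;       // total IOs in the current transfer
int num_ios_submitted_total = 0; // IOs handed to the kernel so far
int num_ios_outstanding = 0;     // submitted but not yet completed (managed by the base class)
// Cap on in-flight IOs per queue: some network-attached filesystems degrade
// when too many requests target a single file, and 16 is still enough to
// reach line rate on one file (see the PR description).
static constexpr int MAX_IO_OUTSTANDING = 16;
```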
