Fix POSIX backend queue fallback handling #1605

Open
aknvda wants to merge 7 commits into main from aniket/posix-backend-fixes

Conversation


@aknvda aknvda commented May 1, 2026

What

  • Guard POSIX postXfer against an unavailable I/O queue.
  • Treat POSIX backend initialization as failed when the selected queue type cannot create a queue.
  • Let POSIX tests use the backend compiled default queue instead of forcing Linux AIO.
  • Make POSIX request-handle release null-safe and delete the allocated handle.

Why

Some build environments have POSIX AIO available but not Linux AIO/liburing. The existing test path forced Linux AIO and could hit a null queue, causing a SIGSEGV in nixlPosixBackendReqH::postXfer.
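
The failure mode and the two guards described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Engine`, `IoQueue`, `makeQueue`, and the `Status` enum are hypothetical stand-ins, and the factory is assumed to return a null pointer when the requested queue type is not compiled in.

```cpp
#include <iostream>
#include <memory>
#include <string>

// Hypothetical status codes; the PR itself mentions NIXL_ERR_BACKEND.
enum class Status { Success, ErrBackend };

// Hypothetical stand-in for the backend's I/O queue interface.
struct IoQueue {
    virtual ~IoQueue() = default;
    virtual Status post() = 0;
};

struct PosixAioQueue final : IoQueue {
    Status post() override { return Status::Success; }
};

// Hypothetical factory: only "posix_aio" is available in this sketch, so any
// other type yields nullptr, as Linux AIO would on the affected builds.
inline std::unique_ptr<IoQueue> makeQueue(const std::string &type) {
    if (type == "posix_aio") {
        return std::make_unique<PosixAioQueue>();
    }
    return nullptr;
}

class Engine {
public:
    explicit Engine(const std::string &queue_type) : io_queue_(makeQueue(queue_type)) {
        if (!io_queue_) {
            // Fail fast: record the error at construction instead of leaving
            // a null queue to be discovered by the first transfer.
            std::cerr << "failed to create I/O queue of type '" << queue_type << "'\n";
            init_err_ = true;
        }
    }

    bool initErr() const { return init_err_; }

    Status postXfer() {
        if (!io_queue_) {
            // Defensive guard: report an error rather than dereference null.
            std::cerr << "postXfer called without an I/O queue\n";
            return Status::ErrBackend;
        }
        return io_queue_->post();
    }

private:
    std::unique_ptr<IoQueue> io_queue_;
    bool init_err_ = false;
};
```

With both guards in place, a caller that checks `initErr()` never reaches `postXfer`, and a caller that skips the check gets an error status instead of a crash.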

Testing

  • git diff --check
  • meson setup --wipe build-posix -Dbuildtype=debug -Dbuild_tests=true -Dbuild_examples=false -Denable_plugins=POSIX
  • ninja -C build-posix
  • meson test -C build-posix nixl:posix_plugin_test --print-errorlogs

Split out from the LIBFABRIC postXfer thread-pool PR (#1581) so the POSIX fix can be reviewed independently.

Summary by CodeRabbit

  • Bug Fixes

    • Improved POSIX backend initialization with additional checks, clearer error logging, and fail-fast behavior when required I/O queues are unavailable.
    • Safer request cleanup with early-return on missing handles to prevent invalid operations.
  • Tests

    • POSIX tests now default to the compiled queue; non-io_uring paths report "default" instead of "AIO" and show clearer guidance for explicit io_uring.
    • Added compile-gated helpers to validate supported POSIX queue options.
  • Chores

    • Test build passes conditional compile-time flags for POSIX test variants.
    • CI test matrix updated to target a new Slurm head node for job allocation.

Review Change Stack

Guard POSIX request posting against unavailable I/O queues and let tests use the backend's compiled default queue instead of forcing Linux AIO.

Co-authored-by: OpenAI Codex <codex@openai.com>

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds null-checking for the POSIX backend I/O queue in postXfer, fails engine initialization when a requested queue type cannot be instantiated, changes request-handle cleanup to delete allocated objects (and early-return on null), and updates tests and Meson wiring to use and validate the backend's compiled default queue.

Changes

POSIX Backend Changes

| Layer / File(s) | Summary |
|---|---|
| Core Safety (src/plugins/posix/posix_backend.cpp) | nixlPosixBackendReqH::postXfer() now checks io_queue_ for null, logs an error, and returns NIXL_ERR_BACKEND if uninitialized. |
| Initialization / Wiring (src/plugins/posix/posix_backend.cpp) | nixlPosixEngine constructor now checks that io_queue_ was successfully created; on null sets initErr and logs the requested queue type. |
| Resource Cleanup (src/plugins/posix/posix_backend.cpp) | nixlPosixEngine::releaseReqH returns early for null handles and deletes the nixlPosixBackendReqH handle instead of calling its destructor directly; preserves existing exception handling. |
| Tests & Build (test/unit/plugins/posix/meson.build, test/unit/plugins/posix/nixl_posix_test.cpp) | Meson now sets -DHAVE_LINUXAIO / -DHAVE_LIBURING / -DHAVE_POSIXAIO for the test target when available; tests add helpers to validate requested queue support, stop forcing use_aio="true" for non-io_uring runs, and update help/error text and printed backend label to reflect the compiled default queue. |
| CI config (.ci/jenkins/lib/test-dl-matrix.yaml) | Update SLURM_HEAD_NODE value to dlcluster-login-03.nvidia.com. |
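
The Resource Cleanup row is easier to follow with a small example. An explicit destructor call runs an object's cleanup but does not return storage obtained with new, so a handle released that way leaks; delete does both. The sketch below assumes hypothetical `BackendReqH` and `PosixReqH` types and a hypothetical `Status` enum. The value returned for a null handle is also an assumption, since the summary only says the function returns early.

```cpp
// Hypothetical status codes for this sketch only.
enum class Status { Success, ErrInvalidParam };

// Hypothetical base handle type with a virtual destructor, so that deleting
// through a base pointer runs the derived destructor.
struct BackendReqH {
    virtual ~BackendReqH() = default;
};

// Tracks live handles through a caller-owned counter so the effect of a
// release is observable.
struct PosixReqH final : BackendReqH {
    explicit PosixReqH(int &live) : live_(live) { ++live_; }
    ~PosixReqH() override { --live_; }
    int &live_;
};

inline Status releaseReqH(BackendReqH *handle) {
    if (handle == nullptr) {
        // Early return on a missing handle (the returned code is an assumption).
        return Status::ErrInvalidParam;
    }
    // delete runs the destructor and frees the allocation; calling
    // handle->~BackendReqH() would only do the former.
    delete handle;
    return Status::Success;
}
```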

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

external-contribution, size/M

Suggested reviewers

  • vvenkates27
  • w1ldptr
  • brminich

Poem

🐰 I sniffed a missing queue in the night,
A quiet null that gave quite a fright.
I logged the clue, returned with care,
Deleted loose crumbs, left defaults to fare.
Hop, hop—tests now know what queues are there.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing POSIX backend queue fallback handling through null-safety checks and queue initialization validation. |
| Description check | ✅ Passed | The PR description follows the template structure with clear What/Why sections; it explains the core issue, the solution approach, and provides specific testing steps. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@aknvda aknvda requested a review from aranadive May 1, 2026 00:05
Comment thread src/plugins/posix/posix_backend.cpp Outdated
Comment thread src/plugins/posix/posix_backend.cpp Outdated
Split POSIX queue initialization errors into separate logs and clarify request-handle deletion by using the POSIX cast only for validation.

Co-authored-by: OpenAI Codex <codex@openai.com>
@ai-dynamo ai-dynamo deleted a comment from AniketKul May 5, 2026
@ai-dynamo ai-dynamo deleted a comment from AniketKul May 5, 2026
Comment thread test/unit/plugins/posix/nixl_posix_test.cpp Outdated
aknvda added a commit that referenced this pull request May 26, 2026
The POSIX backend fallback fix is split into PR #1605, so keep this branch scoped to the LIBFABRIC postXfer thread-pool feature.

Co-authored-by: OpenAI Codex <codex@openai.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/unit/plugins/posix/meson.build`:
- Around line 19-24: The meson build only emits HAVE_LINUXAIO and HAVE_LIBURING,
so add the POSIX AIO compile define where posix_test_compile_defs is built:
check the has_posix_aio build variable and append a define like
'-DHAVE_POSIXAIO' alongside the existing has_linux_aio / has_io_uring branches
so nixl_posix_test.cpp can detect POSIXAIO as a supported default queue; apply
the same change to the other occurrence noted around the second block (the one
at 29-29).

In `@test/unit/plugins/posix/nixl_posix_test.cpp`:
- Around line 85-87: The changed blocks have clang-format/style violations;
reformat the modified else/print blocks (e.g., the block that prints "this build
does not include Linux AIO or io_uring support." and the nearby std::endl/stream
output lines) and the other similar regions noted by running the repository's
clang-format configuration or applying the project's formatting script so
spacing, indentation, and line breaks match repo style; ensure you update the
affected statements in nixl_posix_test.cpp (the std::cout/std::cerr + std::endl
lines) to conform to the formatter and re-run CI locally to verify formatting
passes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 02649959-ea59-4880-9214-19fa50222221

📥 Commits

Reviewing files that changed from the base of the PR and between 77635d6 and 9741147.

📒 Files selected for processing (2)
  • test/unit/plugins/posix/meson.build
  • test/unit/plugins/posix/nixl_posix_test.cpp

Comment thread test/unit/plugins/posix/meson.build
Comment thread test/unit/plugins/posix/nixl_posix_test.cpp
Teach the POSIX plugin test which queue implementations were compiled in and reject POSIX AIO-only or unavailable io_uring configurations at startup with a clear error message.

Co-authored-by: OpenAI Codex <codex@openai.com>
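
One way to read this commit message is sketched below. The macro names HAVE_LINUXAIO, HAVE_LIBURING, and HAVE_POSIXAIO come from the walkthrough above; everything else (`QueueSupport`, `compiledSupport`, `checkQueueRequest`, and the io_uring error text) is hypothetical. The policy is also an assumption: the default path is taken to need Linux AIO or io_uring, matching the error text quoted in the review comment, and an explicit io_uring request is taken to need liburing.

```cpp
#include <string>

// Flags the build system would pass to the test target, e.g. -DHAVE_LINUXAIO.
#ifdef HAVE_LINUXAIO
constexpr bool kHaveLinuxAio = true;
#else
constexpr bool kHaveLinuxAio = false;
#endif

#ifdef HAVE_LIBURING
constexpr bool kHaveIoUring = true;
#else
constexpr bool kHaveIoUring = false;
#endif

#ifdef HAVE_POSIXAIO
constexpr bool kHavePosixAio = true;
#else
constexpr bool kHavePosixAio = false;
#endif

// Hypothetical summary of which queue implementations were compiled in.
struct QueueSupport {
    bool linux_aio;
    bool io_uring;
    bool posix_aio;
};

constexpr QueueSupport compiledSupport() {
    return {kHaveLinuxAio, kHaveIoUring, kHavePosixAio};
}

// Returns an empty string when the request can be served, otherwise an error
// message suitable for printing at startup.
inline std::string checkQueueRequest(const QueueSupport &s, bool want_io_uring) {
    if (want_io_uring) {
        return s.io_uring
            ? ""
            : "io_uring was requested but this build does not include liburing support.";
    }
    if (!s.linux_aio && !s.io_uring) {
        // Covers POSIX AIO-only builds as well as builds with no queue at all.
        return "this build does not include Linux AIO or io_uring support.";
    }
    return "";
}
```

A test binary could call `checkQueueRequest(compiledSupport(), want_io_uring)` before creating the backend and exit with the returned message if it is non-empty. Taking the support flags as a parameter keeps the policy testable without rebuilding with different defines.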
@aknvda aknvda force-pushed the aniket/posix-backend-fixes branch from 9741147 to 87b9a84 Compare May 26, 2026 22:48
ofer previously approved these changes May 27, 2026
@aranadive
Contributor

/build

aranadive previously approved these changes May 27, 2026
@aranadive aranadive enabled auto-merge (squash) May 27, 2026 08:02

ofer commented May 27, 2026

/build

Modified to reflect a different Slurm server for CI

Signed-off-by: Ofer Achler <ofer.achler@gmail.com>
@ofer ofer dismissed stale reviews from aranadive and themself via f391bad May 27, 2026 18:37
@ofer ofer requested a review from a team as a code owner May 27, 2026 18:37
Comment thread src/plugins/posix/posix_backend.cpp Outdated
Comment thread src/plugins/posix/posix_backend.cpp Outdated
Reverted the head node back to the general cluster

Signed-off-by: Ofer Achler <ofer.achler@gmail.com>

ofer commented May 28, 2026

/ok to test 3a8d073


ofer commented May 28, 2026

/build

1 similar comment
@Alexey-Rivkin
Contributor

/build

Avoid RTTI in POSIX request-handle paths and mark the defensive missing I/O queue check as unlikely.

Co-authored-by: OpenAI Codex <codex@openai.com>
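
The diff for this commit is not shown in the conversation, so the following is only a guess at its shape: a static_cast in place of a dynamic_cast on the request-handle path, plus a branch hint on the null-queue check. All names here are hypothetical, including the `SKETCH_UNLIKELY` macro, which wraps the GCC/Clang `__builtin_expect` builtin and degrades to a plain condition elsewhere.

```cpp
// Hypothetical branch-hint macro for this sketch.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical status codes; the PR itself mentions NIXL_ERR_BACKEND.
enum class Status { Success, ErrBackend };

// Hypothetical queue that just counts posted requests.
struct IoQueue {
    int posted = 0;
};

struct BackendReqH {
    virtual ~BackendReqH() = default;
};

struct PosixReqH final : BackendReqH {
    IoQueue *io_queue = nullptr;

    Status postXfer() {
        // The queue should always exist once initialization has succeeded,
        // so the check stays as a defensive guard hinted as the cold path.
        if (SKETCH_UNLIKELY(io_queue == nullptr)) {
            return Status::ErrBackend;
        }
        ++io_queue->posted;
        return Status::Success;
    }
};

inline Status postThroughBase(BackendReqH *handle) {
    // static_cast performs no run-time type check. It is only valid because,
    // in this sketch, every handle passed in was created as a PosixReqH.
    auto *posix = static_cast<PosixReqH *>(handle);
    return posix->postXfer();
}
```

The trade-off is that static_cast gives up the run-time validation dynamic_cast would provide, so it relies on the engine only ever receiving handles it created itself.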
6 participants