
fix: improve vLLM plugin compatibility and NCCL receive handling #109

Merged

chaokunyang merged 1 commit into inclusionAI:main from garrett4wade:main on Apr 22, 2026
Conversation

@garrett4wade (Contributor)

Summary

  • Add compatibility paths in awex/vllm_plugin.py for newer vLLM OpenAI protocol/router changes, and ensure Awex routes are attached by patching build_app when a shared router is absent (see the router-patching sketch below).
  • Update the NCCL reader test flow to copy received data into non-contiguous receive tensors, matching production recv behavior (see the staging-copy sketch below).
  • Harden config and process-group utilities (torch version comparison via packaging.version, registry sharding strategy accessor, and formatting/cleanup updates).
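
For orientation, here is a minimal, hypothetical sketch of the build_app-patching approach, assuming vLLM exposes a `build_app` factory in `vllm.entrypoints.openai.api_server` (true of recent versions); `attach_awex_routes` and the bare handlers are illustrative stand-ins for the actual plugin code:

```python
# Hypothetical sketch of wrapping vLLM's ``build_app`` so Awex routes get
# registered when no shared router is available. ``attach_awex_routes`` is
# illustrative; the real implementation lives in awex/vllm_plugin.py.
from fastapi import FastAPI


def attach_awex_routes(app: FastAPI) -> None:
    @app.post("/areal_awex_init")
    async def awex_init():  # request models omitted for brevity
        ...

    @app.post("/areal_awex_update")
    async def awex_update():
        ...


def patch_build_app() -> None:
    import vllm.entrypoints.openai.api_server as api_server

    original_build_app = api_server.build_app

    def build_app_with_awex(*args, **kwargs) -> FastAPI:
        app = original_build_app(*args, **kwargs)
        attach_awex_routes(app)  # register routes on the finished app
        return app

    # Replacing the module attribute also affects internal callers, since
    # they resolve ``build_app`` through the module's globals at call time.
    api_server.build_app = build_app_with_awex
```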
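
And a minimal sketch of the non-contiguous receive pattern the test now mirrors, assuming a torch.distributed NCCL group is already initialized; `recv_into` is an illustrative name, not the repo's API:

```python
# Illustrative sketch (not the repo's code): NCCL requires contiguous
# buffers, so a non-contiguous destination is filled through a contiguous
# staging tensor followed by an in-place copy.
import torch
import torch.distributed as dist


def recv_into(dst: torch.Tensor, src_rank: int) -> None:
    if dst.is_contiguous():
        dist.recv(dst, src=src_rank)
        return
    staging = torch.empty_like(dst, memory_format=torch.contiguous_format)
    dist.recv(staging, src=src_rank)
    dst.copy_(staging)  # scatter back into the non-contiguous view
```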

@garrett4wade (Contributor, Author)

Test results (docker image): [screenshot attached]

@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements compatibility for newer vLLM versions by dynamically importing components and patching the build_app function to register Awex routes. It also refines PyTorch version checks and adds support for non-contiguous tensors in weight synchronization tests. Review feedback identifies a security vulnerability regarding unauthenticated endpoints and suggests a more direct method for retrieving the PyTorch version.

Comment thread: awex/vllm_plugin.py, lines 459 to 460

```python
@router.post("/areal_awex_init")
async def awex_init(request: AwexInitRequest, raw_request: Request):
```

Severity: high (security)

The new endpoints /areal_awex_init and /areal_awex_update are registered without any explicit authentication or authorization dependencies. Since these endpoints can trigger significant state changes (like re-initializing the NCCL group or updating model weights), they could be exploited if the vLLM server is exposed. Consider ensuring these routes are protected by the same security mechanisms (e.g., API key checks) used for the standard OpenAI-compatible endpoints.
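
As one possible mitigation (a sketch only, not vLLM's actual auth middleware; `API_KEY` and `verify_api_key` are illustrative names):

```python
# Hypothetical hardening sketch: gate the Awex routes behind the same kind
# of bearer-token check that protects the OpenAI-compatible endpoints.
from fastapi import APIRouter, Depends, Header, HTTPException, Request

API_KEY = "change-me"  # in practice, read from server configuration


async def verify_api_key(authorization: str | None = Header(default=None)) -> None:
    if authorization != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="Unauthorized")


# Attach the dependency at the router level so every route inherits it.
router = APIRouter(dependencies=[Depends(verify_api_key)])


@router.post("/areal_awex_init")
async def awex_init(raw_request: Request):  # request model omitted
    ...
```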

Comment on the changed process-group options check:

```diff
 pg_options_param_name = (
-    "backend_options" if str(torch.__version__) >= "2.6" else "pg_options"
+    "backend_options"
+    if Version(version("torch")) >= Version("2.6")
```

Severity: medium

While using packaging.version.Version correctly fixes the string comparison bug, using importlib.metadata.version("torch") is less direct and potentially less robust than using the __version__ attribute already available on the imported torch module. The metadata query can fail in certain environments (e.g., non-standard installations) even if the module is successfully loaded.

Suggested change:

```diff
-if Version(version("torch")) >= Version("2.6")
+if Version(torch.__version__) >= Version("2.6")
```
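
For context on the underlying bug the packaging.version change fixes: Python compares strings lexicographically, so multi-digit version components sort incorrectly, while Version compares them numerically:

```python
from packaging.version import Version

# String comparison is character-by-character: '1' < '6', so "2.10" sorts
# before "2.6" even though 2.10 is the newer release.
assert ("2.10.0" >= "2.6") is False
assert Version("2.10.0") >= Version("2.6")  # semantic comparison: 2.10 > 2.6
```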

@chaokunyang merged commit 45a917b into inclusionAI:main on Apr 22, 2026. 3 checks passed.