Description
Summary
In PyTorch profiler traces captured from end-to-end workloads (vLLM, SGLang, etc.), aiter's custom communication operations (allreduce, reduce_scatter, all_gather, etc.) lack the profiler metadata that PyTorch's native NCCL collectives provide.
Background
PyTorch's ProcessGroupNCCL uses the RECORD_PARAM_COMMS_DATA macro to instrument collective operations. This creates record_param_comms events in profiler traces that include valuable metadata:
- Collective name (allreduce, reduce_scatter, etc.)
- Rank and world size
- Input/output tensor sizes (nelems)
- Data type
- Process group information
This metadata is essential for performance analysis of end-to-end workloads like vLLM and SGLang; the sketch below shows the native behavior for reference.
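A minimal sketch (an illustration, not part of the proposal) that captures one of these events from a native PyTorch collective. It assumes a CUDA/ROCm build of PyTorch with NCCL/RCCL available, and uses a single-rank process group only to stay self-contained:

# Profile a native all_reduce and export the trace; the resulting JSON
# contains a record_param_comms event with the metadata listed above.
import os

import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

x = torch.ones(1 << 20, dtype=torch.half, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # ProcessGroupNCCL::allreduce invokes RECORD_PARAM_COMMS_DATA internally.
    dist.all_reduce(x)

prof.export_chrome_trace("native_allreduce_trace.json")
dist.destroy_process_group()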
Proposed Solution
Add RECORD_PARAM_COMMS_DATA instrumentation to aiter's communication functions, similar to PyTorch's implementation in ProcessGroupNCCL.cpp.
Example implementation (see linked PR for all_reduce): #1944
Note: This is just a reference example to demonstrate the approach. The aiter team knows the codebase best and should implement this in whatever way makes sense for the project.
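Until native instrumentation lands, a rough user-level stopgap (a sketch only, not a substitute for the proposed fix) is to wrap the aiter call in torch.profiler.record_function and encode the metadata in the range name. Note that aiter_all_reduce below is a hypothetical placeholder for the real aiter entry point, and this produces a plain named range, not a structured record_param_comms event:

# Stopgap sketch only: put collective metadata into a profiler range name.
import torch
from torch.profiler import record_function

def aiter_all_reduce(tensor):
    # Hypothetical placeholder standing in for aiter's custom all-reduce.
    return tensor

def annotated_all_reduce(tensor, rank, group_size):
    label = (
        f"aiter_all_reduce(rank={rank}, group_size={group_size}, "
        f"nelems={tensor.numel()}, dtype={tensor.dtype})"
    )
    with record_function(label):
        return aiter_all_reduce(tensor)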
Verification
With instrumentation added, profiler traces show record_param_comms events:
{
  "name": "record_param_comms",
  "args": {
    "Collective name": "allreduce",
    "Process Group Name": "aiter_custom_ar",
    "Rank": 0,
    "Group size": 2,
    "In msg nelems": 1048576,
    "Out msg nelems": 1048576,
    "dtype": "Half"
  }
}
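These events can also be checked programmatically. The sketch below scans an exported chrome trace ("trace.json" is an assumed path; substitute whatever was passed to export_chrome_trace) and prints the metadata of each record_param_comms event:

# Minimal sketch: list record_param_comms events from an exported trace.
import json

with open("trace.json") as f:
    trace = json.load(f)

for event in trace.get("traceEvents", []):
    if event.get("name") == "record_param_comms":
        args = event.get("args", {})
        print(
            args.get("Collective name"),
            args.get("Process Group Name"),
            "rank", args.get("Rank"),
            "in nelems", args.get("In msg nelems"),
            args.get("dtype"),
        )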
References
PyTorch implementation (pinned to commit a3bda4952b315b33297ea901a7c55c344ac7ff4c):
- Macro definition: ParamCommsUtils.hpp#L136-L179
- Usage in allreduce: ProcessGroupNCCL.cpp#L4508-L4524
- Example PR with a working implementation (just an example for all_reduce, not a complete implementation): #XXX
Operating System
No response
GPU
No response
ROCm Component
No response