[Feature]: Add record_param_comms instrumentation to communication kernels for profiler metadata #1946

@ajassani

Summary

When analyzing PyTorch profiler traces from end-to-end workloads (vLLM, SGLang, etc.), aiter's custom communication operations (allreduce, reduce_scatter, all_gather, etc.) lack the profiler metadata that PyTorch's native NCCL collectives provide.

Background

PyTorch's ProcessGroupNCCL uses the RECORD_PARAM_COMMS_DATA macro to instrument collective operations. This creates record_param_comms events in profiler traces that include valuable metadata:

  • Collective name (allreduce, reduce_scatter, etc.)
  • Rank and world size
  • Input/output tensor sizes (nelems)
  • Data type
  • Process group information

This metadata is essential for performance analysis of end-to-end workloads such as vLLM and SGLang.
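As a rough illustration of why these fields matter, the element count and data type together give the message size of a collective, which is usually the first number needed when reasoning about communication cost. The sketch below is an assumption-laden example, not part of aiter or PyTorch: `DTYPE_BYTES` and `message_bytes` are hypothetical names, and the field keys are taken from the example event shown under Verification.

```python
# Hypothetical lookup table: bytes per element for a few dtype names
# as they would appear in the "dtype" field of a record_param_comms event.
DTYPE_BYTES = {"Half": 2, "BFloat16": 2, "Float": 4, "Double": 8}


def message_bytes(args: dict) -> int:
    """Return the input message size in bytes for one record_param_comms event.

    `args` is the "args" dictionary of the event. The key names are assumed
    to match the example event in this issue.
    """
    return args["In msg nelems"] * DTYPE_BYTES[args["dtype"]]


# Usage: an allreduce over 1048576 Half elements.
size = message_bytes({"In msg nelems": 1048576, "dtype": "Half"})
```

Without the `record_param_comms` event, this size has to be inferred from kernel names or launch parameters, which is fragile for custom kernels.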

Proposed Solution

Add RECORD_PARAM_COMMS_DATA instrumentation to aiter's communication functions, similar to PyTorch's implementation in ProcessGroupNCCL.cpp.

Example implementation (see linked PR for all_reduce): #1944

Note: This is just a reference example to demonstrate the approach. The aiter team knows the codebase best and should implement this in whatever way makes sense for the project.

Verification

With instrumentation added, profiler traces show record_param_comms events:

{
  "name": "record_param_comms",
  "args": {
    "Collective name": "allreduce",
    "Process Group Name": "aiter_custom_ar",
    "Rank": 0,
    "Group size": 2,
    "In msg nelems": 1048576,
    "Out msg nelems": 1048576,
    "dtype": "Half"
  }
}
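One way to check an exported trace for these events is to filter it by event name. The following is a minimal sketch that assumes the trace is in Chrome trace format, either an object with a `traceEvents` list or a bare list of events; `find_param_comms_events` is a hypothetical helper, not an existing aiter or PyTorch function.

```python
import json


def find_param_comms_events(trace):
    """Return the "args" dict of every record_param_comms event in a trace.

    `trace` is the parsed JSON of a profiler trace: either a dict with a
    "traceEvents" list or a bare list of event dicts.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    return [
        event.get("args", {})
        for event in events
        if event.get("name") == "record_param_comms"
    ]


# Usage with a small in-memory trace; a real trace would be read with
# json.load() from the file written by the profiler.
sample = json.loads("""
{"traceEvents": [
  {"name": "some_other_kernel", "args": {}},
  {"name": "record_param_comms",
   "args": {"Collective name": "allreduce", "Rank": 0, "Group size": 2,
            "In msg nelems": 1048576, "Out msg nelems": 1048576,
            "dtype": "Half"}}
]}
""")
comms = find_param_comms_events(sample)
```

An empty result on a trace that is known to contain aiter collectives would indicate that the instrumentation is missing or not being recorded.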

References

PyTorch implementation (pinned to commit a3bda4952b315b33297ea901a7c55c344ac7ff4c):

Operating System

No response

GPU

No response

ROCm Component

No response
