Description
Summary
In PyTorch profiler traces captured from end-to-end workloads (vLLM, SGLang, etc.), aiter's custom communication operations (allreduce, reduce_scatter, all_gather, etc.) lack the profiler metadata that PyTorch's native NCCL collectives provide.
Background
PyTorch's ProcessGroupNCCL uses the RECORD_PARAM_COMMS_DATA macro to instrument collective operations. This creates record_param_comms events in profiler traces that include valuable metadata:
- Collective name (allreduce, reduce_scatter, etc.)
- Rank and world size
- Input/output tensor sizes (nelems)
- Data type
- Process group information
This metadata is essential for performance analysis of end-to-end workloads like vLLM and SGLang; the sketch below shows the native behavior for reference.
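A minimal sketch (an illustration, not part of the proposal) that captures one of these events from a native PyTorch collective. It assumes a CUDA/ROCm build of PyTorch with NCCL/RCCL available, and uses a single-rank process group only to stay self-contained:

# Profile a native all_reduce and export the trace; the resulting JSON
# contains a record_param_comms event with the metadata listed above.
import os

import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

x = torch.ones(1 << 20, dtype=torch.half, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # ProcessGroupNCCL::allreduce invokes RECORD_PARAM_COMMS_DATA internally.
    dist.all_reduce(x)

prof.export_chrome_trace("native_allreduce_trace.json")
dist.destroy_process_group()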
Proposed Solution
Add RECORD_PARAM_COMMS_DATA instrumentation to aiter's communication functions, similar to PyTorch's implementation in ProcessGroupNCCL.cpp.
Example implementation (see linked PR for all_reduce): #1944
Note: This is just a reference example to demonstrate the approach. The aiter team knows the codebase best and should implement this in whatever way makes sense for the project.
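Until native instrumentation lands, a rough user-level stopgap (a sketch only, not a substitute for the proposed fix) is to wrap the aiter call in torch.profiler.record_function and encode the metadata in the range name. Note that aiter_all_reduce below is a hypothetical placeholder for the real aiter entry point, and this produces a plain named range, not a structured record_param_comms event:

# Stopgap sketch only: put collective metadata into a profiler range name.
import torch
from torch.profiler import record_function

def aiter_all_reduce(tensor):
    # Hypothetical placeholder standing in for aiter's custom all-reduce.
    return tensor

def annotated_all_reduce(tensor, rank, group_size):
    label = (
        f"aiter_all_reduce(rank={rank}, group_size={group_size}, "
        f"nelems={tensor.numel()}, dtype={tensor.dtype})"
    )
    with record_function(label):
        return aiter_all_reduce(tensor)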
Verification
With instrumentation added, profiler traces show record_param_comms events:
{
  "name": "record_param_comms",
  "args": {
    "Collective name": "allreduce",
    "Process Group Name": "aiter_custom_ar",
    "Rank": 0,
    "Group size": 2,
    "In msg nelems": 1048576,
    "Out msg nelems": 1048576,
    "dtype": "Half"
  }
}
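These events can also be checked programmatically. The sketch below scans an exported chrome trace ("trace.json" is an assumed path; substitute whatever was passed to export_chrome_trace) and prints the metadata of each record_param_comms event:

# Minimal sketch: list record_param_comms events from an exported trace.
import json

with open("trace.json") as f:
    trace = json.load(f)

for event in trace.get("traceEvents", []):
    if event.get("name") == "record_param_comms":
        args = event.get("args", {})
        print(
            args.get("Collective name"),
            args.get("Process Group Name"),
            "rank", args.get("Rank"),
            "in nelems", args.get("In msg nelems"),
            args.get("dtype"),
        )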
References
PyTorch implementation (pinned to commit a3bda4952b315b33297ea901a7c55c344ac7ff4c):
- Macro definition: ParamCommsUtils.hpp#L136-L179
- Usage in allreduce: ProcessGroupNCCL.cpp#L4508-L4524
- Example PR with a working implementation (just an example for all_reduce, not a complete implementation): #XXX
Operating System
No response
GPU
No response
ROCm Component
No response