@bwelton bwelton commented Jan 7, 2026

Summary

Backport of #2469 to release/rocm-rel-7.2

This change adds PTL (Peak Tops Limiter) control support for MI300 GPUs and fixes a counter ID lookup issue in RDC.

rocprofiler-sdk changes:

  • Add KFD_IOC_PROFILER_PTL_CONTROL ioctl operation and kfd_ioctl_ptl_control struct to kfd_ioctl.h for PTL enable/disable control
  • Add ptl_control_supported(), counter_collection_ptl_disable(), and counter_collection_ptl_enable() functions to ioctl.cpp/hpp
  • Integrate PTL disable in configure_dispatch() for dispatch counter collection
  • Add PTL disable on start and PTL enable on stop for device counting service in device_counting.cpp
  • Print strerror(errno) instead of -1 for better error messages
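
The start/stop pairing and the `strerror(errno)` reporting described above can be illustrated with a small scope guard. This is only a sketch under stated assumptions: `ptl_set_fn`, `describe_failure`, and `ptl_scope` are hypothetical names, and the callback stands in for whatever wrapper actually issues `KFD_IOC_PROFILER_PTL_CONTROL`. The layout of `kfd_ioctl_ptl_control` is not shown in this PR description, so it is deliberately not reproduced here.

```cpp
#include <cerrno>
#include <cstring>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the real ioctl wrapper: returns 0 on success,
// or -1 with errno set, mirroring the ioctl(2) convention.
using ptl_set_fn = std::function<int(bool enable)>;

// Builds a message from strerror(err) rather than printing the raw -1.
inline std::string describe_failure(const char* what, int err)
{
    return std::string(what) + " failed: " + std::strerror(err);
}

// Scope guard: disables PTL when counting starts and re-enables it on stop.
// If the disable call fails (e.g. unsupported kernel or hardware), the guard
// records the reason and does nothing on destruction.
class ptl_scope
{
public:
    explicit ptl_scope(ptl_set_fn set_ptl)
    : m_set(std::move(set_ptl))
    {
        if(m_set(false) == 0)
            m_disabled = true;
        else
            m_message = describe_failure("PTL disable", errno);
    }

    ~ptl_scope()
    {
        // Only restore PTL if this guard actually turned it off.
        if(m_disabled) m_set(true);
    }

    ptl_scope(const ptl_scope&) = delete;
    ptl_scope& operator=(const ptl_scope&) = delete;

    bool               disabled() const { return m_disabled; }
    const std::string& message() const { return m_message; }

private:
    ptl_set_fn  m_set;
    bool        m_disabled = false;
    std::string m_message;
};
```

A caller would construct a `ptl_scope` when the counting service starts and let it go out of scope on stop. On hardware without PTL support, `disabled()` returns false and `message()` carries the `strerror` text, so counter collection can continue unaffected.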

RDC changes:

  • Fix counter ID lookup by using per-instance cache with direct SDK query instead of global map
  • Add mutex for thread-safe id_to_name cache access
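
The per-instance cache with a mutex-protected `id_to_name` map might look roughly like the following. It is a minimal sketch, not the RDC implementation: `counter_name_cache` and `query_fn` are hypothetical names, and the callback stands in for the direct SDK query mentioned above.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Per-instance ID-to-name cache. Each owner holds its own cache object,
// so no global map is shared between instances.
class counter_name_cache
{
public:
    // Hypothetical query hook: returns the counter name for an ID,
    // or std::nullopt if the ID is unknown.
    using query_fn = std::function<std::optional<std::string>(std::uint64_t)>;

    explicit counter_name_cache(query_fn query)
    : m_query(std::move(query))
    {}

    std::optional<std::string> name_of(std::uint64_t id)
    {
        // One lock covers lookup, query, and insert. This keeps the sketch
        // simple at the cost of serializing concurrent misses.
        std::lock_guard<std::mutex> lock(m_mutex);

        auto it = m_id_to_name.find(id);
        if(it != m_id_to_name.end()) return it->second;

        auto name = m_query(id);
        if(name) m_id_to_name.emplace(id, *name);  // cache successful lookups only
        return name;
    }

private:
    query_fn                                       m_query;
    std::mutex                                     m_mutex;
    std::unordered_map<std::uint64_t, std::string> m_id_to_name;
};
```

Failed lookups are not cached in this sketch, so an ID that becomes resolvable later is queried again on the next call.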

Test plan

  • Verify PTL control works on MI300 GPUs
  • Verify counter collection continues to work on older GPUs without PTL support
  • Verify RDC counter ID lookup works correctly

Cherry-picked from: #2469

🤖 Generated with Claude Code

bwelton and others added 4 commits January 7, 2026 11:54
This change adds PTL (Peak Tops Limiter) control support for MI300 GPUs
and fixes a counter ID lookup issue in RDC.

rocprofiler-sdk changes:
- Add KFD_IOC_PROFILER_PTL_CONTROL ioctl operation and kfd_ioctl_ptl_control
  struct to kfd_ioctl.h for PTL enable/disable control
- Add ptl_control_supported(), counter_collection_ptl_disable(), and
  counter_collection_ptl_enable() functions to ioctl.cpp/hpp
- Enable counter_collection_device_unlock() function (was commented out)
- Integrate PTL disable in configure_dispatch() for dispatch counter collection
- Add PTL disable on start and PTL enable on stop for device counting service
  in device_counting.cpp

RDC changes:
- Fix counter ID lookup by using base metric ID (lower 16 bits) instead
  of full handle which includes dimension encoding
- Use global thread-safe counter name map for proper ID-to-name resolution
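
The base metric ID extraction described in this commit message amounts to masking off everything above the lower 16 bits. A minimal sketch follows; `base_metric_id` and `base_metric_mask` are hypothetical names, and the example handle value is made up, since the exact dimension encoding in the upper bits is not specified here.

```cpp
#include <cstdint>

// Lower 16 bits hold the base metric ID (per the commit message above);
// the remaining bits are assumed to carry dimension encoding.
constexpr std::uint64_t base_metric_mask = 0xFFFF;

// Strips the dimension encoding and returns only the base metric ID.
constexpr std::uint64_t base_metric_id(std::uint64_t handle)
{
    return handle & base_metric_mask;
}

// Compile-time check with an illustrative (made-up) handle value.
static_assert(base_metric_id(0x000300000000002AULL) == 0x2A,
              "upper bits must not affect the base metric ID");
```
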

The PTL control failure is expected on older kernels/hardware that don't
support this feature, so INFO is more appropriate than WARNING.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@JeniferC99 JeniferC99 merged commit 5037e43 into release/rocm-rel-7.2 Jan 8, 2026
26 of 41 checks passed
@JeniferC99 JeniferC99 deleted the bewelton/rdc-rocprof-ptl-7.2 branch January 8, 2026 21:09