Skip to content

Conversation

@usrihari123
Copy link
Contributor

Motivation

Technical Details

This commit removes the 6-bit agent encoding from counter IDs that caused fatal failures in CPX mode when agent logical_node_id exceeded 62. CPX mode (Compute Partitioning) can create 8+ partitions per GPU, easily exceeding
the 63-agent limit in multi-GPU systems.

JIRA ID

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings January 7, 2026 16:29
@usrihari123 usrihari123 requested review from a team as code owners January 7, 2026 16:29
@usrihari123 usrihari123 marked this pull request as draft January 7, 2026 16:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request attempts to fix crashes in CPX (Compute Partitioning) mode by removing the 6-bit agent encoding from counter IDs that was limited to 63 agents. However, the implementation introduces critical bugs that break multi-GPU systems.

Key Changes:

  • Removed agent encoding functions and constants from counter ID structure
  • Replaced agent encoding with a global mapping from base_metric_id to agent_id
  • Updated all counter ID creation/decoding functions to use only base metric IDs

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
id_decode.hpp Removes agent encoding function declarations and constants; updates documentation to reflect removal
id_decode.cpp Removes agent encoding/decoding function implementations
metrics.cpp Simplifies counter ID creation to use only base metric IDs without agent encoding
counters.cpp Introduces global mapping approach to associate metric IDs with agents (contains critical bugs)
sample_processing.cpp Removes loop that set agent encoding in output records
evaluate_ast.cpp Removes DIMENSION_AGENT setting with agent logical_node_id

Critical Issues Found:

  1. Multi-agent collision bug: The global mapping counter_agent_map[metric_id] → agent_id assumes each metric ID uniquely maps to one agent. However, metric IDs are architecture-based, so multiple agents with the same architecture (e.g., two gfx90a GPUs) share the same metric IDs. When rocprofiler_iterate_agent_supported_counters is called for multiple agents with the same architecture, the mapping gets overwritten, breaking subsequent API calls like rocprofiler_query_counter_info.

  2. Thread safety: The global std::unordered_map lacks synchronization, causing undefined behavior under concurrent access from multiple threads.

  3. Trailing whitespace: Minor style issue with trailing whitespace on several lines.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +102 to +107
std::unordered_map<uint64_t, rocprofiler_agent_id_t>&
get_counter_agent_map()
{
static auto* map = new std::unordered_map<uint64_t, rocprofiler_agent_id_t>{};
return *map;
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global mapping from base_metric_id to agent_id has a critical design flaw: when multiple agents share the same architecture (e.g., two GPUs with gfx90a), they will have identical metric IDs. When rocprofiler_iterate_agent_supported_counters is called for each agent, the second call will overwrite the mapping from the first call, causing get_agent_for_counter to always return the last agent that was iterated. This breaks multi-GPU systems where agents share the same architecture.

Additionally, this unordered_map is not thread-safe. Concurrent writes from multiple threads calling rocprofiler_iterate_agent_supported_counters or concurrent read/write access will cause undefined behavior. Consider using a Synchronized wrapper like the pattern at line 67, or using a different approach that doesn't rely on a single global agent per metric ID (since metric IDs are architecture-based, not agent-specific).

Copilot uses AI. Check for mistakes.
Comment on lines +387 to +390

// Store mapping from base metric ID to agent for later lookup
counter_agent_map[metric.id()] = agent_id;

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapping assumes each metric ID uniquely identifies an agent, but metric IDs are architecture-based. When this function is called for multiple agents with the same architecture (e.g., two gfx90a GPUs), the loop will overwrite the mapping for each metric_id, storing only the last agent. This breaks any subsequent calls to rocprofiler_query_counter_info or rocprofiler_iterate_counter_dimensions, which will retrieve the wrong agent information for counters from earlier agents with the same architecture.

Suggested change
// Store mapping from base metric ID to agent for later lookup
counter_agent_map[metric.id()] = agent_id;
// Store mapping from base metric ID to agent for later lookup.
// Avoid overwriting an existing mapping when multiple agents share
// the same architecture-based metric ID.
if(counter_agent_map.find(metric.id()) == counter_agent_map.end())
{
counter_agent_map[metric.id()] = agent_id;
}

Copilot uses AI. Check for mistakes.
Comment on lines +115 to +120

if(auto it = map.find(base_metric_id); it != map.end())
{
return it->second;
}

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 115 and 120 contain trailing whitespace. Consider removing it to maintain code cleanliness and avoid unnecessary diff noise in version control.

Suggested change
if(auto it = map.find(base_metric_id); it != map.end())
{
return it->second;
}
if(auto it = map.find(base_metric_id); it != map.end())
{
return it->second;
}

Copilot uses AI. Check for mistakes.
Comment on lines +387 to +390

// Store mapping from base metric ID to agent for later lookup
counter_agent_map[metric.id()] = agent_id;

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 387 and 390 contain trailing whitespace. Consider removing it to maintain code cleanliness and avoid unnecessary diff noise in version control.

Suggested change
// Store mapping from base metric ID to agent for later lookup
counter_agent_map[metric.id()] = agent_id;
// Store mapping from base metric ID to agent for later lookup
counter_agent_map[metric.id()] = agent_id;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants