-
Notifications
You must be signed in to change notification settings - Fork 101
Fix CPX mode crash by removing agent encoding from counter IDs #2522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request attempts to fix crashes in CPX (Compute Partitioning) mode by removing the 6-bit agent encoding from counter IDs that was limited to 63 agents. However, the implementation introduces critical bugs that break multi-GPU systems.
Key Changes:
- Removed agent encoding functions and constants from counter ID structure
- Replaced agent encoding with a global mapping from base_metric_id to agent_id
- Updated all counter ID creation/decoding functions to use only base metric IDs
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| id_decode.hpp | Removes agent encoding function declarations and constants; updates documentation to reflect removal |
| id_decode.cpp | Removes agent encoding/decoding function implementations |
| metrics.cpp | Simplifies counter ID creation to use only base metric IDs without agent encoding |
| counters.cpp | Introduces global mapping approach to associate metric IDs with agents (contains critical bugs) |
| sample_processing.cpp | Removes loop that set agent encoding in output records |
| evaluate_ast.cpp | Removes DIMENSION_AGENT setting with agent logical_node_id |
Critical Issues Found:
-
Multi-agent collision bug: The global mapping
counter_agent_map[metric_id] → agent_idassumes each metric ID uniquely maps to one agent. However, metric IDs are architecture-based, so multiple agents with the same architecture (e.g., two gfx90a GPUs) share the same metric IDs. Whenrocprofiler_iterate_agent_supported_countersis called for multiple agents with the same architecture, the mapping gets overwritten, breaking subsequent API calls likerocprofiler_query_counter_info. -
Thread safety: The global
std::unordered_maplacks synchronization, causing undefined behavior under concurrent access from multiple threads. -
Trailing whitespace: Minor style issue with trailing whitespace on several lines.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::unordered_map<uint64_t, rocprofiler_agent_id_t>& | ||
| get_counter_agent_map() | ||
| { | ||
| static auto* map = new std::unordered_map<uint64_t, rocprofiler_agent_id_t>{}; | ||
| return *map; | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The global mapping from base_metric_id to agent_id has a critical design flaw: when multiple agents share the same architecture (e.g., two GPUs with gfx90a), they will have identical metric IDs. When rocprofiler_iterate_agent_supported_counters is called for each agent, the second call will overwrite the mapping from the first call, causing get_agent_for_counter to always return the last agent that was iterated. This breaks multi-GPU systems where agents share the same architecture.
Additionally, this unordered_map is not thread-safe. Concurrent writes from multiple threads calling rocprofiler_iterate_agent_supported_counters or concurrent read/write access will cause undefined behavior. Consider using a Synchronized wrapper like the pattern at line 67, or using a different approach that doesn't rely on a single global agent per metric ID (since metric IDs are architecture-based, not agent-specific).
|
|
||
| // Store mapping from base metric ID to agent for later lookup | ||
| counter_agent_map[metric.id()] = agent_id; | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mapping assumes each metric ID uniquely identifies an agent, but metric IDs are architecture-based. When this function is called for multiple agents with the same architecture (e.g., two gfx90a GPUs), the loop will overwrite the mapping for each metric_id, storing only the last agent. This breaks any subsequent calls to rocprofiler_query_counter_info or rocprofiler_iterate_counter_dimensions, which will retrieve the wrong agent information for counters from earlier agents with the same architecture.
| // Store mapping from base metric ID to agent for later lookup | |
| counter_agent_map[metric.id()] = agent_id; | |
| // Store mapping from base metric ID to agent for later lookup. | |
| // Avoid overwriting an existing mapping when multiple agents share | |
| // the same architecture-based metric ID. | |
| if(counter_agent_map.find(metric.id()) == counter_agent_map.end()) | |
| { | |
| counter_agent_map[metric.id()] = agent_id; | |
| } |
|
|
||
| if(auto it = map.find(base_metric_id); it != map.end()) | ||
| { | ||
| return it->second; | ||
| } | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 115 and 120 contain trailing whitespace. Consider removing it to maintain code cleanliness and avoid unnecessary diff noise in version control.
| if(auto it = map.find(base_metric_id); it != map.end()) | |
| { | |
| return it->second; | |
| } | |
| if(auto it = map.find(base_metric_id); it != map.end()) | |
| { | |
| return it->second; | |
| } |
|
|
||
| // Store mapping from base metric ID to agent for later lookup | ||
| counter_agent_map[metric.id()] = agent_id; | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 387 and 390 contain trailing whitespace. Consider removing it to maintain code cleanliness and avoid unnecessary diff noise in version control.
| // Store mapping from base metric ID to agent for later lookup | |
| counter_agent_map[metric.id()] = agent_id; | |
| // Store mapping from base metric ID to agent for later lookup | |
| counter_agent_map[metric.id()] = agent_id; |
Motivation
Technical Details
This commit removes the 6-bit agent encoding from counter IDs that caused fatal failures in CPX mode when agent logical_node_id exceeded 62. CPX mode (Compute Partitioning) can create 8+ partitions per GPU, easily exceeding
the 63-agent limit in multi-GPU systems.
JIRA ID
Test Plan
Test Result
Submission Checklist