[release/cvs-0.2.0] verify_lib: surface oversubscription as WARN, not silent #179
Open
speriaswamy-amd wants to merge 1 commit into
Per AMD docs the runlist/VM-context oversubscription messages are real perf-degrading conditions (HW scheduler round-robins queues, blocking the GPU for ms-scale windows), not benign info. The earlier patch dropped them from the failure regex entirely, which suppressed the signal: a perf run in a degraded HW state would now be marked PASS and its bandwidth deltas silently trusted.

Add a warn_patterns_dict alongside err_patterns_dict; verify_dmesg_for_errors now scans both. Err matches still fail_test (no behavior change). Warn matches emit log.warning with a "perf numbers may not be trustworthy" note so the run is visibly flagged in std.log without failing the collective itself.

Backward compatible: function signature and return shape unchanged, all other callers untouched.

Ref: https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/conceptual/oversubscription.html
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Follow-up on #170 (already merged). #170 dropped the amdgpu runlist oversubscription patterns from the failure regex entirely, which made affected runs PASS but silently discarded a real perf signal. Per AMD docs, `Runlist is getting oversubscribed` / `Expect reduced ROCm performance` indicate that the HW scheduler is round-robining queues and that inactive queues can block the GPU for ms-scale windows. The collective completes correctly, but the perf numbers from that run shouldn't be trusted blindly for regression comparisons.

This PR demotes those patterns to a non-fatal WARN bucket:

- Adds `warn_patterns_dict` in `cvs/lib/verify_lib.py` alongside `err_patterns_dict`.
- `verify_dmesg_for_errors` now scans both. Err matches still call `fail_test` (no behavior change). Warn matches emit `log.warning(...)` with a "perf numbers from this run may not be trustworthy" note: the test still passes, but the run carries a visible WARN line in `std.log`.
- `verify_dmesg_for_errors` signature and return shape are unchanged. The 7+ existing callers (rccl_perf, rccl_regression, ib_perf, sglang, megatron, jax, inference) are untouched.

Mirror of #169 against `main`.

Test plan

- Run `rccl_perf` from this branch on the validation cluster; confirm a WARN line is emitted on a known-oversubscribed config without failing the test.

Refs: AIMVT-175
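For reviewers who want a concrete picture of the two-bucket scan described in the summary, here is a minimal sketch. It is not the code in `cvs/lib/verify_lib.py`: the function name `scan_dmesg`, its return shape, the dictionary keys, and the placeholder entry in `err_patterns_dict` are all assumptions made for illustration. Only the two warn strings and the wording of the note come from this PR.

```python
import logging
import re

log = logging.getLogger("verify_sketch")  # hypothetical logger name

# Placeholder fatal bucket; the real err_patterns_dict contents are not shown here.
err_patterns_dict = {
    "example_fatal": r"EXAMPLE FATAL PATTERN",
}

# Non-fatal bucket, using the two message fragments quoted in this PR.
warn_patterns_dict = {
    "runlist_oversubscribed": r"Runlist is getting oversubscribed",
    "reduced_rocm_perf": r"Expect reduced ROCm performance",
}


def scan_dmesg(dmesg_text):
    """Scan dmesg text against both buckets.

    Returns (err_hits, warn_hits), each a list of (pattern_name, line).
    Hypothetical helper: the real verify_dmesg_for_errors keeps its own
    signature and calls fail_test on err matches.
    """
    err_hits, warn_hits = [], []
    for line in dmesg_text.splitlines():
        for name, pattern in err_patterns_dict.items():
            if re.search(pattern, line):
                err_hits.append((name, line))
        for name, pattern in warn_patterns_dict.items():
            if re.search(pattern, line):
                warn_hits.append((name, line))

    # Warn matches are logged but never fail the run.
    for name, line in warn_hits:
        log.warning(
            "dmesg WARN [%s]: %s (perf numbers from this run may not be trustworthy)",
            name,
            line,
        )
    return err_hits, warn_hits
```

In this sketch a caller would fail the test only when `err_hits` is non-empty; a run with only `warn_hits` still passes but leaves a WARN line in the log.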
Made with Cursor