
[release/cvs-0.2.0] verify_lib: surface oversubscription as WARN, not silent #179

Open: speriaswamy-amd wants to merge 1 commit into release/cvs-0.2.0 from cherry-pick/cvs-0.2.0/oversubscription-warn


@speriaswamy-amd
Contributor

Summary

Follow-up on #170 (already merged). #170 dropped the amdgpu runlist oversubscription patterns from the failure regex entirely, which made affected runs PASS but silently discarded a real perf signal. Per AMD docs, the messages "Runlist is getting oversubscribed" and "Expect reduced ROCm performance" indicate that the HW scheduler is round-robining queues and that inactive queues can block the GPU for ms-scale windows. The collective completes correctly, but the perf numbers from that run shouldn't be trusted blindly for regression comparisons.

This PR demotes those patterns to a non-fatal WARN bucket:

  • New module-level warn_patterns_dict in cvs/lib/verify_lib.py alongside err_patterns_dict.
  • verify_dmesg_for_errors now scans both. Err matches still call fail_test (no behavior change). Warn matches emit log.warning(...) with a "perf numbers from this run may not be trustworthy" note: the test still passes, but the run carries a visible WARN line in std.log.
  • Backward-compatible: verify_dmesg_for_errors signature and return shape unchanged. The 7+ existing callers (rccl_perf, rccl_regression, ib_perf, sglang, megatron, jax, inference) are untouched.
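
For readers without the diff open, the two-bucket scan described above can be sketched roughly as follows. This is a minimal illustration, not the code in cvs/lib/verify_lib.py: the function name `scan_dmesg_sketch`, its parameters, its return value, the dictionary keys, and the placeholder error pattern are all assumptions made for this example. Only the names `err_patterns_dict` and `warn_patterns_dict`, the two oversubscription message fragments, and the fail-versus-warn behavior come from the description.

```python
import logging
import re

log = logging.getLogger("verify_lib_sketch")

# Hypothetical contents: the real dictionaries live in cvs/lib/verify_lib.py.
err_patterns_dict = {
    "placeholder_fatal": r"EXAMPLE FATAL PATTERN",  # stand-in, not a real kernel message
}
warn_patterns_dict = {
    "runlist_oversubscribed": r"Runlist is getting oversubscribed",
    "reduced_rocm_perf": r"Expect reduced ROCm performance",
}


def scan_dmesg_sketch(dmesg_text, fail_test):
    """Scan dmesg text against both buckets.

    Err matches are passed to fail_test; warn matches are only logged.
    Returning the hits is a convenience of this sketch and is not a claim
    about what verify_dmesg_for_errors returns.
    """
    err_hits, warn_hits = [], []
    for line in dmesg_text.splitlines():
        for name, pattern in err_patterns_dict.items():
            if re.search(pattern, line):
                err_hits.append((name, line))
        for name, pattern in warn_patterns_dict.items():
            if re.search(pattern, line):
                warn_hits.append((name, line))

    # Warn bucket: flag the run visibly, but do not fail it.
    for name, line in warn_hits:
        log.warning(
            "dmesg WARN (%s): %s; perf numbers from this run may not be trustworthy",
            name,
            line,
        )
    # Err bucket: same treatment as before, every match fails the test.
    for name, line in err_hits:
        fail_test(f"dmesg ERROR ({name}): {line}")
    return err_hits, warn_hits


# Usage: an illustrative dmesg line that hits only the warn bucket.
failures = []
sample = "amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.\n"
errs, warns = scan_dmesg_sketch(sample, failures.append)
```

Keeping the warn patterns in their own dictionary means a pattern can be promoted or demoted between buckets without touching the scan loop or any caller.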

Mirror of #169 against main.

Test plan

  • Syntax check passes.
  • Smoke-run rccl_perf from this branch on the validation cluster; confirm a WARN line is emitted on a known-oversubscribed config without failing the test.

Refs: AIMVT-175

Made with Cursor

Per AMD docs the runlist/VM-context oversubscription messages are real
perf-degrading conditions (HW scheduler round-robins queues, blocking the
GPU for ms-scale windows), not benign info. Earlier patch dropped them
from the failure regex entirely, which suppressed the signal: a perf run
in a degraded HW state would now be marked PASS and its bandwidth deltas
silently trusted.

Add a warn_patterns_dict alongside err_patterns_dict; verify_dmesg_for_errors
now scans both. Err matches still fail_test (no behavior change). Warn matches
emit log.warning with a "perf numbers may not be trustworthy" note so the
run is visibly flagged in std.log without failing the collective itself.

Backward compatible: function signature and return shape unchanged, all
other callers untouched.

Ref: https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/conceptual/oversubscription.html
Co-authored-by: Cursor <cursoragent@cursor.com>