Commit 478ec06

kotlarmilos and Copilot committed
[ci-scanner] collapse to single Known Build Error path
Drop the build-break / infra tracking-issue branches and route every actionable failure (test failure, hang, build break) through the same KBE template. Build Analysis matches both shapes via the JSON body, so a separate tracking-issue path added no value and produced issues that were not picked up by the project board.

- Hard rule rewritten: every actionable failure becomes a Known Build Error issue; infra-only failures with no stable signature skip emission entirely.
- Step 3 reframed as log-extraction guidance only; deadletter and infra-shaped no-helix failures record 'skipped: infra noise — no stable signature' in the tally.
- Step 5 collapsed from A/B/C/D/E/F to A/B/C. Branch A now covers test failures and build breaks (stable = >= 2 occurrences in window OR a build break failing all legs of the current build). Branch B carves out build breaks (no muting path for compile errors). Branch C extended to mechanical build-break fixes.
- KBE title template adds a third form for build breaks.
- Weak signature now skips emission instead of falling through to a tracking issue.
- Tracking issue templates (generic + JIT pipeline) removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 462e28a commit 478ec06

1 file changed

Lines changed: 23 additions & 97 deletions

File tree

.github/workflows/ci-failure-scan.md

Lines changed: 23 additions & 97 deletions
@@ -86,7 +86,7 @@ The agent runs read-only. All writes go through `safe-outputs`.
8686
4. **One area path per issue.** Title each KBE around a single failure shape (assertion text or test family), not a list of pipelines. If a root cause spans multiple area paths, file one KBE per area and cross-link with `Related: dotnet/runtime#<n>`.
8787
5. **No `Mute` / `Muting` in titles.** Use `Skip`, `Disable`, `Suppress`, or `Exclude`.
8888
6. **Every issue and PR title starts with `[ci-scan] `.**
89-
7. **KBEs only for test failures and hangs.** Never for build breaks (Build Analysis cannot match them) or infra failures (no stable signature). Those get tracking issues.
89+
7. **Every actionable failure becomes a `Known Build Error` issue.** Test failures, hangs, AND build breaks all converge on the same KBE template; Build Analysis matches both via the JSON body. Skip emission entirely for: pre-existing issue/PR matches (Step 4.2-4.5), unstable signatures (< 2 occurrences in window with no current-run severity), or true infra noise (agent disconnect, pool offline) where no stable signature can be extracted.
9090
8. **One signature = one outcome.** No duplicate KBEs. No comments on existing KBEs — Build Analysis already counts occurrences in the issue body.
9191
9. **No same-run muting PR.** The KBE issue number is not visible at emit time (no `issues: write`), and the gap between runs is intentional — it forces a human-review window before muting.
9292
10. **All intermediate state under `/tmp/gh-aw/agent/`.** Each bash invocation is a fresh subshell; persist anything you want to keep.
@@ -102,7 +102,6 @@ For every actionable failure, converge on these artifacts:
102102
| Known Build Error issue | First run that sees the failure | Yes |
103103
| Muting PR | First run that finds the KBE already exists | No — intentional next-run cadence |
104104
| Fix PR (optional) | Same run as the muting PR, when the fix fits the small-fix bounds | Same run as muting PR |
105-
| Tracking issue (build break / infra / no stable signature) | First run that sees the failure | Yes |
106105

107106
The `.NET Core Engineering Services: Known Build Errors` org project (`https://github.com/orgs/dotnet/projects/111`) is populated by `net-helix[bot]` automation that watches `dotnet/runtime` for the `Known Build Error` label and adds matching issues to the project within seconds. Build Analysis reads from the project. The only thing this workflow has to do for project linkage is apply the `Known Build Error` label on the KBE; do NOT try to mutate the project from this workflow.
108107

@@ -161,17 +160,17 @@ For each row in the pipeline table below, in order:
161160
| runtime-interpreter | 316 | ADO name differs from display name |
162161
| runtime-libraries-interpreter | 330 | ADO name differs from display name |
163162

164-
### Step 3 — Classify each failure
163+
### Step 3 — Classify each failure (log-extraction only)
165164

166-
Decide the class of every failed timeline record before passing it to Step 4. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.
165+
Classification here drives WHERE the agent reads the signature text from. It does NOT drive WHERE the issue gets filed — every actionable signature flows through Step 4 + Step 5 Branch A. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.
167166

168-
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 5 Branch D (tracking issue). Do NOT file a KBE.
169-
2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task -> classify as build break.
170-
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 4 (test failure).
171-
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 5 Branch E (grouped infra issue).
172-
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 5 Branch E.
167+
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. Read the signature from the failing compile task log (CSxxxx / linker error / cmake error line).
168+
2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task and treat as build break.
169+
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line.
170+
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter`. Extract `[FAIL]` line if present; if not, treat as infra noise (no stable signature) and skip emission entirely — record `skipped: infra noise — no stable signature` in the tally.
171+
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline`. Skip emission entirely — record `skipped: infra noise — no stable signature` in the tally.
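The new Step 3 rules can be read as a small decision function that only selects where the signature text comes from. The sketch below is illustrative: the `Shape` enum, the `classify` helper, and its flattened inputs are hypothetical, it covers only the task names quoted above (not "any pre-test compile step"), and it leaves out rule 2 (phase-only failures), which needs the timeline graph.

```python
from enum import Enum


class Shape(Enum):
    BUILD_BREAK = "build break"
    HELIX_WORK_ITEM = "helix work item"
    INFRA_NOISE = "infra noise"
    UNCLASSIFIED = "unclassified"


# Task names quoted in Step 3; other pre-test compile steps are not enumerated there.
COMPILE_TASKS = {"Build product", "Build native components", "Configure CMake"}
DEADLETTER_MARKER = "helix-workitem-deadletter"


def classify(failed_task: str, send_to_helix: str,
             console_uri: str = "", has_fail_line: bool = False) -> Shape:
    """Map one failed record to the place its signature is read from."""
    # Rule 4: dead-lettered work item; usable only if a [FAIL] line exists.
    if DEADLETTER_MARKER in console_uri:
        return Shape.HELIX_WORK_ITEM if has_fail_line else Shape.INFRA_NOISE
    # Rule 1: a compile step failed and tests were never sent to Helix.
    if failed_task in COMPILE_TASKS and send_to_helix == "skipped":
        return Shape.BUILD_BREAK
    # Rule 3: Helix submission succeeded but the job still failed.
    if send_to_helix == "succeeded":
        return Shape.HELIX_WORK_ITEM
    # Rule 5 (partial): the job never initialized.
    if failed_task == "Initialize job":
        return Shape.INFRA_NOISE
    return Shape.UNCLASSIFIED
```

Only `BUILD_BREAK` and `HELIX_WORK_ITEM` results would go on to Step 4; `INFRA_NOISE` corresponds to the `skipped` tally entry described in rules 4 and 5.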
173172

174-
For each Step 4 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
173+
For each (1)/(2)/(3) signature, compute the tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
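As a rough illustration of the tuple and the look-back count, the sketch below models the five fields as a `NamedTuple` and scans a window of prior builds. The names `Signature` and `summarize_window` are hypothetical, and the input shape (a list of finish-time / signature-set pairs with uniformly formatted ISO 8601 UTC timestamps) is an assumption about how the agent might hold the data.

```python
from typing import NamedTuple, Optional


class Signature(NamedTuple):
    definition_id: int
    work_item_or_phase: str
    queue: str
    stress_mode: str
    error_signature: str  # the [FAIL] line or the compile-error line


def summarize_window(target: Signature,
                     builds: list[tuple[str, set[Signature]]]) -> tuple[Optional[str], int]:
    """Return (first-seen timestamp, occurrence count) for `target`.

    `builds` holds (finish_time, signatures) for the ~10 prior completed builds
    of the same definition. Timestamps are assumed to be ISO 8601 strings in one
    uniform UTC format, so sorting them as strings sorts them chronologically.
    """
    hits = sorted(finished for finished, sigs in builds if target in sigs)
    return (hits[0] if hits else None, len(hits))
```

Because the tuple is hashable, the same signature seen on two different queues or stress modes counts as two distinct signatures, which matches the per-leg scoping the tuple implies.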
175174

176175
#### Data sources
177176

@@ -240,35 +239,25 @@ Optional fifth check when the candidate KBE is older than ~14 days: confirm Buil
240239

241240
### Step 5 — Decide and emit
242241

243-
Exactly one of Branch A / B / D / E / F fires per signature. Branch C is an additive refinement of Branch B (Branch B's outputs are still emitted, plus an additional small-fix PR).
242+
Exactly one of Branch A / B fires per signature. Branch C is an additive refinement of Branch B (Branch B's outputs are still emitted, plus an additional small-fix PR). Signatures that do not match any branch get `skipped: <reason>` in the tally and emit nothing.
244243

245-
**Branch A — No existing KBE; test failure; signature is stable (>= 2 occurrences in window).**
244+
**Branch A — No existing KBE; signature is stable.**
246245

247-
Emit one `create_issue` using the KBE template. Apply the `Known Build Error` label so the org project auto-add rule picks it up; do NOT try to mutate the project from this workflow.
246+
Stable means >= 2 occurrences in the ~10-build window, OR a build break that fails all legs of the current build (block-everyone severity that warrants filing on first sight). Emit one `create_issue` using the KBE template. Apply both `Known Build Error` and `blocking-clean-ci` labels so the org project auto-add rule picks it up; do NOT try to mutate the project from this workflow.
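The stability rule above reduces to a two-clause predicate. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, and the rule does not say whether the current build counts toward the occurrence total, so the caller decides what to pass.

```python
def is_stable(occurrences_in_window: int,
              is_build_break: bool = False,
              failed_legs: int = 0,
              total_legs: int = 0) -> bool:
    """Branch A stability check (hypothetical helper).

    Stable when the signature occurred at least twice in the look-back window,
    or when it is a build break that fails every leg of the current build.
    """
    if occurrences_in_window >= 2:
        return True
    # First-sight filing applies only to build breaks that block all legs.
    return is_build_break and total_legs > 0 and failed_legs == total_legs
```

A test failure seen once returns `False` regardless of leg counts, which corresponds to the `< 2 occurrences and not blocking` skip reason.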
248247

249248
If Step 4.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
250249

251250
**Branch B — Existing KBE; no muting PR; muting is welcome (Step 4.7 clean).**
252251

253252
Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 4.8 four-question block.
254253

255-
**Branch C — Refinement of Branch B when the failure satisfies the small-fix bounds.**
256-
257-
Small-fix bounds: <= 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security; the failing test verifies the fix.
258-
259-
In addition to the Branch B muting PR, emit a separate `create_pull_request` for the fix on its own branch. Body cites (a) failing test as evidence, (b) root cause, (c) why fix is safe, (d) `Linked KBE: #<n>`, (e) "If this lands before #<muting-PR>, that PR can be closed."
260-
261-
**Branch D — Build break.**
262-
263-
Emit one `create_issue` using the Tracking issue template. NO `Known Build Error` label. Reference the failing source file and compile error. If the fix is mechanical and fits the small-fix bounds (obvious typo, missing `#if`, wrong cast, missing `using`), also emit one `create_pull_request` for the fix.
264-
265-
**Branch E — Infra failure cluster.**
254+
Build-break KBEs are not "muted" — there is no test annotation that can skip a compile error. Skip Branch B for build-break signatures (record `skipped: build break — no muting path` in the tally) and rely on Branch C (small-fix PR) when the fix is mechanical, or on the area owner otherwise.
266255

267-
Group all infra failures in this run into ONE tracking issue. Before emitting, `search_issues` for an open issue whose title or body matches the same failure signature; on hit, skip silently (no duplicate, no comment).
256+
**Branch C — Refinement of Branch B when the failure satisfies the small-fix bounds.**
268257

269-
**Branch F — Anything else (no stable signature, multi-assembly cluster, product regression, native crash, JIT/GC product bug).**
258+
Small-fix bounds: <= 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security; the failing test (or compile error) verifies the fix.
270259

271-
Emit one `create_issue` using the Tracking issue template (or the JIT pipeline template for JIT/GC/PGO/stress pipelines). Call out the signature problem in `Recommended action`.
260+
In addition to the Branch B muting PR (test failures) or directly against the existing KBE (build breaks), emit a separate `create_pull_request` for the fix on its own branch. Build-break fixes are limited to obvious mechanical changes (typo, missing `#if`, wrong cast, missing `using`). Body cites (a) failing test or compile error as evidence, (b) root cause, (c) why fix is safe, (d) `Linked KBE: #<n>`, (e) "If this lands before #<muting-PR>, that PR can be closed." (omit (e) for build-break fixes).
272261

273262
After emitting, record the outcome per signature (Step 6).
274263

@@ -282,7 +271,7 @@ Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.
282271

283272
`<outcome>` is one of: `filed-issue #aw_<id>`, `filed-PR #aw_<id>`, `existing-issue #<n>`, `existing-PR #<n>`, `skipped: <reason>`.
284273

285-
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`).
274+
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`, `infra noise — no stable signature`, `build break — no muting path`).
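The five outcome forms and the "skipped needs a reason" requirement can be checked with one pattern. The sketch below is an assumption-laden illustration: `is_valid_outcome` is a hypothetical helper, `<id>` is assumed to be any run of non-whitespace characters, and `<n>` is assumed to be a decimal issue or PR number.

```python
import re

# One alternative per documented outcome form.
OUTCOME_RE = re.compile(
    r"filed-(?:issue|PR) #aw_\S+"      # filed-issue #aw_<id> / filed-PR #aw_<id>
    r"|existing-(?:issue|PR) #\d+"     # existing-issue #<n> / existing-PR #<n>
    r"|skipped: \S.*"                  # skipped: <reason>, reason must be non-empty
)


def is_valid_outcome(outcome: str) -> bool:
    """True when `outcome` matches one of the documented forms exactly."""
    return OUTCOME_RE.fullmatch(outcome) is not None
```

Using `fullmatch` rejects trailing junk after an issue number as well as a bare `skipped:` with no reason.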
286275

287276
At end of run, print this table to the agent log:
288277

@@ -296,9 +285,12 @@ Emit each template verbatim except for `<placeholder>` slots. Match headings exa
296285

297286
### Template: KBE issue body — literal substring match (default)
298287

299-
Title: `[ci-scan] Test failure: <fully.qualified.TestName>` (test failures) or `[ci-scan] Hang: <fully.qualified.TestName>` (hangs/timeouts). Labels: `Known Build Error`, `blocking-clean-ci`.
288+
Title (pick the form matching the signature):
289+
- `[ci-scan] Test failure: <fully.qualified.TestName>` for test failures
290+
- `[ci-scan] Hang: <fully.qualified.TestName>` for hangs / timeouts
291+
- `[ci-scan] Build break: <short error description>` for compile / link / cmake breaks (the body's `## Error Message` JSON still carries the canonical signature for Build Analysis)
300292

301-
KBEs are reserved for test failures and hangs only (per Hard rule #7). Build breaks and infra failures get tracking issues (Branch D / E) without the `Known Build Error` label.
293+
Labels: `Known Build Error`, `blocking-clean-ci`.
302294

303295
````markdown
304296
## Build Information
@@ -428,7 +420,7 @@ Prefer signatures built from, in order:
428420
3. Unique native stack frame or symbol, e.g. `coreclr!Compiler::fgMorphCall + 0x`.
429421
4. Specific JIT method-being-compiled marker + the specific stress mode.
430422

431-
If you cannot produce a signature meeting this bar -> file a tracking issue instead and call out the signature problem in `Recommended action`. Do NOT file a KBE with a weak signature.
423+
If you cannot produce a signature meeting this bar -> skip emission entirely (record `skipped: weak signature` in the tally). Do NOT file a KBE with a weak signature — it will mismatch in Build Analysis and become noise.
432424

433425
### Template: KBE signature — Bad vs Good
434426

@@ -495,72 +487,6 @@ Scope rule (mandatory): condition must be AS NARROW AS the observed failure scop
495487

496488
In the PR `Reasoning` section, list the exact set of failing legs (definition + queue + stress mode) that justifies the chosen condition.
497489

498-
### Template: Tracking issue body — generic
499-
500-
Used for build breaks (Branch D), infra clusters (Branch E), and non-KBE-eligible failures (Branch F). Title per branch (e.g. `[ci-scan] Build break: <pipeline>`, `[ci-scan] Infra: <shape>`).
501-
502-
````markdown
503-
## Reasoning
504-
<short summary of failure shape; why this isn't a PR-related regression>
505-
506-
## Impact on platforms
507-
- <(pipeline + platform/arch + Helix queue + stress mode + exit code) per occurrence>
508-
509-
## Errors log
510-
```
511-
<sanitized excerpt>
512-
```
513-
514-
## First build it occurred
515-
- Build: <link>
516-
- Finished: <UTC timestamp>
517-
- Commit: <sha>
518-
- Occurrences in window: <n>
519-
520-
## Recommended action
521-
<concrete next step: which area owner, which file likely needs the fix, or what investigation would localize the root cause; checkbox-ready task list, not "FYI">
522-
````
523-
524-
No labels. The labeler bot adds `area-*` automatically.
525-
526-
### Template: Tracking issue body — JIT pipeline
527-
528-
Used for tracking issues against JIT/GC/PGO/stress pipelines (definitions 109–160, 230, 235, 108, 137, 144–145, 150, 153). Matches the in-repo JIT convention.
529-
530-
````markdown
531-
**Summary:**
532-
<one-line description of the failure shape>
533-
534-
**Failed in (<N>):**
535-
- [<pipeline name> <build number>](<build url>)
536-
- [<pipeline name> <build number>](<build url>)
537-
- ...
538-
539-
**Console Log:** [Console Log](<one representative helix console log url>)
540-
541-
**Failed tests:**
542-
```
543-
<pipeline-name-1>
544-
- <leg name e.g. net11.0-windows-Release-x64-jitstress2_jitstressregs8-Windows.10.Amd64.Open>
545-
- <test assembly or test name>
546-
<pipeline-name-2>
547-
- <leg name>
548-
- <test assembly>
549-
```
550-
551-
**Error Message:**
552-
```
553-
<canonical error line>
554-
```
555-
556-
**Stack Trace:**
557-
```
558-
<relevant stack trace; trim noise but keep the failing frame>
559-
```
560-
````
561-
562-
Do NOT propose any `area-*` label yourself. Area triage (`area-CodeGen-coreclr` / `area-GC-coreclr` / `area-PGO-coreclr` / `area-Tools-ILVerification`) is added later by a human reviewer.
563-
564490
### Template: Sanitization
565491

566492
When pasting log excerpts into issue/PR bodies, strip:
