[ci-scanner] collapse to single Known Build Error path
Drop the build-break / infra tracking-issue branches and route every
actionable failure (test failure, hang, build break) through the same
KBE template. Build Analysis matches both shapes via the JSON body, so
a separate tracking-issue path added no value and produced issues that
were not picked up by the project board.
- Hard rule rewritten: every actionable failure becomes a Known Build
Error issue; infra-only failures with no stable signature skip
emission entirely.
- Step 3 reframed as log-extraction guidance only; deadletter and
infra-shaped no-helix failures record 'skipped: infra noise — no
stable signature' in the tally.
- Step 5 collapsed from A/B/C/D/E/F to A/B/C. Branch A now covers test
failures and build breaks (stable = >= 2 occurrences in window OR a
build break failing all legs of the current build). Branch B carves
out build breaks (no muting path for compile errors). Branch C
extended to mechanical build-break fixes.
- KBE title template adds a third form for build breaks.
- Weak signature now skips emission instead of falling through to a
tracking issue.
- Tracking issue templates (generic + JIT pipeline) removed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.github/workflows/ci-failure-scan.md
23 additions & 97 deletions
@@ -86,7 +86,7 @@ The agent runs read-only. All writes go through `safe-outputs`.
 4. **One area path per issue.** Title each KBE around a single failure shape (assertion text or test family), not a list of pipelines. If a root cause spans multiple area paths, file one KBE per area and cross-link with `Related: dotnet/runtime#<n>`.
 5. **No `Mute` / `Muting` in titles.** Use `Skip`, `Disable`, `Suppress`, or `Exclude`.
 6. **Every issue and PR title starts with `[ci-scan] `.**
-7. **KBEs only for test failures and hangs.** Never for build breaks (Build Analysis cannot match them) or infra failures (no stable signature). Those get tracking issues.
+7. **Every actionable failure becomes a `Known Build Error` issue.** Test failures, hangs, AND build breaks all converge on the same KBE template; Build Analysis matches both via the JSON body. Skip emission entirely for: pre-existing issue/PR matches (Step 4.2-4.5), unstable signatures (< 2 occurrences in window with no current-run severity), or true infra noise (agent disconnect, pool offline) where no stable signature can be extracted.
 8. **One signature = one outcome.** No duplicate KBEs. No comments on existing KBEs — Build Analysis already counts occurrences in the issue body.
 9. **No same-run muting PR.** The KBE issue number is not visible at emit time (no `issues: write`), and the gap between runs is intentional — it forces a human-review window before muting.
 10. **All intermediate state under `/tmp/gh-aw/agent/`.** Each bash invocation is a fresh subshell; persist anything you want to keep.
@@ -102,7 +102,6 @@ For every actionable failure, converge on these artifacts:
 | Known Build Error issue | First run that sees the failure | Yes |
 | Muting PR | First run that finds the KBE already exists | No — intentional next-run cadence |
 | Fix PR (optional) | Same run as the muting PR, when the fix fits the small-fix bounds | Same run as muting PR |
-| Tracking issue (build break / infra / no stable signature) | First run that sees the failure | Yes |

 The `.NET Core Engineering Services: Known Build Errors` org project (`https://github.com/orgs/dotnet/projects/111`) is populated by `net-helix[bot]` automation that watches `dotnet/runtime` for the `Known Build Error` label and adds matching issues to the project within seconds. Build Analysis reads from the project. The only thing this workflow has to do for project linkage is apply the `Known Build Error` label on the KBE; do NOT try to mutate the project from this workflow.

@@ -161,17 +160,17 @@ For each row in the pipeline table below, in order:
 | runtime-interpreter | 316 | ADO name differs from display name |
 | runtime-libraries-interpreter | 330 | ADO name differs from display name |

-### Step 3 — Classify each failure
+### Step 3 — Classify each failure (log-extraction only)

-Decide the class of every failed timeline record before passing it to Step 4. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.
+Classification here drives WHERE the agent reads the signature text from. It does NOT drive WHERE the issue gets filed — every actionable signature flows through Step 4 + Step 5 Branch A. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.

-1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 5 Branch D (tracking issue). Do NOT file a KBE.
-2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task -> classify as build break.
-3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 4 (test failure).
-4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 5 Branch E (grouped infra issue).
-5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 5 Branch E.
+1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. Read the signature from the failing compile task log (CSxxxx / linker error / cmake error line).
+2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task and treat as build break.
+3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line.
+4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter`. Extract `[FAIL]` line if present; if not, treat as infra noise (no stable signature) and skip emission entirely — record `skipped: infra noise — no stable signature` in the tally.
+5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline`. Skip emission entirely — record `skipped: infra noise — no stable signature` in the tally.
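
The rewritten Step 3 list only decides where the signature text is read from. A minimal Python sketch of that routing follows, assuming a hypothetical `FailedRecord` shape; the field names, the `"skipped"` / `"succeeded"` status strings, and the `skipped: unclassified` fallback are illustrative and not part of the workflow. Case (2), the phase-only failure, is not modeled here because it needs the timeline walk.

```python
from dataclasses import dataclass
from typing import Optional

# Task names and markers are taken from the Step 3 list in the diff.
COMPILE_TASKS = {"Build product", "Build native components", "Configure CMake"}
INFRA_MARKERS = ("Initialize job", "agent disconnect", "Pool is offline")
SKIP_INFRA = "skipped: infra noise — no stable signature"


@dataclass
class FailedRecord:
    """Hypothetical flattened view of one failed timeline record."""
    failed_task: str                 # name of the failed Task
    send_to_helix: str               # assumed status string: "succeeded" or "skipped"
    console_uri: str = ""            # Helix console log URI, if any
    fail_line: Optional[str] = None  # extracted "[FAIL]" line, if any


def signature_source(rec: FailedRecord) -> str:
    """Return where to read the signature from, or a tally skip reason."""
    # (4) Dead-lettered work item: usable only if a [FAIL] line was extracted.
    if "helix-workitem-deadletter" in rec.console_uri:
        return "helix console log" if rec.fail_line else SKIP_INFRA
    # (5) Infra-shaped failure: never emits.
    if any(marker in rec.failed_task for marker in INFRA_MARKERS):
        return SKIP_INFRA
    # (1) Build break: compile step failed and Helix submission was skipped.
    if rec.failed_task in COMPILE_TASKS and rec.send_to_helix == "skipped":
        return "compile task log"
    # (3) Helix work-item failure: submission succeeded but the Job failed.
    if rec.send_to_helix == "succeeded":
        return "helix console log"
    # Fallback reason is an assumption of this sketch.
    return "skipped: unclassified"
```

Checking the dead-letter and infra cases first keeps a noisy record from being read as a test failure just because `Send to Helix` succeeded.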

-For each Step 4 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
+For each (1)/(2)/(3) signature, compute the tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
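
The signature tuple and the lookback count can be sketched as below. This is an illustration only: the `Signature` type, the `occurrences_in_window` helper, and the idea of representing each prior build as a set of signatures are assumptions, and the sketch takes no position on whether the current build counts toward the total.

```python
from typing import NamedTuple, Sequence, Set


class Signature(NamedTuple):
    """The five-part tuple named in Step 3."""
    definition_id: int
    work_item_or_phase: str
    queue: str
    stress_mode: str
    error_signature: str  # the [FAIL] line or the compile-error line


def occurrences_in_window(
    sig: Signature,
    prior_builds: Sequence[Set[Signature]],
    window: int = 10,
) -> int:
    """Count how many of the newest `window` builds contain `sig`.

    `prior_builds` is assumed to be ordered newest first, one set of
    signatures per completed build in the same definition.
    """
    return sum(1 for build in prior_builds[:window] if sig in build)
```

Because the tuple is hashable, two failures with the same `[FAIL]` text on different queues or stress modes stay distinct signatures.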

 #### Data sources

@@ -240,35 +239,25 @@ Optional fifth check when the candidate KBE is older than ~14 days: confirm Buil

 ### Step 5 — Decide and emit

-Exactly one of Branch A / B / D / E / F fires per signature. Branch C is an additive refinement of Branch B (Branch B's outputs are still emitted, plus an additional small-fix PR).
+Exactly one of Branch A / B fires per signature. Branch C is an additive refinement of Branch B (Branch B's outputs are still emitted, plus an additional small-fix PR). Signatures that do not match any branch get `skipped: <reason>` in the tally and emit nothing.

-**Branch A — No existing KBE; test failure; signature is stable (>= 2 occurrences in window).**
+**Branch A — No existing KBE; signature is stable.**

-Emit one `create_issue` using the KBE template. Apply the `Known Build Error` label so the org project auto-add rule picks it up; do NOT try to mutate the project from this workflow.
+Stable means >= 2 occurrences in the ~10-build window, OR a build break that fails all legs of the current build (block-everyone severity that warrants filing on first sight). Emit one `create_issue` using the KBE template. Apply both `Known Build Error` and `blocking-clean-ci` labels so the org project auto-add rule picks it up; do NOT try to mutate the project from this workflow.

 If Step 4.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
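
The Branch A stability test added in this hunk reduces to a two-clause predicate. A minimal sketch, assuming the caller already knows the occurrence count and the leg totals; the function and parameter names are hypothetical:

```python
def is_stable(
    occurrences: int,
    is_build_break: bool = False,
    failed_legs: int = 0,
    total_legs: int = 0,
) -> bool:
    """Stable = >= 2 occurrences in the window, OR a build break that
    fails every leg of the current build."""
    if occurrences >= 2:
        return True
    # First-sight filing applies only to build breaks that block all legs.
    return is_build_break and total_legs > 0 and failed_legs == total_legs
```

The `total_legs > 0` guard is a choice of this sketch: it stops a build with no recorded legs from counting as "all legs failed".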

 **Branch B — Existing KBE; no muting PR; muting is welcome (Step 4.7 clean).**

 Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 4.8 four-question block.

-**Branch C — Refinement of Branch B when the failure satisfies the small-fix bounds.**
-
-Small-fix bounds: <= 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security; the failing test verifies the fix.
-
-In addition to the Branch B muting PR, emit a separate `create_pull_request` for the fix on its own branch. Body cites (a) failing test as evidence, (b) root cause, (c) why fix is safe, (d) `Linked KBE: #<n>`, (e) "If this lands before #<muting-PR>, that PR can be closed."
-
-**Branch D — Build break.**
-
-Emit one `create_issue` using the Tracking issue template. NO `Known Build Error` label. Reference the failing source file and compile error. If the fix is mechanical and fits the small-fix bounds (obvious typo, missing `#if`, wrong cast, missing `using`), also emit one `create_pull_request` for the fix.
-
-**Branch E — Infra failure cluster.**
+Build-break KBEs are not "muted" — there is no test annotation that can skip a compile error. Skip Branch B for build-break signatures (record `skipped: build break — no muting path` in the tally) and rely on Branch C (small-fix PR) when the fix is mechanical, or on the area owner otherwise.

-Group all infra failures in this run into ONE tracking issue. Before emitting, `search_issues` for an open issue whose title or body matches the same failure signature; on hit, skip silently (no duplicate, no comment).
+**Branch C — Refinement of Branch B when the failure satisfies the small-fix bounds.**

-**Branch F — Anything else (no stable signature, multi-assembly cluster, product regression, native crash, JIT/GC product bug).**
+Small-fix bounds: <= 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security; the failing test (or compile error) verifies the fix.

-Emit one `create_issue` using the Tracking issue template (or the JIT pipeline template for JIT/GC/PGO/stress pipelines). Call out the signature problem in `Recommended action`.
+In addition to the Branch B muting PR (test failures) or directly against the existing KBE (build breaks), emit a separate `create_pull_request` for the fix on its own branch. Build-break fixes are limited to obvious mechanical changes (typo, missing `#if`, wrong cast, missing `using`). Body cites (a) failing test or compile error as evidence, (b) root cause, (c) why fix is safe, (d) `Linked KBE: #<n>`, (e) "If this lands before #<muting-PR>, that PR can be closed." (omit (e) for build-break fixes).
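
The small-fix bounds that gate Branch C can be written as a single check. The sketch below assumes a hypothetical `FixCandidate` record; how the agent decides which areas a change touches is outside the diff and is simply passed in as labels here.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Areas the small-fix bounds exclude, as listed in Branch C.
EXCLUDED_AREAS = frozenset({"API", "JIT-codegen", "GC", "threading", "security"})


@dataclass(frozen=True)
class FixCandidate:
    """Hypothetical summary of a proposed fix."""
    lines_changed: int
    files_changed: int
    areas_touched: FrozenSet[str] = frozenset()
    verified_by_failure: bool = False  # failing test or compile error verifies the fix


def fits_small_fix_bounds(fix: FixCandidate) -> bool:
    """True when the fix is <= 20 lines, single file, touches no
    excluded area, and is verified by the original failure."""
    return (
        fix.lines_changed <= 20
        and fix.files_changed == 1
        and not (fix.areas_touched & EXCLUDED_AREAS)
        and fix.verified_by_failure
    )
```

For build-break signatures the diff adds a further restriction to obvious mechanical changes, which this predicate does not attempt to judge.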

 After emitting, record the outcome per signature (Step 6).

@@ -282,7 +271,7 @@ Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.

 `<outcome>` is one of: `filed-issue #aw_<id>`, `filed-PR #aw_<id>`, `existing-issue #<n>`, `existing-PR #<n>`, `skipped: <reason>`.

-A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`).
+A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`, `infra noise — no stable signature`, `build break — no muting path`).
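
The five outcome forms, including the rule that a skip must carry a reason, can be checked with one regular expression. This is a sketch under two assumptions the diff does not settle: `<id>` is any run of non-whitespace characters, and `<n>` is a decimal issue or PR number.

```python
import re

# One alternative per outcome form listed in Step 6.
OUTCOME_RE = re.compile(
    r"filed-(?:issue|PR) #aw_\S+"    # filed-issue #aw_<id>, filed-PR #aw_<id>
    r"|existing-(?:issue|PR) #\d+"   # existing-issue #<n>, existing-PR #<n>
    r"|skipped: \S.*"                # skipped: <reason>, reason must be non-empty
)


def is_valid_outcome(outcome: str) -> bool:
    """True when `outcome` is exactly one of the allowed forms."""
    return OUTCOME_RE.fullmatch(outcome) is not None
```

A check like this could run before each line is appended to the coverage file, so a bare `skipped` never reaches the end-of-run table.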

 At end of run, print this table to the agent log:

@@ -296,9 +285,12 @@ Emit each template verbatim except for `<placeholder>` slots. Match headings exa

 ### Template: KBE issue body — literal substring match (default)

-Title: `[ci-scan] Test failure: <fully.qualified.TestName>` (test failures) or `[ci-scan] Hang: <fully.qualified.TestName>` (hangs/timeouts). Labels: `Known Build Error`, `blocking-clean-ci`.
+Title (pick the form matching the signature):
+- `[ci-scan] Test failure: <fully.qualified.TestName>` for test failures
+- `[ci-scan] Hang: <fully.qualified.TestName>` for hangs / timeouts
+- `[ci-scan] Build break: <short error description>` for compile / link / cmake breaks (the body's `## Error Message` JSON still carries the canonical signature for Build Analysis)

-KBEs are reserved for test failures and hangs only (per Hard rule #7). Build breaks and infra failures get tracking issues (Branch D / E) without the `Known Build Error` label.
+Labels: `Known Build Error`, `blocking-clean-ci`.
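
The three title forms, together with the hard rules that every title starts with `[ci-scan] ` and never contains `Mute` / `Muting`, can be combined in one small helper. The `kind` keys and the function name are hypothetical; only the title text comes from the template.

```python
import re

TITLE_PREFIX = "[ci-scan] "
# Keys are illustrative; values are the three forms in the KBE template.
TITLE_FORMS = {
    "test-failure": "Test failure",
    "hang": "Hang",
    "build-break": "Build break",
}
_FORBIDDEN = re.compile(r"\bmut(?:e|ing)\b", re.IGNORECASE)


def kbe_title(kind: str, subject: str) -> str:
    """Build a KBE title; `subject` is the fully qualified test name or
    a short error description for build breaks."""
    title = f"{TITLE_PREFIX}{TITLE_FORMS[kind]}: {subject}"
    if _FORBIDDEN.search(title):
        raise ValueError("titles must not contain 'Mute' or 'Muting'")
    return title
```

An unknown `kind` raises `KeyError`, which matches the template's intent that only these three forms are used.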

 ````markdown
 ## Build Information
@@ -428,7 +420,7 @@ Prefer signatures built from, in order:
 3. Unique native stack frame or symbol, e.g. `coreclr!Compiler::fgMorphCall + 0x`.
 4. Specific JIT method-being-compiled marker + the specific stress mode.

-If you cannot produce a signature meeting this bar -> file a tracking issue instead and call out the signature problem in `Recommended action`. Do NOT file a KBE with a weak signature.
+If you cannot produce a signature meeting this bar -> skip emission entirely (record `skipped: weak signature` in the tally). Do NOT file a KBE with a weak signature — it will mismatch in Build Analysis and become noise.

 ### Template: KBE signature — Bad vs Good

@@ -495,72 +487,6 @@ Scope rule (mandatory): condition must be AS NARROW AS the observed failure scop

 In the PR `Reasoning` section, list the exact set of failing legs (definition + queue + stress mode) that justifies the chosen condition.

-### Template: Tracking issue body — generic
-
-Used for build breaks (Branch D), infra clusters (Branch E), and non-KBE-eligible failures (Branch F). Title per branch (e.g. `[ci-scan] Build break: <pipeline>`, `[ci-scan] Infra: <shape>`).
-
-````markdown
-## Reasoning
-<short summary of failure shape; why this isn't a PR-related regression>
-<concrete next step: which area owner, which file likely needs the fix, or what investigation would localize the root cause; checkbox-ready task list, not "FYI">
-````
-
-No labels. The labeler bot adds `area-*` automatically.
-
-### Template: Tracking issue body — JIT pipeline
-
-Used for tracking issues against JIT/GC/PGO/stress pipelines (definitions 109–160, 230, 235, 108, 137, 144–145, 150, 153). Matches the in-repo JIT convention.
-<relevant stack trace; trim noise but keep the failing frame>
-```
-````
-
-Do NOT propose any `area-*` label yourself. Area triage (`area-CodeGen-coreclr` / `area-GC-coreclr` / `area-PGO-coreclr` / `area-Tools-ILVerification`) is added later by a human reviewer.
-
 ### Template: Sanitization

 When pasting log excerpts into issue/PR bodies, strip: