chore: agent-readiness pass + alerts.log writer + daemon→run rename#31
chore: agent-readiness pass + alerts.log writer + daemon→run rename#31
Conversation
The workflow ran `go test ./e2e/host/...` against a path deleted in 63a396a. Every nightly cron run would fail. Drop it; the underlying daemon-host test scaffold is gone and re-adding it is out of scope.
- New .git-hooks/pre-push runs `go run ./cmd/verify` (fmt + lint + test + 80% coverage gate) before push. Activate per-clone with `git config core.hooksPath .git-hooks`. - AGENTS.md: bump documented Go version to match .tool-versions (1.26), promote `go run ./cmd/verify` as the canonical gate, drop the misleading `go vet` line, replace the fictional `make install` with the actual build command, document HEALTHD_CONFIG.
- `healthd daemon install` was removed in 63a396a (process-compose now manages the daemon), but the quickstart still listed it. - The macOS LaunchAgent PATH paragraph was tied to that removed install command. - `docs/testing.md` was deleted in 0669eb8 but the link remained. - Removed the optional host-e2e snippet whose target tests were deleted in 63a396a (matching CI workflow removal in fe95f2b).
The TUI's "Recent Alerts" panel reads ~/.local/state/healthd/alerts.log via internal/tui/alerts.go, but nothing in the codebase ever wrote that file — the writer half of the feature shipped in PR #29 was never implemented. This adds a new internal/alertlog package as the canonical writer: - DefaultPath() — single source of truth for the log location - Append() — atomic append, creates parent dir, sanitizes newlines - FormatLine() — produces lines that match alertLinePattern in the TUI The daemon's RunLoop calls alertlog.Append on every emitted transition event, regardless of whether notify backends are configured. Writes are best-effort: errors log to stderr and don't fail the loop. internal/tui/alerts.go now delegates DefaultAlertsLogPath to alertlog so reader and writer can never drift on the path. Daemon coverage test now sets HOME to a tempdir and asserts the expected `[crit] failing` line lands in alerts.log, giving us end-to-end coverage of the writer.
The healthd-operator skill referenced subcommands that were removed in 63a396a — daemon install/status/logs/uninstall no longer exist; only `daemon run` remains. Daemon lifecycle is now owned by an external supervisor (process-compose / systemd / launchd). - SKILL.md: rewrite Step 6 (Daemon install/status/logs) and Step 7 (Rollback) for the foreground+supervisor model; add Step 7 for alert history inspection. Drop launchd-PATH troubleshooting entry (no longer healthd's responsibility) and replace with a generic supervisor-PATH note. Add an alerts.log inspection tip now that the log is actually written. - openclaw-integration.md: same surgery on the OpenClaw flow.
The `daemon` command path was a vestige from when healthd managed its own LaunchAgent lifecycle. That code was removed in 63a396a — supervision is now external (process-compose / systemd / launchd). The remaining "daemon run" subcommand was just running a foreground loop, so the naming became misleading. - internal/daemon/ → internal/loop/; package daemon → package loop; RunLoop → Run; matching test name updates. - cmd/daemon.go → cmd/run.go; removed the parent `daemon` command; exposed `healthd run` at the top level. - ARCHITECTURE.md mermaid diagram and component table updated to use the loop package and `healthd run` invocation. - README, AGENTS.md, operator skill, and openclaw integration doc all updated to reference the new command path. Breaking change for anyone invoking `healthd daemon run` directly — update supervisor configs to `healthd run`.
The total-coverage gate at 80% lets weak new packages slip in. The recently-added alertlog package shipped at 77.8% — fine on the total, but a per-package check would have flagged it for review. Adds a per-package floor (default 70%) that fails the verifier when any tested package with statements drops below the threshold. Skips packages with no statements (e2e/cli, integration). Both thresholds are flag-tunable: --min-coverage and --min-package-coverage.
Adds .github/pull_request_template.md with the structure already practiced in recent PRs: Summary, Changes, Validation, Risks, Linked Issues. The Validation section nudges contributors and agents toward the canonical proof points (verify, pre-push hook, tests, smoke, docs).
The go.mod `go` directive is `1.24.0`, which is the language version contract. `.tool-versions` pins the dev toolchain to 1.26.0 (a newer toolchain compiling code that targets 1.24). AGENTS.md previously said "Go 1.26", conflating the two. Changed to "Go 1.24+" with a parenthetical pointing at .tool-versions for the toolchain pin.
Review surfaced two real risks in the per-package gate added in d2b488f: 1. parsePackageCoverage silently dropped non-matching lines, so a future change to `go test` output format would let the gate vacuously pass ("all packages ≥ 70.0%" while measuring nothing). 2. The function had zero unit tests, despite cmd/verify being excluded from its own coverage gate. - Added a "parsed > 0 when expected > 0" sanity check that fails the verifier loudly if no `ok ... coverage:` lines match. Lines for packages with no _test.go files (no `ok` prefix) are tolerated as expected partial misses. - Added table-driven tests covering normal, cached, no-statements, mixed, partial-match, zero-match, and trailing-whitespace cases. Also normalized status indicators from ❌/✅ to ✗/✓ to match the unicode style established in 7991d2b.
Cleanup reviewer flagged two speculative seams inherited from the old cmd/daemon.go that the rename to cmd/run.go preserved unnecessarily: - `var runLoop = loop.Run` was a stub-point with no test stubbing it (grep confirms zero callers besides the one inline use). Inlined back to `loop.Run(...)`. - `loadConfigForRun` returned the resolved config path, but the only caller discarded it. Dropped the path from the signature. Per project doctrine: extension points are earned by real use-cases, not speculation.
There was a problem hiding this comment.
Pull request overview
This PR updates healthd’s long-running execution model and operational tooling by introducing a canonical alerts.log writer, renaming the daemon loop command/package to reflect supervisor-owned lifecycle, and tightening verification via a per-package coverage gate.
Changes:
- Add
internal/alertlogand wireloop.Runto append an alerts transition line on every state change (independent of notifier config). - Rename
healthd daemon run→healthd runandinternal/daemon→internal/loop; update docs/operator skill accordingly. - Enhance
cmd/verifywith per-package coverage threshold parsing + tests; add pre-push hook and PR template; remove stale macOS host-e2e workflow.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| skills/healthd-operator/references/openclaw-integration.md | Updates operator flow to supervisor-based lifecycle + alerts.log inspection. |
| skills/healthd-operator/SKILL.md | Refreshes skill description/workflow/troubleshooting for healthd run + alerts history. |
| internal/tui/alerts.go | Delegates alerts log path to internal/alertlog to prevent reader/writer drift. |
| internal/loop/runtime.go | Renames run loop and appends alert transitions to alerts.log. |
| internal/loop/runtime_test.go | Updates tests to call loop.Run and new naming. |
| internal/loop/coverage_test.go | Extends branch-coverage test to assert alerts.log writes. |
| internal/alertlog/alertlog.go | Introduces canonical alerts.log path + format + append helper. |
| internal/alertlog/alertlog_test.go | Adds tests for alert log formatting, append behavior, and TUI parse round-trip. |
| docs/ARCHITECTURE.md | Aligns architecture diagram and component list with loop/alerts.log. |
| cmd/verify/main.go | Adds per-package coverage threshold gate and parsing of go test output. |
| cmd/verify/main_test.go | Adds table-driven tests for parsePackageCoverage. |
| cmd/run.go | Adds top-level healthd run Cobra command using internal/loop. |
| cmd/root.go | Registers run command and removes daemon command. |
| cmd/daemon.go | Deletes deprecated daemon command implementation. |
| README.md | Updates quickstart to use healthd run; removes stale daemon/host-e2e notes. |
| AGENTS.md | Updates contributor commands, hook enablement, and running instructions. |
| .github/workflows/e2e-host-macos.yml | Removes stale workflow for deleted host e2e tests. |
| .github/pull_request_template.md | Adds PR checklist aligned with new verify/hook workflow. |
| .git-hooks/pre-push | Adds pre-push hook to run verifier. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func FormatLine(t time.Time, state, checkName, group, reason string) string { | ||
| return fmt.Sprintf( | ||
| "%s [%s] %s (%s) - %s", | ||
| t.UTC().Format(time.RFC3339), | ||
| state, | ||
| strings.TrimSpace(checkName), | ||
| strings.TrimSpace(group), | ||
| sanitizeReason(reason), | ||
| ) |
There was a problem hiding this comment.
FormatLine claims to produce output that matches the TUI parser, but checkName and group are only TrimSpace’d. If either contains embedded newlines or the delimiter characters used by alertLinePattern (notably "(" in checkName or ")" in group), the resulting line becomes unparsable and the status UI will silently drop it. Consider sanitizing checkName/group (collapse CR/LF like reason, and/or replace/escape parentheses/brackets) or enforce an allowed character set during config validation so the writer truly guarantees parseable output.
There was a problem hiding this comment.
Addressed in commit 2b13992 — chose parse-don't-validate at the config boundary instead of escaping at write time. New validateAlertSafeIdentifier rejects (, ), [, ], and embedded newlines in check.name and check.group at config load, with table-driven tests. The writer remains simple and the operator-facing log lines stay clean.
| return fmt.Errorf("create alerts log dir: %w", err) | ||
| } | ||
|
|
||
| file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644) |
There was a problem hiding this comment.
alerts.log is created with mode 0644, which makes the log world-readable on multi-user systems. Since this file can contain check names/groups and failure reasons, it’s safer to default to 0600 (and rely on umask only as a further restriction).
| file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644) | |
| file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600) |
There was a problem hiding this comment.
Addressed in commit b52a281 — file mode is now 0o600.
| scanner := bufio.NewScanner(strings.NewReader(output)) | ||
| for scanner.Scan() { | ||
| line := strings.TrimSpace(scanner.Text()) | ||
| if matches := packageCoverageLine.FindStringSubmatch(line); len(matches) == 3 { | ||
| coverage, err := strconv.ParseFloat(matches[2], 64) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("parse coverage for %s: %w", matches[1], err) | ||
| } | ||
| covered = append(covered, packageCoverage{Name: matches[1], Coverage: coverage}) | ||
| continue | ||
| } | ||
| if packageNoStatementsLine.MatchString(line) { | ||
| noStatements++ | ||
| } | ||
| } | ||
|
|
||
| if expected > 0 && len(covered)+noStatements == 0 { | ||
| return nil, fmt.Errorf("parsed coverage for 0 of %d packages; `go test` output format may have changed", expected) | ||
| } | ||
| return covered, nil |
There was a problem hiding this comment.
parsePackageCoverage doesn’t check scanner.Err() after the scan loop. If scanning fails (e.g., an unexpectedly long line exceeds the Scanner token limit), the verifier could silently treat it as “no matches” or miss packages. Add a scanner.Err() check and return a clear error so the coverage gate fails loudly on read/scan issues.
There was a problem hiding this comment.
Addressed in commit fd1d22a — parsePackageCoverage now checks scanner.Err() after the loop and surfaces a "scan go test output" error.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c81dd2d128
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "%s [%s] %s (%s) - %s", | ||
| t.UTC().Format(time.RFC3339), | ||
| state, | ||
| strings.TrimSpace(checkName), | ||
| strings.TrimSpace(group), |
There was a problem hiding this comment.
Escape parsed delimiters before writing alerts.log
FormatLine emits checkName and group verbatim into a line format that the TUI parser expects as ... ([group]) ... with checkName matched by ([^(]+) and group by ([^)]*). Because config validation allows arbitrary non-empty names/groups, a check name containing ( or a group containing ) will produce log lines that LoadRecentAlerts cannot parse, so transitions silently disappear from healthd status. Please escape/encode these fields (or widen the parser contract) before writing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 2b13992 — same fix as Copilot's parallel finding. Validation pushed to the config boundary (parse-don't-validate) rather than escape-at-write, so the rule is enforced once at load with a clear error and the writer stays trivial.
| packageCoverageLine = regexp.MustCompile(`^ok\s+(\S+)\s+\S+\s+coverage:\s+([0-9.]+)% of statements$`) | ||
| packageNoStatementsLine = regexp.MustCompile(`^ok\s+(\S+)\s+\S+\s+coverage:\s+\[no statements\]$`) |
There was a problem hiding this comment.
Parse non-
ok coverage lines in package coverage gate
The per-package coverage parser only recognizes lines that start with ok, so lines like github.com/... coverage: 0.0% of statements (the no-_test.go case already covered in main_test.go) are dropped from pkgCovs. That means packages with executable code but no tests can bypass --min-package-coverage, which undermines the new per-package guardrail. Parse those bare coverage lines (or explicitly fail on them) so uncovered packages are enforced consistently.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Not actioning — the per-package gate is intentionally scoped to tested packages. The only bare-coverage line in this repo today is the root package main, which is a 4-line Cobra wrapper exercised transitively through cmd/... (currently 83% covered). Forcing the parser to enforce bare lines would force a contrived test for func main() (which can't be called without os.Exit) or a refactor into a run() int indirection — both worse than the current bypass for one wrapper package.
The real risk this would catch is "someone adds a new untested package with statements" — that's a code-review concern, not something the verifier should police.
Documented this design decision via the existing comment on parsePackageCoverage: "Packages with no _test.go files print a non-matching bare line with no "ok" prefix, so a partial match count is normal — only zero matches is a red flag."
Copilot review flagged that the log file was world-readable (0o644). Alert lines contain check names, groups, and failure reasons — operationally sensitive on multi-user systems. Tighten to owner-only read/write; rely on umask only as a further restriction.
Copilot review flagged that bufio.Scanner errors (e.g., line exceeding the default 64KiB token limit) would silently truncate the parsed output, potentially hiding packages from the per-package gate. Surface scanner errors so the verifier fails loudly on read/scan issues.
Copilot, Codex, and an internal silent-failures review all flagged the
same gap: alertlog.FormatLine writes check names and groups verbatim
into a line shape (`time [state] name (group) - reason`) that the TUI
parser regex `^(\S+) \[([^\]]+)\] ([^(]+) \(([^)]*)\) - (.*)$` cannot
recover if the name contains '(' or the group contains ')'. Transitions
would silently disappear from `healthd status`.
Rather than escape at write time (which still leaves operator-facing
log lines with surprise characters), parse-don't-validate at the
config boundary: reject '(', ')', '[', ']', and embedded newlines in
check names and groups. Surfaces a clear error at config load with
the offending field and character.
Tests: table-driven cases for each rejected character, plus a happy
path with `disk-root` and `host services` (space allowed).
Summary
Session-long readiness pass on healthd. Three substantive changes plus cleanup: (1) wire up the alerts.log writer the TUI was already reading from but nothing was producing, (2) rename
healthd daemon run→healthd runandinternal/daemon→internal/loop(lifecycle is process-compose's job now, the old naming was misleading), (3) add a pre-push verifier hook and a per-package coverage gate so weak new packages can't slip in. Doc drift fixed across the board.Changes
feat(alertlog)— newinternal/alertlogpackage, single source of truth for the log path + format.loop.Runcallsalertlog.Appendon every transition regardless of notifier config; best-effort with stderr error logging. TUI'sDefaultAlertsLogPathdelegates so reader/writer can't drift.refactor—internal/daemon→internal/loop,RunLoop→Run,cmd/daemon.go→cmd/run.go. Top-levelhealthd runreplaceshealthd daemon run. Breaking change — supervisor configs need updating.chore(verify)— added per-package coverage gate (default 70%) with sanity check that fails loudly ifgo testoutput format ever shifts. Table-driven tests for the parser (9 cases). Status indicators standardized to ✓/✗.chore—.git-hooks/pre-pushrunninggo run ./cmd/verify;.github/pull_request_template.md.ci— deleted.github/workflows/e2e-host-macos.yml(was running tests againste2e/host/...which was deleted in63a396a; would have failed every nightly run).docs— README, AGENTS.md, ARCHITECTURE.md, operator skill all reconciled. Removed staledaemon install/status/logs/uninstallreferences, LaunchAgent paragraph,make install(no Makefile),docs/testing.md(deleted in0669eb8).Validation
go run ./cmd/verifypasses locally — total 83.7%, all packages ≥ 70%[crit] failingline lands inalerts.log; per-package parser has 9 unit testshealthd runagainst a temp config with one passing + one failing check, confirmed alerts.log got a real RFC3339 line andhealthd statusrendered it backhealthd --helpno longer listsdaemon;healthd daemon runerrors out with "unknown command";healthd runexists at top levelRisks
healthd daemon rundirectly. Update process-compose / systemd / launchd configs tohealthd run.TestRunAdditionalBranchesis a real-clock test (45ms ticker window); the new alerts.log assertion increases work in that window. Hasn't flaked locally but worth watching in CI. Splitting + clock-injection is a deferred follow-up.Linked Issues
None.
Note
Medium Risk
Medium risk due to a breaking CLI rename (
healthd daemon run→healthd run) that requires supervisor config updates, plus new alerts.log writing and config validation that can change runtime behavior/fail previously-accepted configs.Overview
This PR wires the runtime loop to persist transition events to
~/.local/state/healthd/alerts.logvia a newinternal/alertlogwriter (shared with the TUI), and adds config validation to reject checkname/groupvalues that would break alerts parsing.It renames the foreground daemon command from
healthd daemon runto top-levelhealthd runand renames the underlying package/API frominternal/daemon.RunLooptointernal/loop.Run, updating tests and docs accordingly.Developer/CI hygiene is tightened by adding a pre-push hook and PR template, enhancing
cmd/verifywith a per-package coverage gate (and tests for its parser), and removing the obsolete macOS host e2e workflow.Reviewed by Cursor Bugbot for commit 2b13992. Bugbot is set up for automated code reviews on this repo. Configure here.