Use this checklist after a campaign or capacity run completes. It is intentionally read-only: finalize and report commands should write artifacts, but analysis should not mutate provider monitors, fleet services, or target state.
Use the Reporting Standard as the artifact contract for what must exist in the report directory before starting this analysis.
Before interpreting results, make sure the run directory has the artifacts that make the conclusions reproducible:
- `run.meta.tsv` with the exact campaign window.
- `report.md` and `report.json` for scenario accuracy.
- `capacity.md`, `capacity.json`, and `capacity.txt` when `uptime-bench-finalize -capacity` was used or a Jetmon capacity suite was run.
- Raw exports for `scenario_runs`, `ground_truth_events`, `monitor_reports`, and `derived_metrics` when the run was produced from a live database.
- A redacted service snapshot showing enabled services, monitor IDs or kinds, scenario list, run width, replay count, timing, and adapter versions.
If any of those artifacts are missing, regenerate them before doing manual analysis so the report does not depend on transient database state.
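A small pre-flight script can make the artifact check mechanical. The sketch below is illustrative rather than part of the benchmark tooling: the required file names mirror the list above, and the names used for the raw exports and the run-directory argument are assumptions.

```python
#!/usr/bin/env python3
"""Pre-flight check: verify a run directory has the artifacts listed above.

Sketch only. Required names mirror the checklist; optional artifacts
(capacity outputs, raw table exports) are reported but not enforced,
and the export file names are assumptions.
"""
import argparse
import sys
from pathlib import Path

REQUIRED = ["run.meta.tsv", "report.md", "report.json"]
OPTIONAL = [
    "capacity.md", "capacity.json", "capacity.txt",
    "scenario_runs.tsv", "ground_truth_events.tsv",
    "monitor_reports.tsv", "derived_metrics.tsv",
]

def main() -> int:
    parser = argparse.ArgumentParser(description="Check run-directory artifacts")
    parser.add_argument("run_dir", type=Path)
    args = parser.parse_args()

    missing = [name for name in REQUIRED if not (args.run_dir / name).is_file()]
    for name in OPTIONAL:
        if not (args.run_dir / name).is_file():
            print(f"note: optional artifact not present: {name}")
    if missing:
        print("missing required artifacts:", ", ".join(missing))
        return 1
    print("required artifacts present")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```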
Start with the service-by-service pass counts, then inspect the reasons behind non-passing rows:
- `true_positive`, `false_negative`, and `false_positive` counts by service and scenario.
- `capability_mismatch` rows as the support matrix, not as failures.
- `adapter_error` rows by provider error text so API drift, duplicate resources, and rate limits stay visible.
- `maintenance_suppressed`, `cooldown_suppressed`, and `cooldown_uncertain` rows separately from outage accuracy.
- Unknown rows with no derived metric, because those usually point to retrieval or normalization gaps.
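One way to get this breakdown is to aggregate the raw `scenario_runs` export directly. The sketch below assumes a TSV export with `service`, `scenario`, and `outcome` columns whose values are the row categories above; the exact column names depend on how the export was produced.

```python
"""Aggregate outcome counts by (service, scenario) from a scenario_runs export.

Sketch only: assumes a TSV with `service`, `scenario`, and `outcome` columns,
where `outcome` is one of the row categories listed above (true_positive,
false_negative, false_positive, capability_mismatch, adapter_error, ...).
"""
import csv
from collections import Counter

def outcome_counts(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[(row["service"], row["scenario"], row["outcome"])] += 1
    return counts

if __name__ == "__main__":
    for (service, scenario, outcome), n in sorted(outcome_counts("scenario_runs.tsv").items()):
        print(f"{service}\t{scenario}\t{outcome}\t{n}")
```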
For HEAD/GET mismatch scenarios, verify that the expected direction was scored:
- HEAD failure with healthy GET should fail a service only if it reports downtime for a visitor-healthy site.
- GET failure with healthy HEAD should fail a service only if it reports uptime for a visitor-broken site.
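The direction check can be restated as a small predicate. The sketch below is a restatement of the two bullets above, not the scorer's actual logic, and the scenario identifiers are assumptions.

```python
"""Directional scoring for HEAD/GET mismatch scenarios (sketch).

Assumes `scenario` identifies which method was broken (names are
placeholders) and `reported_down` is the service's verdict. Mirrors the
two rules above rather than any specific scorer implementation.
"""

def mismatch_row_fails(scenario: str, reported_down: bool) -> bool:
    if scenario == "head_fail_get_ok":
        # Visitor-healthy site: reporting downtime is the failure.
        return reported_down
    if scenario == "get_fail_head_ok":
        # Visitor-broken site: reporting uptime is the failure.
        return not reported_down
    raise ValueError(f"not a HEAD/GET mismatch scenario: {scenario}")

# Quick self-check against the two rules above.
assert mismatch_row_fails("head_fail_get_ok", reported_down=True)
assert not mismatch_row_fails("get_fail_head_ok", reported_down=True)
```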
Compare detection latency only after filtering to rows where detection actually happened:
- Minimum, maximum, mean, median, and p95 detection latency per service/scenario.
- Samples near the scenario end time, which can indicate probe cadence or measurement-window edge cases.
- Resolve/clear events, when available, separately from down-detection events.
- Probe-location metadata, if the adapter preserved it, for geo-sensitive or quorum-sensitive interpretation.
Latency comparisons are only fair between services that were configured with comparable check intervals and monitor kinds.
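The summary statistics are straightforward once undetected rows are filtered out. The sketch below assumes latencies are already available in seconds per row, with `None` marking rows where no detection occurred; it is not tied to any particular export format.

```python
"""Detection-latency summary per (service, scenario), detected rows only.

Sketch: assumes each row is (service, scenario, detection_latency_seconds)
with None for rows where no detection occurred.
"""
import statistics
from collections import defaultdict

def latency_summary(rows):
    groups = defaultdict(list)
    for service, scenario, latency in rows:
        if latency is not None:  # only rows where detection actually happened
            groups[(service, scenario)].append(latency)
    summary = {}
    for key, values in groups.items():
        values.sort()
        summary[key] = {
            "n": len(values),
            "min": values[0],
            "max": values[-1],
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            # quantiles(n=20) gives 5% steps; index 18 is the p95 cut point.
            "p95": statistics.quantiles(values, n=20)[18] if len(values) > 1 else values[0],
        }
    return summary

if __name__ == "__main__":
    demo = [("svc_a", "hard_down", x) for x in (30, 45, 60, 75, 600)]
    demo.append(("svc_a", "hard_down", None))  # undetected row, excluded
    print(latency_summary(demo))
```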
Adapter errors are benchmark data, but repeated errors should still become engineering follow-ups:
- Group provider errors by stable text or subcode.
- Check whether errors cluster by service, scenario, monitor kind, request method, or replay position.
- Look for leaked provider resources after interrupted or errored runs.
- Confirm cleanup artifacts distinguish deleted, skipped, and ambiguous resources.
- Treat duplicate-resource errors as a cleanup or ownership bug until proven otherwise.
Do not rerun provider cleanup while another campaign is active.
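Building the error buckets usually means normalizing the volatile parts of provider error text (resource IDs, counters, timestamps) before grouping. The sketch below assumes `adapter_error` rows expose the raw error string; the normalization rules are illustrative only.

```python
"""Group adapter_error rows into stable buckets (sketch).

Assumes each row has `service` and a raw `error` string. Normalization is
illustrative: collapse hex-ish tokens and digits so resource IDs and
timestamps do not split otherwise-identical errors into separate buckets.
"""
import re
from collections import Counter

def bucket(error_text: str) -> str:
    text = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", error_text.lower())
    text = re.sub(r"\d+", "<n>", text)
    return text.strip()

def error_buckets(rows) -> Counter:
    counts: Counter = Counter()
    for row in rows:
        counts[(row["service"], bucket(row["error"]))] += 1
    return counts

if __name__ == "__main__":
    demo = [
        {"service": "svc_a", "error": "monitor 1234 already exists"},
        {"service": "svc_a", "error": "monitor 5678 already exists"},
        {"service": "svc_b", "error": "429 Too Many Requests"},
    ]
    for (service, text), n in error_buckets(demo).most_common():
        print(f"{n:3d}  {service}  {text}")
```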
When `capacity.md` or `capacity.json` exists, read it alongside accuracy data:
- Last clean batch and first problem batch.
- DB health at the end of each active window.
- Missed-check percentage, recent checks per minute, p95 check age, and oldest check age.
- Prometheus scrape health before trusting CPU, memory, disk, process, or container series.
- Threshold failures and `stop_recommended` reasons.
- Cleanup status after each batch.
Capacity failures should not be collapsed into scenario misses. They explain why later scenario accuracy may degrade as active monitor count rises.
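This pass over `capacity.json` can also be scripted. The sketch below assumes top-level keys roughly matching the bullets above; the real schema may differ, so treat every key name here as a placeholder.

```python
"""Print the capacity fields the checklist above asks about (sketch).

Key names are placeholders inferred from the checklist, not a documented
schema; adjust them to whatever capacity.json actually contains.
"""
import json

def capacity_overview(path: str = "capacity.json") -> None:
    with open(path) as fh:
        cap = json.load(fh)
    for key in (
        "last_clean_batch", "first_problem_batch", "missed_check_pct",
        "recent_checks_per_minute", "p95_check_age_seconds",
        "oldest_check_age_seconds", "stop_recommended",
    ):
        print(f"{key}: {cap.get(key, '<absent>')}")

if __name__ == "__main__":
    capacity_overview()
```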
Compare the completed run to the most recent comparable baseline:
- Same service ID, adapter version, scenario, monitor kind, method, and interval.
- Pass/total deltas by service.
- Detection-latency shifts large enough to exceed one provider check interval.
- New unsupported/capability-mismatch cells.
- New adapter-error buckets.
- Jetmon v1/v2 behavior on scenarios that previously exposed known issues.
If setup changed between runs, call that out before attributing a regression to adapter or service behavior.
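Pass/total deltas are easiest to read when keyed on the full comparison tuple. The sketch below assumes both runs have already been reduced to per-key `(passes, total)` dictionaries; the tuple shape follows the list above.

```python
"""Pass/total deltas between a completed run and its baseline (sketch).

Assumes both runs are dicts keyed by the comparison tuple above
(service, adapter version, scenario, monitor kind, method, interval)
with (passes, total) values.
"""

def pass_deltas(current: dict, baseline: dict) -> dict:
    deltas = {}
    for key, (cur_pass, cur_total) in current.items():
        base_pass, base_total = baseline.get(key, (0, 0))
        cur_rate = cur_pass / cur_total if cur_total else 0.0
        base_rate = base_pass / base_total if base_total else 0.0
        deltas[key] = round(cur_rate - base_rate, 3)
    return deltas

if __name__ == "__main__":
    key = ("svc_a", "1.4.0", "hard_down", "http", "GET", 60)
    print(pass_deltas({key: (9, 10)}, {key: (10, 10)}))  # -> {key: -0.1}
```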
A useful human report should include:
- Run metadata: window, services, scenarios, sample count, width, and timing.
- Service pass/total table.
- Scenario/service matrix with pass, fail, unsupported, unknown, and suppressed categories.
- Detection-latency table for detected failures.
- Adapter-error breakdown.
- Capacity summary when available.
- Key takeaways and recommended next experiments.
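Assembling the report can be as simple as stitching pre-rendered sections together in this order. The sketch below is a skeleton only: every argument is a markdown fragment assumed to be produced elsewhere (for example, by the summaries sketched earlier), and the section titles just mirror the list above.

```python
"""Skeleton for the human report sections listed above (sketch).

Each argument is a pre-rendered markdown fragment produced elsewhere;
only the section order and titles come from the checklist.
"""

def render_report(metadata, pass_table, matrix, latency_table,
                  error_breakdown, capacity_summary, takeaways) -> str:
    sections = [
        ("Run metadata", metadata),
        ("Service pass/total", pass_table),
        ("Scenario/service matrix", matrix),
        ("Detection latency (detected failures only)", latency_table),
        ("Adapter errors", error_breakdown),
        ("Capacity summary", capacity_summary or "_No capacity suite in this run._"),
        ("Takeaways and next experiments", takeaways),
    ]
    return "\n\n".join(f"## {title}\n\n{body}" for title, body in sections)
```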