This roadmap tracks deferred features that are intentionally not yet implemented. Items below the active line are accommodated in the schema and data model so they can be added later without breaking changes, but their implementation is deferred. Items above the line are next up. The completed section summarizes major shipped capabilities from the commit history, so the roadmap shows how the current shape of the system came together.
Active priorities (next-up, in rough order):
- Jetmon v2 deployed scenario smoke — blocked on runner-to-API reachability and deployed harness config.
- Alert cooldown interaction between runs — blocked on Jetmon v1 bridge write-mode provisioning.
- TLS monitor-facing validation
- Live maintenance-window validation
- Report-driven provider reliability
- Provider-state preflight cleanup
- Active-run operational guardrails
- Campaign hardening dry run
- Provider feature coverage gaps
- Next-wave adapter expansion
- Jetmon capacity benchmark
Lower-priority follow-ups:
Recently completed but kept here for audit context:
Deferred:
- Staggered failure measurement matching
- Method-sensitive HTTP behavior beyond status
- Probe IP discoverability (vendor-side)
- Per-component timing retrieval from adapters
- Redirect baseline change detection
- Heartbeat and agent-based reverse checks
- Internet control-plane and provider-scale stretch tests
- Project foundation — repo scaffold, Makefile targets, Docker Compose local dev, operations docs, schema docs, adapter docs, event model, and the system map are in place.
- Service-agnostic design — the harness no longer carries vendor-specific branches; service details live in `services.toml` and adapter implementations.
- Canonical event log — MySQL-backed `scenario_runs`, `ground_truth_events`, `monitor_reports`, `derived_metrics`, and `campaign_runs` tables preserve raw data and support recomputation.
- Single-scenario execution — the harness can provision monitors, activate controlled failures, record ground truth, retrieve monitor events, deprovision, and close the run.
- Measurement engine — raw events are converted into true positive, false negative, false positive, unknown, maintenance-suppressed, cooldown-suppressed/uncertain, TLS advisory, method-sensitive, and latency metrics.
- Reporting tool — `cmd/uptime-bench-report` produces table, TSV, and JSON campaign summaries with aggregation metadata, bias checks, confidence intervals, capability-mismatch counts, suppression counts, and TLS advisory counts.
- Adapter contract and capability gating — adapters declare check frequency, keyword, maintenance, cooldown, and agent support; incompatible scenario/service pairs become `capability_mismatch` rows instead of misleading false negatives.
- Implemented adapters — Jetmon v1, Jetmon v2, UptimeRobot, Pingdom, Datadog Synthetics, Better Uptime, Gatus, and Uptime Kuma all have concrete adapters.
- Live API smoke coverage — the public probe-based adapters, Jetmon v1 bridge, and Jetmon v2 API have build-tagged live smoke tests or live-test history captured in docs.
- Per-adapter normalization — each adapter owns raw classification mapping into uptime-bench's common vocabulary.
- Adapter live-run hardening — Jetmon v1 retrieval now treats its initial `SITE_DOWN` transition as an outage report, and UptimeRobot provisioning can clean up duplicate harness-owned monitors or adopt the single matching monitor after a timed-out create call.
- Target server — `cmd/target` serves realistic virtual hosts and injects HTTP status, timeout, partial body, redirect, content, method-specific status, TCP, and TLS failures.
- Content failure library — target pages cover canary removal, bad-keyword injection, CMS-style error pages, ransomware, defacement, malicious scripts, and hidden spam links while preserving realistic HTTP behavior.
- DNS server — `cmd/dns` acts as authoritative DNS for the fleet, supports DNS failure modes, serves A/NS/SOA/TXT records, and has ACME TXT control endpoints.
- Geo and method-sensitive cases — source-IP filtering enables geo-scoped failures, and HEAD/GET mismatch scenarios cover false-up and false-down risks for HEAD-only monitors across status, redirect, timeout, and partial-body failures.
- Certmint integrated in-repo — `cmd/certmint` mints and archives certificates, publishes a cert library API, trims stale entries, and has operator docs and provisioning hooks.
- ACME DNS-01 path — DNS members support ACME TXT records and certbot manual hooks, so certmint can issue public certificates through the benchmark nameservers.
- Target TLS support — the target has HTTPS/SNI handling, library certificate selection, healthy certificate fallback, expired/expiring selection, invalid-certificate variants, deprecated TLS modes, and handshake-abort behavior.
- Dynamic cert-library distribution — the harness forwards cert-library URLs to targets; targets poll, cache, swap, prune, and persist certificate-library config across restarts.
- TLS acceptance tooling — local OpenSSL tests cover target TLS behavior, and `deploy/tls-smoke.sh` exercises deployed target HTTPS, handshake-abort, and deprecated-TLS paths through the real control API.
- Campaign methodology — stratified sampling, two-tier sample depth, reproducible seeds, audit trail, and anti-favoritism constraints are documented.
- Campaign implementation — config parsing, pure design/schedule generation, the `campaign_runs` schema, scenario translation, the serial campaign runner, the campaign CLI, a checked-in starter campaign config, run-rate budget preflight, campaign replay metadata, mixed-content escalation audit flags, and campaign metric derivation are implemented.
- Bias-aware reporting — reports flag service/sample imbalance, missing failure/service cells, capability mismatches, and uncategorized Unknown rows, and include confidence intervals before readers compare latency numbers.
- Maintenance windows — scenario parsing, runner gating, adapter provisioning, vendor-side APIs for Pingdom, UptimeRobot, Datadog, Better Uptime, and Jetmon v2, plus `maintenance_suppressed` measurement classification are implemented.
- Overlapping failure-window suppression math — maintenance coverage is computed over the merged union of failure windows, so layered campaign escalations do not double-count overlap when deciding whether an absent alert was suppressed.
- Cooldown reset and classification — capability flags, delete/recreate cleanup paths, campaign replay gating, Jetmon v1 write-mode bridge reset semantics, and cooldown-suppression measurement categories exist where supported.
- Fleet provisioning and deploy flow — scripts create config skeletons, install systemd units, handle DNS port conflicts, deploy binaries, and cover target, DNS, harness, and certmint roles.
- Deployed fleet smoke tooling — `deploy/target-smoke.sh` and `deploy/dns-smoke.sh` exercise target HTTP/TCP/TLS injection and DNS-member injection through the real deployed control APIs, including cleanup checks that fail if active failures remain.
- Adapter smoke ergonomics — the harness supports a `-monitors` override for single-scenario runs, so operators can reuse the checked-in scenario corpus against a specific adapter without creating temporary scenario copies.
- HEAD/GET multi-service comparison — `reports/headget-20260429-030809Z/` preserves the first long-form run across the HTTP 503 control and HEAD/GET mismatch matrix, including raw TSV exports, derived JSON, a redacted service snapshot, and analysis notes.
- Documentation front door — the root README is now a concise project overview, while detailed design, operation, scenario, adapter, event, and roadmap references live under `docs/`.
- Probe IP refresh automation — `cmd/probe-ips-refresh` generates reviewable probe-range fragments, `make refresh-probe-ips` gives operators a local review command, and a weekly GitHub Action opens a PR with the latest generated fragment for operator review.
- Regression coverage — tests cover parsers, runner error handling, adapter factories, DNS handlers, target handlers, cert selection, campaign anti-favoritism, reporting, and live-test compilation.
- CI checks — build, vet, live-test compilation, race testing, and formatting/tidiness checks are represented in the project workflow.
Status: Partially validated; blocked on the deployed harness path. Direct deployed target/DNS smoke passed on 2026-04-28, local Jetmon v2 API contract smoke passes with a current token, and workstation-run harness smoke against the deployed fleet now passes the two highest-priority HEAD/GET mismatch scenarios. The full deployed-harness path against Jetmon v2 is still blocked because the runner host cannot currently reach the dev API and its service config is not enabled for Jetmon v2.
Rechecked on 2026-04-28:
- The Jetmon v2 API health endpoint at the current dev address is reachable from the workstation.
- A replacement API token returns `200 OK` from `/api/v1/me`, and the build-tagged Jetmon v2 live adapter tests pass locally, including provision, retrieve, API contract, and deprovision.
- Workstation-run harness smoke passed `http-503`: run `5aa0a909c9e92249ec657ad995cf3daa` closed with `planned_completion`, retrieved raw `server` / normalized `http_failure` with `http_code=503`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, `detection_latency_s=246.447102`, and left the target control registry clean. No resolve event arrived within the three-minute grace window.
- Workstation-run harness smoke passed `http-timeout-ttfb`: run `4f13f9bf24fca09e86149f498decda47` closed with `planned_completion`, retrieved raw `timeout` / normalized `timeout` with `error_code=1`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, `detection_latency_s=287.160244`, and left the target control registry clean. An earlier run (`c96b5e683135158b85d17111a8bb4f73`) exposed a scoring edge where a timeout probe can report just after `failure_end`; measurement now extends `http_timeout` windows by the configured delay to credit in-flight probes correctly.
- Workstation-run harness smoke completed `http-partial`: run `076d59b17823d827056fe73259fa6c51` closed with `planned_completion`, retrieved `status=known reports=0`, derived `false_negative=1`, `true_positive=0`, `false_positive=0`, and left the target control registry clean. This is useful benchmark evidence: current Jetmon v2 does not detect a 200 OK response that closes after a truncated body.
- Workstation-run harness smoke passed `content-keyword-missing`: run `cf9d071f96faedeec31a61ddcfb267c2` closed with `planned_completion`, retrieved raw `keyword` / normalized `content_failure` with `error_code=5`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, `detection_latency_s=183.007297`, and left the target control registry clean.
- Workstation-run harness smoke confirmed `content-keyword-injected` is a Jetmon v2 capability mismatch for inverted keyword checks: run `7d87d3dab8d6f5c29d77cab5e88bcc1a` recorded `reason_code=capability_mismatch` and derived `unknown=1` with no false negative. That live run also showed the runner still activated the target after all adapters were gated; the runner now short-circuits all-gated runs before failure injection.
- Workstation-run harness smoke using a temporary Jetmon v2 services config passed `http-head-200-get-503` against the deployed target fleet: run `81d439ab1fb15ba53df06dc5adaad71b` closed with `planned_completion`, retrieved `alert_fired`/`alert_resolved`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, and left the target control registry clean.
- Workstation-run harness smoke passed `http-head-405-get-200` after fixing method-sensitive metric semantics: run `d00ad54bf28dcb9577e78f02f2b0d17c` closed with `planned_completion`, retrieved `status=known reports=0`, derived `false_negative=0`, `false_positive=0`, and left the target control registry clean.
- Workstation-run harness smoke passed `tls-invalid-self-signed` against the deployed target fleet: run `ae0ca69b79f065a7c1817ad291f7d159` closed with `planned_completion`, retrieved `alert_fired`/`alert_resolved`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, `detection_latency_s=36.043472`, and left the target control registry clean. Rerun `e8aa57f15b872004d2ab5dad6cbe1293` confirmed the updated adapter stores raw `ssl`, normalized `tls_failure`, `error_code=3`, `true_positive=1`, and `detection_latency_s=35.250424`.
- Workstation-run harness smoke completed `tls-deprecated-tls11`: run `320b79e25edea3fd2c1f76bf8addb4f6` closed with `planned_completion`, retrieved raw `ssl` / normalized `tls_failure`, derived `tls_advisory_false_outage=1`, `tls_advisory_missed=0`, `true_positive=0`, and left the target control registry clean. This shows current Jetmon v2 behavior treats TLS 1.1 as a hard TLS failure rather than a warning-level advisory.
- Workstation-run harness smoke passed `tls-handshake-version-mismatch`: run `0843d3e1c73357184f2e023dd2d1f1e6` closed with `planned_completion`, retrieved raw `ssl` / normalized `tls_failure`, derived `true_positive=1`, `false_negative=0`, `false_positive=0`, `detection_latency_s=252.643508`, and left the target control registry clean. No resolve event arrived within the three-minute grace window, so recovery-latency scoring may need a longer grace or a final post-deprovision refresh if it becomes a benchmark metric.
- Jetmon v2 adapter retrieval now includes `tls_expiry` events and preserves Jetmon `error_code` semantics in raw classifications, so TLS, timeout, redirect, and keyword failures no longer collapse into generic HTTP state labels.
- The harness server times out when calling the same API health endpoint, so runner-to-API reachability is still blocked.
- The deployed harness `/etc/uptime-bench/services.toml` still has `jetmon-v2` disabled, with no API URL or token configured.
Attempted on 2026-04-28:
- The deployed harness binary was stale and could not parse `http_method_status`; redeploying the current harness fixed that parser gap.
- The deployed harness did not have a configured Jetmon v2 `url` / token in `/etc/uptime-bench/services.toml`.
- The deployed harness could not reach the developer Jetmon v2 API at the private test address, while the local workstation could reach `/health`.
- The previously supplied Jetmon v2 tokens returned 401 from `/api/v1/me`, so local smoke could not provision a site. The partial run deactivated the target failure normally, and the target control registry was clean afterward.
Resume this item when the runner host has a reachable Jetmon v2 API URL and the deployed harness service config is enabled with a current write-scope token. The harness now supports `-monitors=jetmon-v2`, so the checked-in scenario corpus can be reused for Jetmon v2 smoke without creating temporary scenario copies.
Run the remaining small monitor-facing scenario set through the real deployed fleet and the Jetmon v2 adapter before broadening to cross-vendor campaigns. This should validate the complete loop: harness provisioning, Jetmon v2 API calls, monitor behavior against injected target failures, retrieval, metric derivation, cleanup, and no remaining active fleet failures.
Initial scenario set:
- HEAD/GET mismatch cases: `http-head-405-get-200.toml` and `http-head-200-get-503.toml` are passing from the workstation-run harness path; repeat from the deployed harness once reachability/config are fixed.
- Basic outage and timing cases: `http-503.toml` and `http-timeout-ttfb.toml` are passing from the workstation-run harness path; `http-partial.toml` runs cleanly but is a Jetmon v2 false negative.
- Content/keyword cases: `content-keyword-missing.toml` is passing from the workstation-run harness path; `content-keyword-injected.toml` is correctly gated as `capability_mismatch` for Jetmon v2 because inverted keyword checks are unsupported; one high-signal compromise page such as `content-defacement.toml` remains to run.
Acceptance:
- Every run records `scenario_runs`, `ground_truth_events`, `monitor_reports`, and derived metrics without adapter errors.
- Jetmon v2 monitor provisioning and cleanup leave no orphaned test sites.
- Target and DNS control registries are clean after every run.
- The HEAD/GET mismatch scenarios specifically exercise the false-down and false-up risks that Jetmon v1 is expected to fail and Jetmon v2 should not fail.
Status: Implemented. Design locked 2026-04-25; end-to-end wiring, capability gating, adapter keyword branches, and tests are now in place.
Found during a review pass on 2026-04-25. At the time, the pieces existed independently but never met:
- `adapter.Capabilities.SupportsKeyword` is set to `true` on Pingdom, UptimeRobot, Datadog, and Better Uptime; to `false` on `jetmon-v1`. Currently nothing in `internal/runner` reads either flag, so it has no effect.
- `adapter.ProvisionConfig.Keyword` exists on the struct (`internal/adapter/adapter.go:91`), but the runner builds the config with only `CheckFrequency` (`internal/runner/runner.go:144`) — `Keyword` is never populated.
- Each adapter's `Provision` ignores `config.Keyword`. Pingdom always creates a status (`type=http`) check; UptimeRobot uses `monitorTypeHTTP = 1` (a keyword check would be `type=2`); Datadog only adds a `statusCode` assertion; Better Uptime always uses `monitor_type = "status"`.
- Scenario TOMLs already carry `keyword = "uptime-bench-canary"` for the keyword scenarios, and the runner forwards that string to the target binary's control plane (`internal/runner/runner.go:403`), which uses it to know what string to remove or inject when serving tampered content. So the target side is keyword-aware — the monitor side is not.
Current state: scenario-level `keyword` / `keyword_check` are parsed, defaulted for content scenarios, passed through the runner into `ProvisionConfig`, and gated against `SupportsKeyword` / `SupportsInvertedKeyword`. Pingdom, UptimeRobot, Datadog Synthetics, Better Uptime, and Jetmon v2 all exercise their supported keyword paths in unit tests; unsupported combinations produce `reason_code = "capability_mismatch"` instead of false negatives.
- Scenario format — `keyword` is a top-level scenario field (not per-failure), since it's a property of monitor configuration. Sibling field `keyword_check` takes values `present` (alert when the keyword is absent — the canary case) or `absent` (alert when the keyword is present — the injected-bad-keyword case). When a scenario contains any content failure but doesn't set `keyword` explicitly, the runner defaults `keyword = "uptime-bench-canary"` and `keyword_check = "present"`.
- Capability gating — when the scenario sets a keyword and the adapter's `SupportsKeyword == false`, the runner skips Provision and writes a single `monitor_reports` row with `Status = Unknown` and a structured `reason_code = "capability_mismatch"` plus a free-form Reason describing the missing capability. Same pattern the runner already uses for `MinCheckFrequency`; a gating sketch follows this list. Capability-mismatch results are first-class data, not noise — they are the support matrix for "which services support which features," which is a project deliverable. Reporting must distinguish them from genuine false negatives (see events.md).
- Better Uptime inverted keyword support — Better Stack's monitor API supports both `monitor_type = "keyword"` and `monitor_type = "keyword_absence"`. The adapter now exposes `SupportsInvertedKeyword = true` on GET lanes and gates keyword checks on HEAD lanes, where no response body exists.
- ✅ Schema migration — add a `reason_code` column to `monitor_reports` (nullable string; existing rows back-fill empty). Update `internal/db` writes and the runner's `logMonitorReport` to set it. This is the prerequisite that lets capability gating be queryable as a support matrix; ship before any of the keyword work to keep the migration small and isolated.
- ✅ Scenario format — promote `keyword` to scenario-level and add `keyword_check`; update the `internal/scenario` parser, validator, and the corpus check; update existing keyword scenarios to the new format.
- ✅ Runner — populate `ProvisionConfig.Keyword` and `ProvisionConfig.KeywordCheck` from the scenario; add a capability-gating branch that mirrors the existing `MinCheckFrequency` branch.
- ✅ Adapters — branch on `config.Keyword != ""`:
  - Pingdom: `newCheckRequest` carries `shouldcontain` (present check) or `shouldnotcontain` (absent check). Type stays `"http"`.
  - UptimeRobot: switch `type` from `1` (HTTP) to `2` (Keyword); set `keyword_type=1` for "exists" (present check) or `keyword_type=2` for "not exists" (absent check); set `keyword_value`. A parameter sketch follows this checklist.
  - Datadog: append a `body` assertion to the existing `Assertions` list with operator `contains` (present) or `does not contain` (absent).
  - Better Uptime: switch `monitor_type` from `"status"` to `"keyword"` for present checks and `"keyword_absence"` for absent checks; set `required_keyword`. Keyword monitors force GET.
- ✅ Tests — unit tests for each adapter's keyword branches; an integration test that asserts capability gating writes a `monitor_reports` row with the expected `reason_code` instead of a Provision call; the corpus check covers the new scenario fields.
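To make the UptimeRobot branch concrete, a sketch of the form parameters using the API field names from the checklist above; the function name and the surrounding request plumbing (auth transport, endpoint, response handling) are hypothetical:

```go
package uptimerobot

import "net/url"

// buildMonitorParams sketches the keyword branch of a newMonitor call.
// Field names and values follow the mapping in the checklist; the real
// adapter wraps this in its own request/response types.
func buildMonitorParams(apiKey, name, monitorURL, keyword, keywordCheck string) url.Values {
	v := url.Values{}
	v.Set("api_key", apiKey)
	v.Set("friendly_name", name)
	v.Set("url", monitorURL)
	if keyword == "" {
		v.Set("type", "1") // plain HTTP status monitor
		return v
	}
	v.Set("type", "2") // keyword monitor
	if keywordCheck == "absent" {
		v.Set("keyword_type", "2") // "not exists" check (the absent case)
	} else {
		v.Set("keyword_type", "1") // "exists" check (the present case)
	}
	v.Set("keyword_value", keyword)
	return v
}
```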
Remaining follow-up: broaden live API smoke coverage for each vendor's keyword branch, especially inverted keyword checks and Better Uptime's `keyword_absence` mode. Jetmon v1 remains `SupportsKeyword = false` for this comparable keyword axis.
Status: Target/DNS direct acceptance is implemented; scenario corpus coverage is now present for monitor-facing TLS validation. Aged-certificate and live monitor-facing TLS validation remain active priorities.
The target binary now exposes an HTTPS listener with a generated self-signed fallback certificate. With `-cert-library-manifest`, healthy requests use the longest-valid matching library certificate, while active `tls_expired` and `tls_expiring` failures select the closest matching expired/expiring certificate for the request SNI. `tls_invalid` can force the generated self-signed cert or a generated hostname-mismatch cert, `tls_deprecated` can clamp the HTTPS listener to TLS 1.0 or TLS 1.1, and `tls_handshake` aborts the handshake before certificate selection. Remaining TLS work: deployed-fleet probe acceptance against a real certmint-produced library.
Deployed direct acceptance now covers healthy HTTP/HTTPS, method-sensitive HEAD/GET mismatches, HTTP status/body/redirect/partial/timeout failures, global `tcp_refused`, `tls_invalid` self-signed and hostname-mismatch variants, `tls_handshake`, and `tls_deprecated` across both live cert-library domains. It also covers direct DNS-member injection on both nameservers. Checked-in monitor-facing scenarios now cover `tls_invalid` self-signed and hostname mismatch, `tls_handshake` version mismatch, `tls_deprecated` TLS 1.1, `tls_expiring` at five days, and `tls_expired` at thirty days. The first monitor-facing Jetmon v2 smoke passed for `tls-invalid-self-signed` on 2026-04-28; a rerun after the adapter fix stored raw `ssl` and normalized `tls_failure` for Jetmon v2 `error_code=3` instead of generic `http_failure`. `tls-deprecated-tls11` also runs end-to-end, but Jetmon v2 currently reports it as a hard TLS outage, so uptime-bench correctly records `tls_advisory_false_outage=1`. `tls-handshake-version-mismatch` now has monitor-facing true-positive evidence, with a note that late detection did not resolve within the current grace window. Remaining TLS work is narrower: run the rest of those scenarios through real monitors, and repeat `tls_expired` / `tls_expiring` acceptance once certmint has aged snapshots suitable for those scenarios.
- Add a `:443` listener to `cmd/target` using `crypto/tls` (a listener sketch follows this list).
- Generate a self-signed fallback cert on startup; future work may persist it under `/etc/uptime-bench/tls/` if stable fingerprints become useful.
- Wire the listener to the same virtual-host router used for `:80` so healthy pages work over HTTPS.
- Acceptance: `curl -k https://bench-a.<domain>/` returns the canary body; the shipped HTTP/TCP/content scenarios still pass on the HTTPS variant.
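A minimal sketch of the Phase 1 listener shape, assuming a stand-in `http.ServeMux` for the target's real virtual-host router and a placeholder hostname:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"log"
	"math/big"
	"net/http"
	"time"
)

// selfSignedFallback generates the startup fallback certificate the
// Phase 1 listener uses when no library certificate applies.
func selfSignedFallback(hosts []string) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "uptime-bench fallback"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(90 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:     hosts,
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

func main() {
	mux := http.NewServeMux() // stand-in for the target's virtual-host router
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("uptime-bench-canary\n"))
	})

	cert, err := selfSignedFallback([]string{"bench-a.example.test"}) // placeholder host
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{
		Addr:      ":443",
		Handler:   mux, // same router as the :80 listener
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	go http.ListenAndServe(":80", mux)       // plain listener unchanged
	log.Fatal(srv.ListenAndServeTLS("", "")) // certs come from TLSConfig
}
```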
- Pre-generate a library of certs at varying ages: fresh (90 days remaining), expiring soon (1, 5, 30 days remaining), already expired (1 day, 30 days, 1 year), self-signed by an unknown CA, signed for the wrong hostname.
- Producer/consumer split inside the repo: `cmd/certmint` owns real Let's Encrypt/certbot issuance and writes an immutable library plus `manifest.json` (`internal/certmint/`); the target's TLS listener consumes the manifest via `internal/certlibrary/` for SNI-aware selection, with deterministic fleet-CA/self-signed fallbacks when no public cert applies. The `uptime-bench-dns` member exposes `PUT`/`DELETE /acme/txt` control endpoints so certmint's certbot manual hooks (`deploy/acme-hooks/`) can install DNS-01 challenge records on the same nameservers that resolve the benchmark hostnames — no Cloudflare delegation needed for the runtime domains.
- New control API params for `tls_expired`/`tls_expiring` select a library member at activation time. `tls_invalid` supports self-signed and hostname-mismatch variants.
- Library structure: `manifest.json` is the contract. Filenames may encode age/profile for operator readability, but the target must select by manifest metadata rather than reparsing certificates at request time.
- Local acceptance: real in-process TLS handshakes and OpenSSL `s_client` checks confirm `tls_expired days_expired=30` and `tls_expiring days_remaining=5` serve the matching library certificate. Remaining fleet acceptance should repeat this against a certmint-produced library on deployed targets.
- `tls_handshake`: implemented for target-side config selection by returning a deterministic handshake error before certificate selection. The probe receives a TLS alert; no HTTP response.
- `tls_deprecated`: implemented for target-side config selection by clamping `tls.Config.MaxVersion` to TLS 1.1 or TLS 1.0. In-process TLS handshake tests cover the target behavior.
- OpenSSL acceptance: `openssl s_client -tls1_3 ...` fails the handshake when `tls_handshake` is active; `openssl s_client -tls1_1 ...` succeeds when `tls_deprecated` is active. `deploy/tls-smoke.sh` repeats the protocol checks against deployed targets through the control API, and `deploy/target-smoke.sh` folds those checks into broader HTTP/TCP/TLS deployed-target acceptance. Remaining fleet acceptance is monitor-facing probe smoke against a real certmint-produced library with aged snapshots for `tls_expired`/`tls_expiring`.
Measurement note for `tls_deprecated`: implemented. Because the request actually returns 200 OK, monitor outcomes split three ways: missed advisory, correct TLS-advisory classification, or false outage report. The measurement engine records these as `tls_advisory_missed`, `tls_advisory_detected`, and `tls_advisory_false_outage`, distinct from true-positive and false-negative.
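A small sketch of that three-way split, with illustrative names; the real measurement engine derives these outcomes from report rows rather than from two booleans:

```go
package measurement

// tlsAdvisoryOutcome labels monitor behavior during a tls_deprecated
// window, where the HTTP request itself succeeds with 200 OK.
type tlsAdvisoryOutcome string

const (
	tlsAdvisoryMissed      tlsAdvisoryOutcome = "tls_advisory_missed"
	tlsAdvisoryDetected    tlsAdvisoryOutcome = "tls_advisory_detected"
	tlsAdvisoryFalseOutage tlsAdvisoryOutcome = "tls_advisory_false_outage"
)

// classifyDeprecatedTLS: a hard outage alert on a 200 OK response is a
// false outage, a warning-level TLS advisory is a detection, and
// silence is a missed advisory.
func classifyDeprecatedTLS(firedOutageAlert, firedTLSAdvisory bool) tlsAdvisoryOutcome {
	switch {
	case firedOutageAlert:
		return tlsAdvisoryFalseOutage
	case firedTLSAdvisory:
		return tlsAdvisoryDetected
	default:
		return tlsAdvisoryMissed
	}
}
```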
- Generate a single CA root for the fleet; sign all virtual-host certs from it.
- Each monitor under test must trust the fleet CA. Status by service:
- Jetmon (self-hosted) — trivial. Install the fleet CA on the agent host.
- Datadog Synthetics — supports custom CA via API; needs concrete validation.
- Pingdom / UptimeRobot / Better Uptime — research needed; some likely don't support custom CAs at all.
- For monitors that don't support custom CAs, the interim path is real Let's Encrypt certs: keep Phase 1's listener but back it with a Let's Encrypt issuer. Document the tradeoffs (Phase 2 cert library is harder to construct from real-issued certs, especially for "1 year expired").
- Phase 4 unblocks ~100% TLS coverage; Phases 1–3 already unblock the bulk using self-signed.
Required by Phase 1 once we have multiple virtual hosts, but ordering with the phased work above is flexible. The TLS listener inspects SNI and serves the matching cert; without this, multi-site scenarios run only on whichever cert was bound to the listener default.
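A minimal sketch of SNI-aware selection using `crypto/tls`'s `GetCertificate` hook, assuming a simple per-host certificate map; the real target selects by cert-library manifest metadata:

```go
package main

import (
	"crypto/tls"
	"strings"
)

// sniSelect inspects the ClientHello server name and serves the
// matching virtual-host certificate, falling back to the listener
// default when there is no SNI or no match.
func sniSelect(byHost map[string]*tls.Certificate, fallback *tls.Certificate) *tls.Config {
	return &tls.Config{
		GetCertificate: func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
			name := strings.ToLower(strings.TrimSuffix(hello.ServerName, "."))
			if cert, ok := byHost[name]; ok {
				return cert, nil
			}
			return fallback, nil // no SNI or unknown host: listener default
		},
	}
}
```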
Status: Next after Jetmon v2 scenario smoke, cooldown, TLS validation, and maintenance validation. The campaign runner is implemented for serial single-target execution, but it still needs a small live dry run before it is trusted for publishable data.
Run a deliberately small campaign against Jetmon v2 only, using a single target and a narrow scenario mix. The point is not statistical power yet; it is to expose long-run cleanup, scheduling, rate-limit, retry/error, database, and reporting problems while the blast radius is small.
Acceptance:
- Campaign preflight accepts the config without exceeding service run-rate budgets.
- Every replay either produces usable metrics or a classified, queryable reason such as `capability_mismatch`, `adapter_error`, `cooldown_suppressed`, or `cooldown_uncertain`.
- `uptime-bench-report` produces table, TSV, and JSON output from the resulting campaign run.
- No monitor, target, or DNS state is left active after the dry run.
Status: In progress. A provider feature pass found several service capabilities that uptime-bench should model explicitly before or alongside the next adapter wave. Some are now represented as scenario files; others need new adapter capability fields or monitor-kind support so results are not distorted by forcing every provider into an HTTP-status monitor shape.
Implemented locally in the scenario corpus:
- Additional content-body failures: `content-empty` and `content-error-page`.
- Redirect coverage: `http-redirect-loop` and `http-redirect-chain`.
- Timeout coverage: `http-timeout-body` and `http-timeout-total` in addition to the existing TTFB stall.
- TCP coverage: `tcp-timeout` in addition to `tcp-refused`.
- DNS coverage: `dns-nxdomain`, `dns-servfail`, `dns-timeout`, `dns-cname-nxdomain`, `dns-latency`, and both `dns_ns_unavailable` modes.
- Slow-success coverage: `http-latency-threshold` exercises response-time threshold assertions without turning the HTTP request into a timeout.
- Header-sensitive coverage: `http-header-status` verifies that an adapter can configure custom request headers and that the target can fail only matching monitor probes.
Adapter-surface improvements started:
- Better Uptime now supports GET/HEAD status-lane configuration and `keyword_absence` for forbidden-content checks on GET lanes.
- Datadog Synthetics now supports GET/HEAD HTTP API-test configuration, disables body assertions on HEAD lanes, configures custom request headers, configures response-time assertions, and preserves result location metadata when the API returns it.
- Jetmon v2 now receives custom request headers from scenario config through its adapter.
- `services.example.toml` documents the optional `http_method` setting for Better Uptime and Datadog Synthetics.
- Scenario parsing, adapter capabilities, and runner capability gating now understand native `monitor_kind` values (`http`, `dns`, `tcp`, `ssl_certificate`, `heartbeat`), custom `request_headers`, and `response_time_threshold`.
- DNS scenario execution now targets DNS control members directly, supports `fresh_hostname = true` for generated per-run hostnames, and records `failure_not_observable` unknown rows when authoritative preflight probes cannot see the injected DNS failure.
Feature gaps to model next:
- Native DNS/TCP/SSL monitor adapter paths. The schema/gating layer is implemented, but every current adapter still provisions HTTP monitors unless explicitly extended. This is deferred because each provider uses a different request and result schema for native DNS, TCP/port, and SSL/certificate monitors; enabling those paths without stale cleanup and classification tests would produce misleading comparisons.
- DNS resolver-exposure artifacts. Authoritative DNS preflight is implemented. A later reporting pass should also capture service-host resolver checks when available, raw Jetmon v2 DNS event metadata/transitions, and `dns.*` StatsD metrics so DNS failures can be separated into "not injected", "not recursively observable", and "observable but missed by the service" buckets.
- Dedicated certificate and domain-expiry products. Current TLS scenarios test HTTPS probe behavior. Native SSL/certificate monitor kinds should come next after one provider adapter is wired through `monitor_kind = "ssl_certificate"`. Domain-expiry monitors remain deferred because reliable simulation requires registrar/RDAP behavior rather than only target TLS behavior.
- Heartbeat/push checks. `monitor_kind = "heartbeat"` is reserved, but no adapter provisions heartbeat monitors yet. The intended first implementation is harness-owned heartbeat sending: the adapter provisions a heartbeat endpoint, the harness sends check-ins during healthy periods, and `heartbeat_stopped` pauses those check-ins. This is deferred until the first adapter exposes heartbeat creation and stale cleanup.
- Header/auth-sensitive checks beyond custom headers. Custom request-header support is implemented for Datadog and Jetmon v2. Authentication flows and user-agent divergence remain deferred because they need clearer cross-provider request-shape controls and target fixtures beyond a single deterministic header.
- ICMP/ping checks. Treat ping checks as a separate monitor kind. Deferred because credible ICMP failure injection needs host/firewall-level or isolated-VM control; doing this on the shared target fleet could interfere with concurrent HTTP/DNS/TLS scenarios.
- Browser/API assertion checks. Datadog and Checkly can run richer API or browser assertions. Deferred from the first comparison table because browser checks have different cost, flake, dependency, and timing behavior than single-probe uptime checks. API assertions should be the first sub-track when a provider adapter needs richer structured checks.
- Regional quorum and probe-location behavior. Datadog result location metadata is now preserved when present. Provider-controlled location selection and alert quorum are deferred because they require provider-specific provisioning fields and report dimensions; publishable geo results should wait until location/quorum settings are explicit in run metadata.
Acceptance:
- Scenario docs and campaign configs include the new scenario files without requiring temporary one-off TOML copies.
- Adapter capability reporting distinguishes unsupported monitor kinds from false negatives.
- New monitor-kind adapter paths include stale-resource cleanup before they are enabled in campaign runs.
- Reports can break out "HTTP monitor observing DNS/TLS/TCP failure" from "native DNS/TLS/TCP monitor" so service comparisons stay fair.
Status: In progress. Gatus and Uptime Kuma have deployed self-hosted instances on single-service hosts plus narrow uptime-bench bridges. Initial harness-driven smoke covered provision, retrieve, deprovision, cleanup, and an injected HTTP failure for both adapters. They are ready for controlled campaign inclusion, with the caveat that reports must identify them as single-origin self-hosted checks rather than global SaaS probe networks.
Recommended order:
- Uptime Kuma — first self-hosted UI-driven comparison point. Deployed with pinned `louislam/uptime-kuma:2.3.0` and an uptime-bench bridge, because direct automation uses Uptime Kuma's internal Socket.IO surface. Initial adapter coverage is HTTP status and present-keyword checks; inverted keyword, maintenance, TCP, DNS, and TLS/cert support remain deferred until validated.
- Gatus — second self-hosted comparison point. Deployed with pinned `ghcr.io/twin/gatus:v5.35.0` and an uptime-bench bridge that manages a generated config fragment while reading endpoint status/history from the public API. Initial adapter coverage is HTTP status, present/inverted keyword checks, response-time threshold, and custom request headers. Next capability work is validating Gatus native DNS/TCP/TLS checks behind explicit `monitor_kind` support.
- updown.io — first additional third-party service. Its API is simple, supports create/update/delete checks, exposes downtimes, publishes node/IP APIs, and supports HTTP/TCP/ICMP-like coverage, string matching, and configurable `GET`/`HEAD` behavior.
- StatusCake — useful market comparison with uptime APIs and period/history endpoints.
- Checkly — high-capability API checks with method/assertion support; valuable after the simpler API-shaped adapters prove out the expansion path.
- Grafana Cloud Synthetic Monitoring — useful blackbox-style synthetic monitor with a REST API, but setup/auth and result retrieval are more involved.
Expansion acceptance:
- Each new adapter implements provision, retrieve, deprovision, capability declaration, normalization, and stale-resource cleanup before it is enabled for campaign runs.
- `services.example.toml` documents auth, capacity, and any probe-range/region fields for the service.
- The first live smoke for each adapter covers at least `http-503`, one HEAD/GET mismatch, one content/keyword scenario if supported, and cleanup verification.
- Self-hosted adapters are labeled as single-origin/self-hosted in reports so they are not confused with global SaaS probe networks.
- Any adapter relying on an unstable or internal upstream API must pin the upstream version and document the automation risk.
Status: In progress after `reports/v2-regression-9am-20260502-063755Z/`.
The latest long run had no Jetmon v2, Pingdom, or Better Uptime adapter
errors, but it exposed 56 provision-time errors across UptimeRobot and Datadog
Synthetics:
- UptimeRobot: 39 total (`maintenance start_time invalid_parameter`: 15, `newMonitor already_exists`: 22, API timeout after `newMonitor`: 2).
- Datadog Synthetics: 17 total (`downtime invalid scope`: 17).
Implemented hardening from that report:
- UptimeRobot one-shot maintenance windows now round/clamp `start_time` to the next full second, so a subsecond run start is not sent as a timestamp that is already just barely in the past (see the clamp sketch after this list).
- UptimeRobot monitor names now include a short hash of the monitor URL, and the runner spreads monitor URLs across configured site paths with a per-scenario query token. This reduces duplicate-name and duplicate-URL collisions in parallel matrices while keeping failures scoped to the same path the monitor checks.
- UptimeRobot uncertain create timeouts retry safely after attempting to adopt a single matching harness-owned monitor.
- Datadog downtime creation now sends an explicit global scope alongside the monitor id.
- Provision/retrieve failures now write `monitor_reports` rows with `reason_code = "adapter_error"`, and markdown campaign reports include a reason-code table so these failures are visible without reconstructing them from logs.
- Human-readable and JSON campaign reports now also include reason-detail buckets split by provider/adapter error text, so errors such as duplicate monitor collisions, maintenance timestamp validation failures, API timeouts, and provider schema errors can be compared without mining raw logs.
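The `start_time` clamp reduces to a round-up-to-whole-second rule; a standalone sketch (the function name is illustrative):

```go
package uptimerobot

import "time"

// nextFullSecond sketches the start_time clamp: a run that starts at a
// subsecond instant (e.g., 12:00:00.35) must not be sent to the vendor
// as 12:00:00, which is already slightly in the past by the time the
// request lands. Rounding up keeps the one-shot window valid.
func nextFullSecond(t time.Time) time.Time {
	truncated := t.Truncate(time.Second)
	if truncated.Before(t) {
		return truncated.Add(time.Second)
	}
	return truncated // already on a whole-second boundary
}
```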
Acceptance for the next run:
- Live smoke UptimeRobot and Datadog on `maintenance-http-503-full-cover`; both should provision maintenance without adapter errors and should score as maintenance-suppressed if the vendor behaves as expected.
- The next matrix should show zero, or nearly zero, UptimeRobot `already_exists` errors. If they persist, enhance stale cleanup to search by benchmark URL in addition to exact friendly name, and add a plan-level duplicate-URL preflight.
- Report generation buckets adapter-error reasons by provider error text, not only by the structured `adapter_error` code. The next report review should confirm those buckets are specific enough in real output; if they are still too noisy, adapter-specific error normalization can add stable subcodes without changing the raw-detail table.
- Geo-scoped `http-geo-503` remains a benchmark validation gap, not a service-specific finding, because every service failed it in the latest run. Before publishing geo results, audit probe CIDRs against actual source IPs observed by the target and flag services with unverifiable probe ranges.
Status: MVP implemented for public API adapters. `cmd/uptime-bench-cleanup` can run independently before a matrix batch, loads the enabled services and fleet scope, supports dry-run/delete modes, summarizes per-service counts, reports reason/kind breakdowns for idempotency diagnostics, and uses adapter-owned stale cleanup for UptimeRobot, Pingdom, Datadog Synthetics, and Better Uptime. Jetmon v1/v2 intentionally remain unsupported by this generic provider cleanup until their synthetic benchmark-site ownership rules are explicit enough to delete safely.
Add a cleanup command or harness preflight that lists harness-owned monitors/tests/checks for every enabled provider, filters only resources with uptime-bench-owned names/tags/URLs, and deletes stale resources left by an interrupted run, timed-out provider API call, or host reboot. The command should have a dry-run mode, print per-provider counts, fail closed when ownership is ambiguous, and leave non-benchmark monitors untouched.
Provider-specific ownership signals:
- UptimeRobot: `friendly_name` prefix `uptime-bench:` plus matching benchmark URL.
- Pingdom: check name prefix `uptime-bench:` plus host/path match.
- Datadog Synthetics: `uptime-bench` tag and `uptime-bench:<target_id>` tag.
- Better Uptime: `pronounceable_name` prefix `uptime-bench:` plus benchmark URL.
- Jetmon v1/v2: bridge/API-created synthetic benchmark sites only; read-only/pre-seeded Jetmon v1 monitors should not be deleted.
Acceptance:
- ✅ Preflight can run independently before a matrix batch and summarize `found`, `deleted`, `skipped`, and `error` counts per service.
- Stale cleanup happens before provider capacity checks so leaked monitors do not consume Better Uptime/UptimeRobot/Pingdom caps.
- ✅ Deletion uses the same idempotent deprovision paths as normal teardown where possible.
- ✅ Ambiguous matches are reported but not deleted automatically.
- ✅ Cleanup summaries include reason/kind breakdowns so repeated dry-runs can show whether remaining resources are unsupported, ambiguous, stale, or outside scope.
Status: Documented as an operator constraint; tooling guardrails are not implemented. During a long campaign or capacity suite, do not run provider cleanup, deploy binaries, sync fleet config, restart target/DNS services, run adapter live smokes, or mutate report directories. Those actions can delete monitors, change target routing, restart the failure surface, or alter provider state while the run is collecting evidence.
This was intentionally skipped during an active overnight test. Safe concurrent work is limited to read-only inspection and local-only code, docs, and unit tests that do not call the live fleet or provider APIs.
Remaining follow-up:
- Add an active-run guard to mutating operator commands (`uptime-bench-cleanup`, deploy helpers, fleet sync helpers, and live smoke wrappers) so they warn or fail closed when a campaign/capacity run appears to be active.
- Define the active-run signal. Good candidates are an in-progress `campaign_runs` row, a capacity-suite state/lock file, or an explicit operator-owned lock path. The signal should be local to the orchestrator and should not require provider API calls. A lock-file sketch follows this list.
- After the current active test window completes, run the parked live tasks: provider cleanup dry-run, adapter smoke checks, target/DNS cleanup checks, and any needed fleet config sync/restart.
- Keep report finalization separate from provider cleanup so analysis can be regenerated without changing live monitor state.
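As one possible shape for the guard, a sketch assuming the lock-file signal option above; the path, behavior, and override flag are all hypothetical, since this follow-up is not implemented:

```go
package opsguard

import (
	"errors"
	"fmt"
	"os"
)

// Hypothetical operator-owned lock path: a campaign would write it at
// start and remove it at finalization. Mutating commands fail closed
// while it exists unless an explicit override flag is set.
const lockPath = "/var/run/uptime-bench/active-run.lock"

func guardActiveRun(override bool) error {
	info, err := os.Stat(lockPath)
	if errors.Is(err, os.ErrNotExist) {
		return nil // no active-run signal; safe to proceed
	}
	if err != nil {
		return fmt.Errorf("checking active-run lock: %w", err)
	}
	if override {
		// Documented emergency path: print exactly what was ignored.
		fmt.Printf("WARNING: ignoring active-run lock %s (held since %s)\n",
			lockPath, info.ModTime().Format("2006-01-02 15:04:05"))
		return nil
	}
	return fmt.Errorf("active run detected (%s); refusing to mutate live state", lockPath)
}
```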
Status: Partially implemented. Methodology locked 2026-04-27. Config parsing, deterministic generation, cooldown-aware scheduling, serial execution, single-target scope preflight, a checked-in starter config, run-rate budget preflight, mixed-content escalation instrumentation, metric derivation, markdown/JSON finalization, preflight timing estimates, and reporting are in place; remaining work is final published-campaign sampling policy, concurrent execution, and live-campaign hardening.
The harness today runs one scripted scenario at a time. That model is fine for targeted tests ("does Pingdom detect a 503?") but it can't produce the data the project actually exists to publish: min, max, and average detection times of specific kinds of failures across the different services, computed from enough samples that the numbers are defensible.
A campaign is a long-running orchestration mode where the harness self-generates randomized scenarios for hours or days against the live fleet, accumulating thousands of scenario_runs rows that can be aggregated into per-(failure_type, service) statistics. Existing single-scenario mode remains for targeted testing.
Pure weighted-random ("Chaos Monkey style") was considered and rejected. It's the right tool for resilience testing in production but produces a Poisson-like sample distribution across failure modes that breaks every requirement of a benchmark: reproducibility, coverage parity, comparability, and statistical power. Some cells get oversampled; others undersampled; per-service comparisons over different distributions become meaningless.
Instead, every campaign uses stratified random sampling:
- A cell is one combination of `(failure_type, duration_bucket, host_pattern)`. Every scenario generated for a cell tests all enabled services simultaneously — same target, same timing, fan-out to all monitors — so per-service comparisons are over an identical distribution by construction.
- Each design is replayed N times across the campaign duration, with the replay times distributed across hour-of-day and day-of-week to break cadence-alignment bias. This is the part that picks up the Chaos-Monkey instinct: same logical scenario, run at unpredictable absolute times, so a service whose internal scheduling happens to align (or misalign) with our cadence can't get a systematically wrong-looking number.
- Within each cell, the details are randomized per-design: which exact duration in the bucket's range, which exact status code, which target from the pool — but the cell totals and the global structure are not.
Net effect: campaigns are reproducible (same master seed → same design set → same execution sequence), comparable (every service sees the same distribution), and resistant to cadence/parameter-alignment bias.
Not every cell needs the same number of samples. The driver isn't real-world frequency of the failure mode — it's the expected size of inter-service differences in the resulting numbers. Counter-intuitively, "common" failures often need more samples, not fewer:
- Wide-margin cells (e.g., content tampering, certificate revocation): some services support them, some don't. The interesting result is the support matrix and the fact-of-detection. n=20 is plenty to establish "service A detects, service B doesn't, service C is `capability_mismatch`."
- Narrow-margin cells (e.g., HTTP 5xx, TLS expiration): every service detects these. Differences are seconds, not orders of magnitude. To say "service A is faster than service B" with confidence, you need enough samples that the confidence intervals don't overlap — typically n ≥ 50.
So the campaign config specifies a default tier (n=20 per cell) and an explicit high-discrimination tier (n=60 per cell) for failure types where service-to-service differences are expected to be small.
For a first-iteration campaign, the methodology locks in:
- K = 50 scenario designs total (across all cells in both tiers).
- N = 20 default replays per design for default-tier cells; N = 60 replays per design for high-discrimination-tier cells.
- Per-campaign run total: roughly 1,000–1,500 scenario runs depending on the tier split (for example, 40 default-tier designs × 20 replays plus 10 high-discrimination designs × 60 replays is 1,400 runs).
- Diminishing returns hit around n=60–80. Beyond that, extra samples don't visibly tighten confidence intervals on a typical detection-latency distribution. n=60 is the cap for the high-discrimination tier; pushing it higher mostly buys smugness, not signal.
These numbers can be re-evaluated after the first real campaign produces data showing where the actual noise floor lies.
This project is open-source, public, and intended to evaluate Jetmon (the maintainer's product) honestly against the competition. To make the benchmark trustworthy for everyone — readers, competing services, and the maintainer — anti-favoritism must be enforced mechanically in the code, not relied on as a policy:
- The campaign generator and reporter contain no service-specific branches. No `if serviceID == "jetmon-v1"` anywhere in the campaign or reporting code, ever. A simple lint test in CI grepping for known service IDs in those files would enforce this.
- The reporter's bias self-checks run before any latency numbers: per-service sample counts, per-cell sample counts, and any service whose count deviates from the others by more than a stated threshold is flagged in the output. If Jetmon ran 100× and Pingdom 80× because of budget differences, that's the first line of the report, not buried.
- Confidence intervals are mandatory, not optional. Every percentile in the output ships with a CI so readers can judge whether differences are meaningful. "Pingdom 70s ± 12s, UptimeRobot 95s ± 18s" is publishable; "Pingdom 70, UptimeRobot 95" is not. A Wilson-interval sketch follows this list.
- Errors are part of the data, not retried away. If a campaign run got `adapter_error` because a vendor's API was flaky that day, that's the truth — not retried, not filtered out, not silently dropped. The report shows it. Same principle as `capability_mismatch` rows: they're queryable categories, not noise.
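The detection-rate intervals the reporter ships are Wilson score intervals (see the phase checklist later in this section); a minimal standalone sketch of that computation, assuming a simple successes/samples input:

```go
package report

import "math"

// wilson95 computes the Wilson score interval for a detection rate:
// successes = true positives, n = scored samples; z = 1.96 for 95%.
func wilson95(successes, n int) (lo, hi float64) {
	if n == 0 {
		return 0, 1 // no data: the interval is vacuous
	}
	const z = 1.959964
	p := float64(successes) / float64(n)
	nf := float64(n)
	denom := 1 + z*z/nf
	center := (p + z*z/(2*nf)) / denom
	half := z * math.Sqrt(p*(1-p)/nf+z*z/(4*nf*nf)) / denom
	return math.Max(0, center-half), math.Min(1, center+half)
}
```

For example, 59 detections out of 60 samples yields roughly a 0.91–0.999 interval, which is why a 0.98-vs-0.96 rate difference at n=60 is not, on its own, publishable as a ranking.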
Every published result must be regenerable from a small set of pinned inputs. The schema needs to make these queryable:
- `campaign_runs.config_toml` — the full campaign config, stored verbatim.
- `campaign_runs.master_seed` — the seed that generated the design set.
- `campaign_runs.adapter_versions` — JSON map of each adapter's commit SHA at campaign start.
- `campaign_runs.target_fleet_version` — commit SHA of the target/DNS binaries.
- `campaign_runs.started_at`, `ended_at` — wall-clock dates.
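As an illustration only, those audit columns map naturally onto a struct like the following; the field names are stand-ins, not the actual schema:

```go
package db

import "time"

// CampaignRun mirrors the audit-trail columns listed above so a
// published result can be regenerated from pinned inputs.
type CampaignRun struct {
	ID                 int64
	ConfigTOML         string            // full campaign config, verbatim
	MasterSeed         int64             // seed that generated the design set
	AdapterVersions    map[string]string // adapter -> commit SHA at campaign start
	TargetFleetVersion string            // commit SHA of target/DNS binaries
	StartedAt          time.Time
	EndedAt            time.Time
}
```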
Anyone reading a published comparison post should be able to clone the repo at the recorded SHAs, run the recorded campaign config with the recorded seed, and verify the numbers (modulo external-service flakiness on the day they re-run).
A new TOML schema parallel to `scenarios/`. Indicative shape:
id = "weekly-comparison-2026-q2"
description = "..."
duration = "24h" # campaign wall-clock cap; runner stops after this
seed = 42 # master seed; per-run seeds derive deterministically
[targets]
pool = ["bench-a", "bench-b", "probe-a"]
patterns = ["single", "two_random", "all"] # which host_pattern values to stratify on
[duration_buckets]
brief = { min = "30s", max = "2m" }
medium = { min = "2m", max = "10m" }
long = { min = "10m", max = "1h" }
[sampling]
samples_per_cell_default = 20
# High-discrimination tier: failure modes where service-to-service
# differences are expected to be small enough that n=20 won't reliably
# separate them. Extra samples buy tighter confidence intervals.
[[sampling.high_discrimination]]
failure_types = ["http_status", "tls_expired", "tls_expiring"]
samples_per_cell = 60
[[failure_types]]
type = "http_status"
status_code_choices = [503, 502, 504]
[[failure_types]]
type = "tcp_refused"
[[failure_types]]
type = "http_timeout"
phase_choices = ["ttfb", "body"]
delay_range = { min = "5s", max = "60s" }
[[failure_types]]
type = "http_redirect"
variant_choices = ["loop", "chain"]
[[failure_types]]
type = "http_body"
content_choices = ["empty", "error_page", "keyword_missing", "keyword_injected", "ransomware", "defacement", "malicious_script", "spam_links"]
keyword_choices = ["uptime-bench-canary", "HACKED", "BTC"]
[[failure_types]]
type = "tls_expired"
days_expired_choices = [1, 7, 30]
[[failure_types]]
type = "tls_expiring"
days_remaining_choices = [6, 13, 29]
[escalation]
probability = 0.20 # 20% of designs are multi-stage
stages_range = { min = 2, max = 3 }
inter_stage_range = { min = "30s", max = "5m" }
patterns = ["layered", "replacement", "recovery"]
[budget]
pingdom = { max_runs_per_hour = 10 }
uptimerobot = { max_runs_per_hour = 1 } # free-tier rate-limit
datadog-synthetics = { max_runs_per_hour = 30 }
better-uptime = { max_runs_per_hour = 5 }
jetmon-v1 = {} # unlimited (self-hosted)
[cooldown]
per_target_minimum = "10m" # don't hit the same target more often than thisA single run with multiple chained failures, not a sequence of separate runs. Examples the model needs to express:
- Layered: DNS slow at t=0 → HTTP 503 joins at t=2m. Both active until run end.
- Replacement: HTTP 503 at t=0 → escalates to TCP refused at t=2m. Stage 1 ends when stage 2 begins.
- Recovery test: failure at t=0..t=2m → silence until t=5m → second failure at t=5m..t=7m. Tests whether the monitor cleared the first incident before the second arrived.
The scenario format's `[[failures]]` blocks now support both `offset` and per-failure `duration`, so layered, replacement, and recovery patterns are representable without a new stage abstraction. The campaign generator accepts `[escalation].patterns = ["layered", "replacement", "recovery"]`; omitted `patterns` defaults to `["layered"]` for compatibility. Reports label multi-stage runs by escalation pattern and stage order. Remaining escalation work is deciding which pattern mix to use for published benchmark configs.
The campaign generator produces two artifacts deterministically from the master seed:
- Design set — K scenario designs. Each design is a fully specified (failure params, host set, escalation timing) artifact. Designs are written to `campaign_runs.designs` for audit.
- Schedule — for each design, a list of N replay times distributed across the campaign duration. The distribution isn't strictly random; it's quasi-random with constraints: no two replays of the same design fall in the same hour-of-day bucket, no two replays of any design overlap on the same target within `cooldown.per_target_minimum`, and replays are spread across weekday/weekend if the campaign spans both. A constraint-check sketch follows this list.
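A simplified validator for those two constraints, with illustrative types; the real generator constructs a satisfying schedule up front and rejects infeasible configs, whereas this sketch only checks a finished schedule:

```go
package campaign

import (
	"fmt"
	"sort"
	"time"
)

// Replay is one scheduled execution of a design on a target.
type Replay struct {
	DesignID string
	Target   string
	At       time.Time
}

// validateSchedule checks per-design hour-of-day diversity and
// per-target cooldown spacing, treating both as hard failures.
func validateSchedule(replays []Replay, perTargetMin time.Duration) error {
	hourSeen := map[string]map[int]bool{} // designID -> hour-of-day set
	byTarget := map[string][]time.Time{}
	for _, r := range replays {
		h := r.At.Hour()
		if hourSeen[r.DesignID] == nil {
			hourSeen[r.DesignID] = map[int]bool{}
		}
		if hourSeen[r.DesignID][h] {
			return fmt.Errorf("design %s has two replays in hour bucket %d", r.DesignID, h)
		}
		hourSeen[r.DesignID][h] = true
		byTarget[r.Target] = append(byTarget[r.Target], r.At)
	}
	for target, times := range byTarget {
		sort.Slice(times, func(i, j int) bool { return times[i].Before(times[j]) })
		for i := 1; i < len(times); i++ {
			if gap := times[i].Sub(times[i-1]); gap < perTargetMin {
				return fmt.Errorf("target %s replays %v apart (< %v)", target, gap, perTargetMin)
			}
		}
	}
	return nil
}
```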
Each replay invokes the existing single-scenario `Run()` once — campaigns are an orchestrator over many ordinary runs, not a new scenario shape. The per-replay `scenario_runs.parameters` records the design-id and replay-index for join-back.
Pure function: `(config, masterSeed) → ([]Design, []ReplayPlan)`. No I/O. Heavily unit-testable; a fixed seed yields a fixed design set and schedule. The generator runs once per campaign; the runner consumes its output.
For each design, generation proceeds as:
- Pick a cell `(failure_type, duration_bucket, host_pattern)` from the cell list (cells are enumerated, not randomly sampled — every cell gets its declared sample count).
- Pick a duration uniformly within the bucket's range.
- Pick failure-specific params (status code, phase, delay…) randomly within their declared choices.
- Pick a host set matching the host_pattern (single random target, two random targets, all targets, …).
- With `escalation.probability`, append additional stages on the same scenario using one of the configured escalation patterns.
- Assign a per-design seed derived from `masterSeed XOR designIndex` (see the seed sketch after this list).
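The seed rule is small enough to show directly; a sketch assuming `math/rand` is acceptable for design sampling (the repo's actual RNG choice may differ):

```go
package campaign

import "math/rand"

// designRNG applies the per-design seed rule from the list above: each
// design gets a deterministic generator derived from the master seed
// and its index, so regenerating the design set is reproducible.
func designRNG(masterSeed int64, designIndex int) *rand.Rand {
	return rand.New(rand.NewSource(masterSeed ^ int64(designIndex)))
}

// pickDuration shows the uniform-within-bucket rule: a duration drawn
// uniformly between the bucket's min and max (inclusive), in seconds.
func pickDuration(r *rand.Rand, minS, maxS int64) int64 {
	return minS + r.Int63n(maxS-minS+1)
}
```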
For each design's replays, schedule generation picks N times within the campaign duration that satisfy the distribution constraints above. The schedule is pinned at campaign start, not generated lazily, so the audit trail shows "this design was supposed to run at times T1…TN" even if the campaign was interrupted.
- New CLI flag: `-campaign=<config.toml>`, mutually exclusive with `-scenario=…`.
- Campaign loop: walk the schedule (pre-generated and time-sorted), run each replay via the existing `Run()`, advance to the next scheduled time. A loop sketch follows this list.
- Per-target cooldown enforced at scheduling time (not at execution): the schedule generator already respects `cooldown.per_target_minimum`.
- Failure isolation: one bad scenario (adapter error, target unreachable) records its `resolution_reason` and the campaign continues with the next scheduled replay. The bad row is kept, not retried.
- Persist a `campaign_runs` row at campaign start with the audit-trail columns above; update `ended_at` at campaign end.
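A sketch of the serial outer loop described above, with illustrative types; the real `runner.RunCampaign` also handles campaign-row bookkeeping and budget preflight:

```go
package runner

import (
	"log"
	"sort"
	"time"
)

// Replay is one scheduled invocation of an ordinary single-scenario run.
type Replay struct {
	DesignID string
	Target   string
	At       time.Time
}

// runCampaign walks the pre-generated, time-sorted schedule, runs each
// replay through the ordinary single-scenario path, and records
// failures without retrying them.
func runCampaign(schedule []Replay, runOne func(Replay) error) {
	sort.Slice(schedule, func(i, j int) bool {
		return schedule[i].At.Before(schedule[j].At)
	})
	for _, r := range schedule {
		if wait := time.Until(r.At); wait > 0 {
			time.Sleep(wait) // advance to the next scheduled time
		}
		if err := runOne(r); err != nil {
			// Failure isolation: the bad row keeps its resolution_reason
			// in the database; the campaign continues, no retry.
			log.Printf("replay %s on %s failed: %v", r.DesignID, r.Target, err)
		}
	}
}
```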
A new `cmd/uptime-bench-report` tool that aggregates `derived_metrics` for a campaign:
```
uptime-bench-report -campaign=weekly-comparison-2026-q2

# Bias self-checks (printed first):
# - sample counts per service (flagged if any deviation > 5%)
# - sample counts per failure/service cell, with missing cells counted as 0
# - any capability_mismatch counts or uncategorized Unknown rows

failure_type | service     | n  | tp_rate | tp_rate_ci95 | cap_mismatch | min_s | avg_s | p50_s | p50_ci95_s | p95_s | p95_ci95_s  | max_s
http_status  | pingdom     | 60 | 0.98    | 0.91-0.99    | 0            | 41.0  | 72.0  | 68.0  | 55.0-80.0  | 120.0 | 95.0-180.0  | 180.0
http_status  | uptimerobot | 60 | 0.96    | 0.88-0.99    | 0            | 62.0  | 98.0  | 95.0  | 80.0-120.0 | 145.0 | 110.0-220.0 | 220.0
...
```

Output formats: human-readable table (default), TSV, JSON. Backed by SQL queries joining `scenario_runs` ↔ `derived_metrics`, filtered on the campaign's run-id list.
Per `(failure_type, service)` statistics:
- Detection rate (`true_positive / (true_positive + false_negative)`, excluding `capability_mismatch`, `maintenance_suppressed`, `cooldown_suppressed`, `cooldown_uncertain`, and TLS advisory outcomes).
- Detection latency min/max/avg/p50/p95, with 95% confidence intervals for p50 and p95.
- False-positive rate.
- `capability_mismatch` count (separately surfaced; not folded into detection rate).
- Sample count (so readers can judge the meaning of the percentiles).
- ✅ Campaign config format + parser — `internal/campaign` package, parser + validator. Tests in `campaign_test.go`.
- ✅ Pure design + schedule generator — `(config, masterSeed) → (designs, schedule)` in `internal/campaign/generator.go`; deterministic, fixed-seed regression coverage in `generator_test.go` + `no_favoritism_test.go`.
- ✅ Schema migration for `campaign_runs` — `schema/003_campaign_runs.sql`; `campaign_id` FK on `scenario_runs`. `db.InsertCampaignRun` / `CloseCampaignRun` shipped.
- ✅ Runner outer loop (serial) — `runner.RunCampaign` walks `Plan.Schedule`, calls the existing `Run()` per replay via `WithCampaignRunID`, rejects unsupported multi-host patterns, and rejects generated schedules that exceed any active service's `budget.max_runs_per_hour` before inserting a campaign row or calling vendor APIs. Per-replay errors don't abort the campaign. Tests in `internal/runner/campaign_test.go`. `cmd/harness` accepts `-campaign=<config.toml>` as a mutually exclusive alternative to `-scenario`; campaign mode runs every enabled service from `services.toml`. Metrics are derived in one batch at campaign end via `measurement.DeriveCampaign`, keyed by `scenario_runs.campaign_id`.
- ✅ Initial `cmd/uptime-bench-report` — campaign metrics can be summarized from `derived_metrics` into table / TSV / JSON output. Current scope: per-`(failure_type, service)` samples, detection rate, TP/FN/FP/Unknown/maintenance/cooldown/TLS advisory counts, and latency min/avg/p50/p95/max.
- ✅ Full report statistics — table/JSON reports now include bias self-checks, Wilson 95% detection-rate intervals, deterministic nearest-rank percentile intervals for p50/p95, and explicit `capability_mismatch` counts from `monitor_reports.reason_code`. TSV stays row-only for scripts but includes the additional columns.
- Escalation support — per-failure `duration` overrides and generator pattern sampling now cover layered, replacement, and recovery representations. Reports use campaign replay metadata to label multi-stage shapes by pattern and stage order. `configs/campaign/example.toml` carries a runner-safe starter mix for single-target campaigns; remaining work is settling the exact pattern mix and sample depth for published benchmark configs after live data confirms the noise floor.
Each phase is independently mergeable. Phases 1–5 deliver the "campaigns work, no escalation" milestone — that alone produces useful comparison data.
- Budget interplay with vendor cooldowns: the generator now enforces `cooldown.per_target_minimum` in the replay schedule and rejects infeasible schedules. `RunCampaign` also preflights the generated schedule against each active service's `budget.max_runs_per_hour`; a violation fails before a `campaign_runs` row is inserted or vendor APIs are called. Vendor-side alert cooldowns can still matter when an adapter cannot guarantee clean state. Campaign replays require `SupportsCooldownReset = true`; adapters where it's false get gated as `capability_mismatch` for the campaign's runs. If a retrieve still carries cooldown metadata, measurement emits `cooldown_suppressed` or `cooldown_uncertain` rather than `false_negative`. Currently all probe-based adapters with Phase B set this true (delete/recreate cycles state); Jetmon v1 needs bridge work to do the same.
- Concurrent and multi-host execution: a 1,000-run campaign at ~8 minutes per scenario is ~133 sequential hours. Campaigns must run scenarios concurrently across non-overlapping (target, service) pairs, and multi-host cells need a scenario representation that can express more than one target. The runner currently runs one single-target scenario at a time end-to-end and now rejects multi-host campaign patterns before execution. Concurrent/multi-host campaign mode is an explicit extension. Open question for the design pass: where the parallelism axis lives (per-target, per-service, per-(target, service) pair). Single-scenario mode remains serial.
- Reproducibility under randomness: every campaign records its master seed and config in `campaign_runs`. Re-running with the same seed against the same fleet + adapter versions produces the same design set and schedule. The project's existing reproducibility invariant scales to campaigns.
- Per-stage keyword config (deferred): the scenario format carries one `Keyword` + `KeywordCheck` pair per run. An escalation that mixes a `keyword_injected` stage with a non-injected http_body stage (`ransomware`, `defacement`, `keyword_missing`, …) collapses to `KeywordCheck="absent"` with the injected keyword, silencing the canary-missing signal the non-injected stage was meant to measure. `applyHTTPBodyDefaults` documents this. Instrumentation is in place: campaign replays write `campaign_mixed_content_escalation` into `scenario_runs.parameters`, and `RunCampaign` logs the count of generated designs with this shape at campaign start. The fix is per-failure `Keyword` / `KeywordCheck` fields and adapter rework to switch keyword config mid-run — most adapters configure once at Provision and can't. Only schedule that schema change if real campaign data shows mixed escalations are >5% of designs in practice.
When campaign data is published, the methodology section must include, at minimum:
- The full campaign config (TOML) and master seed.
- Per-cell sample counts (showing where high-discrimination depth was applied and where it wasn't).
- Adapter and target-fleet commit SHAs.
- Total wall-clock duration and any campaign interruptions.
- Confidence intervals on all reported percentiles.
- The full count of `capability_mismatch`, `adapter_error`, `maintenance_suppressed`, `cooldown_suppressed`, `cooldown_uncertain`, `tls_advisory_detected`, `tls_advisory_missed`, and `tls_advisory_false_outage` outcomes per service — these are part of the data, not filtered out.
- Both sample-weighted service scores and scenario/category-normalized service scores, so uneven scenario mixes do not hide a provider weakness behind a high-volume easy category.
- The percentile method used. `cmd/uptime-bench-report` uses the nearest-rank convention (`idx = ⌈p·N⌉ − 1` on the sorted sample; NIST / Wikipedia "C = 1"). This differs from R's default `quantile()` (type 7, linear interpolation) and numpy's default `percentile()`, which produce slightly different numbers for the same data. Nearest-rank always returns an observed sample value — the published p95 is a number that actually occurred in the campaign — but skeptics recomputing with a different method will see ±1-bucket drift. (A sketch of the convention follows this list.)
cmd/uptime-bench-reportaccepts either a concretecampaign_runs.id(one run) or a stablecampaign_idfrom the campaign TOML (every matching run aggregated). The report header line discloses which interpretation matched and how many runs were folded together — quote that line in any published post so readers know the numbers span N runs, not 1. - The exact finalization command used. Re-running finalization should recompute derived metrics and refresh
report.md, machine-readable JSON, and the report manifest without hand-editing output files.
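The nearest-rank sketch referenced from the percentile bullet; `nearestRank` is an illustrative helper, not the report tool's actual code:

```go
package sketch

import (
	"math"
	"sort"
)

// nearestRank returns the p-th percentile (0 < p <= 1) using the
// convention documented above: idx = ceil(p*N) - 1 on the sorted
// sample, so the result is always an observed value.
func nearestRank(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...) // don't mutate the caller's slice
	sort.Float64s(sorted)
	idx := int(math.Ceil(p*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}
```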
The methodology choices (which failure types are high-discrimination, what the sample-count target is) are explicit human judgments and should be argued for in any published post — different operators will care about different failures, and a transparent methodology lets them re-weight from the raw data if their concerns differ.
Status: Implemented. Design draft at `inter-run-state-design.md` (2026-04-26); scenario parsing, runner capability gating, adapter `ProvisionConfig.MaintenanceWindow`, vendor-side maintenance APIs, and `maintenance_suppressed` metric classification are in place.
Monitors commonly support scheduled maintenance windows during which alerts are suppressed. Testing whether a monitor correctly silences alerts during a declared window is a meaningful accuracy dimension — a monitor that still alerts during maintenance produces false positives; a monitor that never alerts afterward may have also cleared state it shouldn't have.
Implemented pieces:
- `ProvisionConfig.MaintenanceWindow` schedules absolute vendor-side suppression windows.
- Runner gates scenarios with `[maintenance]` against `SupportsMaintenanceWindows`; unsupported adapters produce `reason_code = "capability_mismatch"` and are not provisioned.
- The measurement model emits `maintenance_suppressed` when a failure is active, the adapter returned Known with no reports, and the maintenance window covered at least 80% of the failure period (see the sketch after this list). Correct behavior, not a false negative.
- Scenario TOML supports a `[maintenance]` block with relative `start_offset` and `duration`. Natural patterns:
  - Overlapping: window covers the failure entirely (tests "alerts suppressed during maintenance").
  - Edge: window ends midway through the failure (tests "alerts fire as soon as the window closes, even though the failure was already active").
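A sketch of the 80% coverage test behind `maintenance_suppressed`, with hypothetical helper names:

```go
package sketch

import "time"

// coverage returns the fraction of the failure period that the
// maintenance window overlaps.
func coverage(failStart, failEnd, winStart, winEnd time.Time) float64 {
	start, end := failStart, failEnd
	if winStart.After(start) {
		start = winStart
	}
	if winEnd.Before(end) {
		end = winEnd
	}
	if !end.After(start) || !failEnd.After(failStart) {
		return 0 // no overlap, or degenerate failure window
	}
	return float64(end.Sub(start)) / float64(failEnd.Sub(failStart))
}

// maintenanceSuppressed applies the documented rule: an active failure,
// a Known-with-no-reports retrieve, and >= 80% window coverage.
func maintenanceSuppressed(cov float64, knownWithNoReports bool) bool {
	return knownWithNoReports && cov >= 0.80
}
```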
Per-vendor implementation status:
- Pingdom — creates a one-shot maintenance window with the check attached; unit-covered.
- UptimeRobot — creates and attaches a maintenance window; unit-covered.
- Datadog Synthetics — creates downtime scoped to the synthetic monitor id; unit-covered.
- Better Uptime — patches monitor pause/maintenance fields for the requested window; unit-covered.
- Jetmon v2 — patches `maintenance_start` / `maintenance_end`; unit-covered and live API-contract covered.
- Jetmon v1 — still no first-class bridge/API support; maintenance scenarios gate as `capability_mismatch`.
Remaining follow-up: run true live fail-during-maintenance scenarios against each vendor to verify suppression behavior, not just API request shape and metric classification.
Status: Active validation item. Implementation and unit/API-shape coverage are in place; true live fail-during-maintenance behavior still needs to be observed.
After the Jetmon v2 baseline scenario smoke passes, run targeted maintenance-window scenarios against Jetmon v2 first, then broaden to vendors that claim `SupportsMaintenanceWindows`. The validation should cover at least one fully covered maintenance window and one edge case where the maintenance window ends while the failure is still active.
Acceptance:
- Suppressed alerts during fully covered maintenance windows become `maintenance_suppressed`, not false negatives.
- Failures that continue after the maintenance window closes still produce alerts when the service supports that behavior.
- Unsupported services remain explicitly gated as `capability_mismatch`.
Status: Implemented for adapters that can guarantee clean state; Jetmon v1 live validation is blocked before cooldown can be tested. Design draft at `inter-run-state-design.md` (2026-04-26). Capability flags are wired, delete/recreate adapters claim reset support, Jetmon v2 disables alert cooldown at provision time, Jetmon v1 write mode claims reset support through the bridge's POST/DELETE state-reset semantics, campaign scheduling enforces per-target spacing, campaign replays gate adapters without `SupportsCooldownReset`, and the measurement engine emits `cooldown_suppressed` / `cooldown_uncertain` outcomes from retrieve metadata.
Most monitors suppress repeated alerts for the same site within a cooldown window (commonly 30 minutes). When uptime-bench runs multiple consecutive scenarios against the same provisioned monitor, the second run's alert may be suppressed by the cooldown from the first — producing a result that looks like a missed detection but is actually the monitor working correctly.
The correct approach in `Deprovision`: if the service's API supports resetting alert state or the cooldown clock, do so. Otherwise, delete and recreate the monitor (accepting the cost of full reprovisioning).
If the API supports neither, record the cooldown state at the time of `Retrieve` and include it in `MonitorReport.Metadata`. The measurement engine then classifies the suppressed result separately rather than as a false negative.
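A sketch of that fallback classification, assuming a hypothetical `cooldown_active` metadata key; real adapters may record richer cooldown state:

```go
package sketch

// classifyQuietRun decides what a run with no alert means: a genuine
// miss, correct cooldown suppression, or an ambiguous case, based on
// whatever cooldown evidence the adapter recorded at retrieve time.
func classifyQuietRun(meta map[string]any) string {
	v, ok := meta["cooldown_active"] // hypothetical key in MonitorReport.Metadata
	if !ok {
		return "false_negative" // no cooldown evidence: a genuine miss
	}
	if active, isBool := v.(bool); isBool && active {
		return "cooldown_suppressed" // the monitor behaving correctly
	}
	return "cooldown_uncertain" // metadata present but ambiguous
}
```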
Current implementation status:
- Pingdom — delete/recreate cycles check state; `SupportsCooldownReset = true`.
- UptimeRobot — delete/recreate cycles monitor state; `SupportsCooldownReset = true`.
- Datadog Synthetics — delete/recreate cycles the synthetic test and attached monitor state; `SupportsCooldownReset = true`.
- Better Uptime — delete/recreate cycles monitor and incident state; `SupportsCooldownReset = true`.
- Jetmon v2 — adapter provisions sites with `alert_cooldown_minutes = 0`; `SupportsCooldownReset = true`.
- Jetmon v1 — write-mode bridge runs reset `site_status` / `last_status_change` on POST and DELETE, so `SupportsCooldownReset = true` only when `auth.write_mode = "true"`; read-only bridge runs still gate as `capability_mismatch`.
Each adapter must document how it handles this in its implementation notes.
Live validation attempt on 2026-04-28:
- Deployed the current harness binary to the harness host so `-monitors=jetmon-v1` can target the Jetmon v1 adapter alone.
- Applied the append-only `003_campaign_runs.sql` migration to the deployed harness database; `002_reason_code.sql` was already present.
- Confirmed the Jetmon v1 bridge is reachable from the harness host and accepts the configured bearer token.
- First `http-503` run failed before cooldown could be tested because Jetmon v1 bridge write mode returned `500` from `POST /monitors`; run `6e737b679e6052851a34ea3ade1ad2a9` closed as `adapter_error`.
- The target failure deactivated normally and the target control registry was clean afterward.
Remaining follow-up: fix or redeploy the Jetmon v1 bridge write-mode `POST /monitors` path, then run a back-to-back Jetmon v1 write-mode scenario pair to confirm reset behavior with the real worker loop.
Status: MVP implemented. `cmd/probe-ips-refresh` fetches public vendor probe lists, normalizes IPs/CIDRs, and emits a reviewable TOML fragment. `make refresh-probe-ips` runs the tool locally, and a weekly GitHub Action opens a PR with the latest generated fragment. Manual review is still required for vendor feeds that lack stable regional metadata.
`services.toml`'s `[services.probe_ranges]` blocks list per-region CIDRs for each vendor's published probe pool. Vendors update these lists periodically. Each vendor publishes (or doesn't) in machine-readable form:
| Service | Source | Format | Region tags |
|---|---|---|---|
| Pingdom | https://my.pingdom.com/probes/ipv4 | Plain text currently; parser also tolerates JSON | No in current feed |
| UptimeRobot | https://uptimerobot.com/inc/files/ips/IPv4andIPv6.txt | Plain text | No |
| Datadog Synthetics | https://ip-ranges.datadoghq.com/ | JSON, `synthetics` key | Yes |
| Better Uptime | https://betterstack.com/docs/uptime/frequently-asked-questions/ | HTML | Partially (in prose) |
Build: the `cmd/probe-ips-refresh` Go tool fetches each list, normalizes regions where possible, and emits a TOML fragment to stdout. The operator pipes it to a file, diffs against the current `services.toml`, and applies changes by hand.
Why not auto-write `services.toml`: vendor changes can include unexpected region renames, IPv6-only additions, or removals that should be noticed, not silently merged. Manual review is the safety check.
Region mapping:
- Datadog tags each IP/CIDR with a provider location; the tool folds those into coarse uptime-bench regions such as `us-east`, `eu-west`, and `ap-sea`.
- Pingdom's current public IPv4 feed is untagged plain text. Decision: keep Pingdom under `global` unless Pingdom publishes stable region tags or an operator-maintained map with verified provenance exists; the tool emits a warning so operators can keep the all-probe pool fresh without pretending it is region-specific.
- UptimeRobot doesn't tag regions. Maintain a hand-edited `internal/probeips/uptimerobot_regions.json` that maps IP prefixes to regions; the tool warns when a new IP doesn't fall in any known prefix. Keep this file in the repo so updates are visible in PRs.
- Better Uptime publishes IPs in HTML prose with broad region annotations; the tool parses the current FAQ page best-effort and warns that mappings need review.
Remaining: seed the UptimeRobot region map with verified prefixes. Pingdom stays global until an official tagged feed or a verified curated map exists.
Status: Lower-priority expansion backlog after Uptime Kuma and Gatus. These are useful comparison points, but each needs either heavier infrastructure or a less direct adapter model.
Recommended order:
- Prometheus + blackbox_exporter + Alertmanager — best self-hosted reference baseline for probe-level behavior. Supports HTTP/HTTPS, DNS, TCP, ICMP, and gRPC probes with detailed timing, TLS, and cert-expiry metrics. Treat this as an observability-stack baseline rather than a product-style uptime monitor, and pin the rule/alerting configuration in report metadata.
- Zabbix — mature self-hosted monitoring with web scenarios, response-code checks, response-time data, string checks, triggers, and history. Valuable because it is widely deployed, but adapter work is heavier.
- Monika — CLI/config-driven synthetic monitoring with HTTP/TCP probes and flexible assertions over status, body, headers, timing, and size. Likely needs webhook or log ingestion for clean event retrieval.
- Statping-ng — lightweight uptime/status-page monitor with a REST API and HTTP/TCP/UDP/ICMP/gRPC coverage. Useful, but smaller ecosystem and lower priority than Uptime Kuma/Gatus.
Deferred for different scenario lanes:
- Upptime — useful zero-infrastructure/GitHub Actions monitor, but the normal five-minute cadence is a poor fit for minute-level detection comparisons. Consider later as a distinct "free/CI-backed monitor" category.
- Healthchecks.io self-hosted — reverse heartbeat/dead-man-switch semantics belong with future heartbeat and agent-based reverse-check scenarios, not the current probe-based uptime matrix.
Status: Initial observability path, generated target DNS support, guarded Jetmon lifecycle automation, and scenario-run capacity report artifacts are implemented on trunk.
The first capacity track compares Jetmon v1 and Jetmon v2 as active monitor count grows. It is intentionally separate from scenario accuracy campaigns: scenario runs answer whether monitors detect controlled failures, while capacity runs answer how resource use, check timeliness, lifecycle throughput, and service health scale with batch size.
Implemented:
- `cmd/uptime-bench-capacity` summarizes Prometheus range windows for `jetmon-v1.example.com` and `jetmon-v2.example.com`.
- `cmd/uptime-bench-dockerstats-exporter` exposes Docker API container stats as Prometheus metrics for hosts where cAdvisor cannot identify Docker 29 overlayfs / containerd-snapshotter writable layers.
- The exporter is deployed on both Jetmon hosts at `203.0.113.170:9103` and `203.0.113.171:9103`.
- `fleet.toml` supports `[[targets.generated_sites]]` ranges so DNS can resolve million-scale synthetic hostnames without expanding all hosts into the zone map (a sketch of on-demand expansion follows this list).
- `cmd/uptime-bench-targetload` can probe generated host ranges against DNS and HTTP before those hosts are loaded into Jetmon, with Markdown output for saving HTTP-only and DNS-path target capacity reports.
- `cmd/uptime-bench-jetmon-capacity-run` can execute guarded Jetmon v1/v2 capacity lifecycle batches from a private fleet config.
- Capacity `run-suite` invocations persist the last completed, last clean, and first problem batches and resume from the last clean batch by default; `-full-suite` restores a complete first-batch-to-last-batch pass, while `-batch-sizes`, `-duration`, and `-cooldown` support quick scout passes.
- `make capacity-jetmon-scout` provides the standard 1k/5k/10k Jetmon capacity ladder so operators can run the next scalability gate without reconstructing the command by hand.
- Capacity `run-suite` directories include `capacity.md` and `capacity.json` rollups for batch pass/fail status, DB health, missed-check threshold status, freshness throughput margin, thresholds, Prometheus highlights, last clean batch, and first problem batch.
- Capacity live batches validate exact activated target URL samples before the timed window starts, so a generated DNS/URL pattern mismatch fails as target setup rather than being misread as Jetmon missed checks.
- `uptime-bench-finalize -capacity` writes `capacity.md` and `capacity.json` alongside `report.md` / `report.json`, using the finalized campaign window as the Prometheus query range.
- `docs/capacity-benchmark.md` records the test shape, stop thresholds, target direction, and bulk lifecycle approach.
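A sketch of how on-demand expansion of a generated-sites range can avoid a materialized zone map; the `GeneratedSites` field names are assumptions about the `fleet.toml` shape, not its actual schema:

```go
package sketch

import "fmt"

// GeneratedSites is an assumed shape for a [[targets.generated_sites]] range.
type GeneratedSites struct {
	Prefix string // e.g. "site"
	Domain string // e.g. "targets.example.com"
	Count  int    // e.g. 1_000_000
}

// hostname computes the i-th synthetic hostname on demand, so the DNS
// server can answer by pattern match instead of holding a million-entry
// zone map in memory.
func (g GeneratedSites) hostname(i int) string {
	return fmt.Sprintf("%s-%06d.%s", g.Prefix, i, g.Domain)
}
```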
Remaining follow-up:
- Stress test target-side DNS and HTTP capacity before monitor-side million-site runs.
- Extend target preflight to run exact activated URL checks from each Jetmon service host and Veriflier host, not only from the capacity runner host.
- Use the guarded Jetmon lifecycle runner for staged active-monitor growth suites and compare suite `capacity.md` findings against detection behavior.
Status: Runner support done; measurement engine matching rule pending. Punted — current "earliest active failure" rule produces sound results for the simultaneous-failure scenarios that exist today.
The `offset` field is honored by the runner: failures activate at `scenario_start + offset` and run for `duration` from that point (`internal/runner/runner.go:scheduleFailureEvents`). Ground-truth events emit independent `failure_start` / `failure_end` pairs per failure with the correct timestamps.
Still to design (when staggered scenarios become a focus):
- Measurement engine: detection latency is calculated against the first failure window an alert falls inside (`internal/measurement/measurement.go`). When failures are staggered and overlapping, "which failure did the monitor respond to?" matters for accurate latency attribution. Today's "earliest active failure" rule loses signal when multiple layers fail together (e.g., DNS at t=0, HTTP at t=30, alert at t=45 — was the monitor responding to DNS or HTTP?).
- The matching rule should probably be: the failure whose normalized classification best matches the monitor's reported classification, falling back to earliest-active when classification doesn't disambiguate. Spec it before implementing; a sketch follows.
Status: Partially implemented. `http_method_status` covers the two high-priority HEAD/GET status mismatches: HEAD failure with healthy GET, and healthy HEAD with GET failure. Measurement now scores `http_method_status` `method="HEAD"` as a healthy-GET false-down trap rather than as a missed outage when no alert fires, while `method="GET"` remains a visitor-visible outage. The target also honors optional `method = "GET"` / `"HEAD"` predicates for `http_redirect`, `http_timeout`, `http_partial`, and `http_body`, with shipped scenarios for GET-only redirect loops, GET-only truncated bodies, GET-only TTFB stalls, and HEAD-only TTFB stalls. One custom-header status variant is implemented; broader user-agent, accept-header, auth, and body-shape divergence remains deferred until the method cases produce more benchmark data.
These are expected Jetmon-v1 pitfalls if it relies on shallow HEAD/status checks, and they should become Jetmon-v2 regression cases if v2 probes the user-visible GET path:
- Method-scoped redirects: HEAD returns 200 while GET enters a redirect loop. Inverse cases such as GET healthy but HEAD redirected/challenged, wrong-host redirects, and HTTPS downgrade redirects remain future variants.
- Method-scoped latency and truncation: HEAD returns quickly with 200 while GET stalls before first byte or closes mid-response; the inverse HEAD-stalls/GET-healthy case is also represented for TTFB stalls. Body-phase stalls remain future variants.
- Request-header divergence: the origin, WAF, cache, or bot protection serves different status/content for monitor-specific `User-Agent`, `Accept`, `Accept-Language`, auth headers, or missing browser-like headers. The checked-in `http-header-status` scenario covers a deterministic custom-header status failure; richer request-shape divergence can create either false-up or false-down results depending on which headers the monitor actually sends.
Implementation shape:
- Extend the target failure matcher beyond `(type, host, path)` to include optional request predicates. `method` and one custom-header status path are implemented; user-agent substring and richer header predicates remain deferred (a matcher sketch follows this list).
- Add method/header-scoped variants for `http_redirect`, `http_timeout`, `http_partial`, and selected `http_body` scenarios once the matcher can express them cleanly. Method-scoped target support exists now; remaining work is adding richer header-scoped variants and deciding which method/body combinations deserve campaign weight.
- Keep the existing content scenarios as the baseline for "GET body is bad while HEAD/status looks fine"; those already cover ransomware, defacement, malicious script, SEO spam, keyword missing, and keyword injection.
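A matcher sketch under assumed field names; `Method` and `Headers` mirror the shipped behavior, while `UserAgentSubstr` shows the deferred predicate shape:

```go
package sketch

import (
	"net/http"
	"strings"
)

// FailureMatch extends the (type, host, path) key with optional
// request predicates. Empty fields mean "match any".
type FailureMatch struct {
	Host, Path      string
	Method          string            // "" = any; "GET"/"HEAD" implemented
	Headers         map[string]string // custom-header status path
	UserAgentSubstr string            // deferred predicate, shown for shape
}

func (m FailureMatch) matches(r *http.Request) bool {
	if r.Host != m.Host || r.URL.Path != m.Path {
		return false
	}
	if m.Method != "" && r.Method != m.Method {
		return false
	}
	for k, v := range m.Headers {
		if r.Header.Get(k) != v {
			return false
		}
	}
	if m.UserAgentSubstr != "" && !strings.Contains(r.UserAgent(), m.UserAgentSubstr) {
		return false
	}
	return true
}
```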
Status: Not implemented. Not key right now — documented for future reference.
Some vendors don't make their probe IP list easily discoverable: buried in support docs, gated behind login, or not published at all. The benchmark depends on knowing these IPs to inject geographic failures (`http-geo-503` only fails for matching source IPs). For vendors that don't publish, options are:
- Sniff probe IPs by logging connections to `:80`/`:443` over a long observation window and clustering by frequency.
- Skip geo scenarios for that vendor and record the limitation.
- Lobby the vendor for a published list via support.
Worth a tracking note in case a target vendor goes dark on this; the Probe IP CIDR refresh tool entry above assumes vendors continue to publish.
Status: Not implemented. Punted.
Some monitors (including Jetmon) record per-component timing breakdowns: DNS resolution time, TCP connection time, TLS handshake time, time to first byte. This data exists in their APIs but is not retrieved by the adapter and not stored in `monitor_reports`.
What it would enable:
- Verifying that `dns_latency` failures increase DNS resolution time specifically, not TCP or TTFB.
- Verifying that `http_timeout phase=ttfb` failures appear in the TTFB component, not the DNS component.
- Layer-level attribution accuracy: does the monitor correctly identify which layer is slow?
What it would need:
- `MonitorReport.Metadata` already exists as `map[string]any`. Adapters that retrieve timing breakdowns should populate `dns_ms`, `tcp_ms`, `tls_ms`, and `ttfb_ms` keys (sketched below).
- A new metric in the measurement engine: `timing_layer_attribution`.
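A sketch of that metadata population, with a stand-in report type; the key names come from the list above:

```go
package sketch

// MonitorReport stands in for the real report type; Metadata already
// exists as map[string]any in the adapter contract.
type MonitorReport struct {
	Metadata map[string]any
}

// withTimings records per-component timings under the agreed keys so
// the measurement engine could attribute latency to a specific layer.
func withTimings(r *MonitorReport, dnsMS, tcpMS, tlsMS, ttfbMS float64) {
	if r.Metadata == nil {
		r.Metadata = map[string]any{}
	}
	r.Metadata["dns_ms"] = dnsMS
	r.Metadata["tcp_ms"] = tcpMS
	r.Metadata["tls_ms"] = tlsMS
	r.Metadata["ttfb_ms"] = ttfbMS
}
```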
Status: Not implemented. Punted.
Some monitors track a site's expected redirect chain and alert when it changes — distinct from alerting on broken redirects (loops, excessive hops). Real failure mode: a site that normally redirects HTTP → HTTPS suddenly redirects to a different domain (compromised DNS, misconfiguration).
Simulating this requires:
- A per-site "normal redirect" configuration in `fleet.toml` (the path that redirects, and where it redirects to in the healthy state).
- A new failure type `http_redirect_change` with a `to` field specifying the altered destination.
- The target serving the configured redirect during healthy operation and the changed redirect during the failure window.
- Adapter provisioning with redirect-tracking enabled, since most monitors require explicit opt-in.
Status: Not implemented. `monitor_kind = "heartbeat"` is reserved in the scenario schema; provisioning and target-side sender support are deferred until a first adapter implementation is selected.
Heartbeat monitoring (dead-man's switch) and agent-based checks (wp-cron, scheduled task monitoring) require the monitored system to actively send signals to the monitor, rather than the monitor probing the site.
uptime-bench's target fleet is currently passive — it responds to probes. Simulating heartbeat failure requires:
- A heartbeat sender process on the target server that pings a monitor's ingest endpoint on a schedule (a sender sketch follows this list).
- A control command (`heartbeat_stopped`) that pauses the sender for the failure window.
- Adapter support to provision a heartbeat monitor (endpoint URL, expected interval).
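A shape sketch of such a sender; the ingest URL and pause wiring are assumptions, since no sender ships today:

```go
package sketch

import (
	"net/http"
	"time"
)

// heartbeatSender pings the monitor's ingest endpoint on a fixed
// interval; the paused channel models a heartbeat_stopped control
// that silences the sender for the failure window.
func heartbeatSender(ingestURL string, interval time.Duration, paused <-chan bool) {
	stopped := false
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case stopped = <-paused: // control plane flips the failure window
		case <-ticker.C:
			if stopped {
				continue // deliberately miss the ping: the failure under test
			}
			if resp, err := http.Get(ingestURL); err == nil {
				resp.Body.Close()
			}
		}
	}
}
```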
This architectural extension should be designed when the first monitor service ships heartbeat support. The control API and scenario schema are designed to accommodate new failure types without breaking changes, and the monitor-kind field can now route such scenarios away from HTTP-only adapters.
Status: Research-grade stretch goals. Not scheduled. Some may be impossible to run safely without sacrificial domains, owned network resources, commercial provider partnerships, or a dedicated isolated lab.
The current benchmark mainly controls the target, authoritative DNS, TLS certificates, and HTTP behavior. That already covers a large set of real outages, but it does not fully model failures where the internet control plane, registrar, edge provider, transit path, or monitoring provider control plane fails around an otherwise healthy target.
These tests are valuable because they answer a different question: "Does the monitoring service understand this external dependency or global failure mode, or does it only notice the final HTTP symptom?" They should remain separate capability axes. A provider should not be scored as a false negative for a registrar, BGP, browser, CDN, or heartbeat scenario unless the adapter explicitly provisions that capability and the service claims to support it.
Candidate stretch tracks:
- Registrar and registry failures: domain expiration, domain approaching expiration, `clientHold` / `serverHold`, registrar parking pages, registrar lock changes, RDAP/WHOIS lookup failures, and registry-side status drift. These require sacrificial domains or registrar/API control and should never risk production-like domains.
- Parent-zone delegation failures: parent NS records diverge from child-zone NS records, glue records are missing or wrong, DS records are broken, or all parent-delegated nameservers become unreachable. This tests delegation awareness rather than ordinary authoritative DNS behavior. It likely needs real delegated test domains or a controlled resolver/registry simulation.
- DNSSEC failures: bogus signatures, expired RRSIGs, broken DS/DNSKEY chains, NSEC/NSEC3 edge cases, and validation failures that only DNSSEC-validating resolvers see. This requires DNSSEC-capable authoritative infrastructure and careful separation between native DNS monitors and HTTP monitors that only observe downstream lookup failure.
- IPv6-specific reachability: AAAA exists but IPv6 is unreachable while IPv4 works, IPv6 TLS differs from IPv4 TLS, or IPv6 latency is pathological. This is more achievable than BGP work, but it requires an IPv6-capable fleet and report metadata that records which address family a monitor used.
- Network path and ASN partitions: failures scoped to one monitor region, one cloud provider, one ASN, one country, or one known probe CIDR group. The practical version is target-side firewall filtering using verified probe IP metadata. The risky version is real route manipulation.
- BGP and transit failures: destination prefix withdrawal, route leaks, blackholes, route flapping, upstream transit provider outage, or nullrouting by a major ISP. These are high-value but likely require an owned prefix/ASN, a network lab, or provider partnerships. They should not be attempted on shared production networks.
- CDN and edge-provider failures: origin down while CDN serves stale 200s, CDN-branded 52x responses, one edge POP serving stale or poisoned content, cache-key bugs exposing another tenant/user view, or edge-only TLS/cert mismatch. This probably needs either a real CDN sandbox or a benchmark-owned "edge simulator" in front of the target.
- WAF, bot-protection, and reputation failures: monitor probes get 403, 429, or JavaScript challenges while ordinary browser traffic succeeds; monitor-specific user agents are blocked; or a provider's probe IPs are reputation-blocked. This builds on request-header divergence but needs reliable probe IP/source metadata and more request-shape controls.
- Multi-region quorum and disagreement: one region fails, a minority of regions fail, all regions fail, or regions disagree on content. This should measure whether a provider alerts on one failed probe, a quorum, or all probes. It needs adapter-level location selection and report metadata for probe location/quorum policy.
- Browser/runtime correctness: HTML returns 200 but a SPA fails to hydrate, JavaScript throws, client-side routing breaks, Core Web Vitals regress, or a critical third-party browser dependency fails. These belong in a separate browser-check track because their cost, flakiness, and semantics differ from single-probe uptime checks.
- Data consistency and stale-state failures: wrong vhost/tenant served, stale content persists after origin update, logged-in content is cached for anonymous users, or region-localized content is served to the wrong audience. These require baseline learning or paired comparisons rather than a single response assertion.
- Monitoring provider control-plane failure: monitor creation succeeds but result retrieval is delayed, provider APIs rate-limit or timeout, incident logs are incomplete, alert state is stale, or deprovision fails. This is not a target outage, but it is a real reliability dimension for the benchmark; results should be classified as provider reliability data, not detection failures.
Possible implementation strategy:
- Start with the least dangerous stretch tracks: IPv6-specific failures, regional/ASN firewall partitions, and CDN/edge simulation. These can mostly stay within benchmark-owned infrastructure.
- Add schema/report support before live tests: capability flags, `monitor_kind` values where needed, address-family metadata, probe location/quorum metadata, and separate provider-control-plane metrics.
- Treat registrar, parent-zone, DNSSEC, BGP, and real transit-provider tests as research projects. Each needs a written safety plan, rollback plan, and explicit non-production test assets before implementation.
- Keep all stretch-track results out of the ordinary uptime accuracy denominator unless the service capability is explicitly configured and verified.
Status: Not implemented. Deferred while the file-shipped `fleet.toml` model is sufficient.
The current model has every member that needs fleet topology — DNS for zones, certmint for DNS control URLs, and (in Phase B) targets for the certmint URL — read a copy of `fleet.toml` from `/etc/uptime-bench/`. When the operator changes topology, they edit the file on the harness and redeploy it to every member that consumes it. Same friction as today's DNS deployment.
The desired endpoint is a harness-mediated control API: the harness reads `fleet.toml`, exposes a small read-only HTTP endpoint (e.g. `GET /fleet/topology`), and every other fleet member polls it on a schedule. Operators edit `fleet.toml` once on the harness, restart the harness, and the rest of the fleet picks up the change without any further action. New hosts joining the fleet need only know the harness's URL and the shared control token — they don't need a synchronized copy of `fleet.toml`.
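A sketch of that endpoint, with `requireBearer` standing in for the existing bearer-token middleware and an undecided payload type:

```go
package sketch

import (
	"encoding/json"
	"net/http"
)

// requireBearer rejects requests without the shared control token.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// topologyHandler serves the role-relevant subset of fleet.toml as
// JSON; the payload shape is undecided, so any marshalable value
// stands in here.
// Wiring: mux.Handle("/fleet/topology", requireBearer(token, topologyHandler(topo)))
func topologyHandler(topology any) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(topology)
	})
}
```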
What's needed:
- An inbound HTTP server on the harness, behind the existing bearer-token middleware. The harness has none today; it's currently a pure client of the control plane.
- A polling client baked into target, dns, and certmint binaries, with cached last-known-good fallback so a brief harness outage doesn't take down dependents.
- Schema for the topology payload (subset of fleet.toml relevant to each role).
- A migration path: each role keeps the file-based fallback, gains the polling client, and the systemd units pass the harness URL via env when the operator opts in.
Why deferred: the file-shipped model does the job at our scale (single-digit fleet size, low rate of topology change). Centralizing wins clearly when (a) fleets get bigger, (b) topology changes more often, or (c) auto-recovery / dynamic fleet membership becomes a real requirement. Until then, `make deploy-dns; make deploy-certmint` after a `fleet.toml` edit is acceptable friction. Worth pulling forward if any of those three pressures materialize.