All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project follows Semantic Versioning.
- Added Agent Browser Operator OS as a self-serve support route in README, CLI support output, package project URLs, GitHub funding metadata, and generated leaderboard docs.
- Added tests covering the browser operator route in CLI and leaderboard support surfaces.
- Release methodology and reproducibility documentation:
  - `docs/METHODOLOGY.md`
  - `docs/REPRODUCIBILITY.md`
- Minimal front-page README doc index linking leaderboard, methodology, reproducibility, submission, and pack data-card docs.
- Stabilized release metadata at `1.0.0`.
- Multi-seed analysis and reporting are now first-class release methodology:
  - confidence intervals and significance marker,
  - task/test strength gates,
  - official multi-seed protocol and compute budget manifest,
  - pack registry + external pack hashing,
  - Docker sanity path and CI docker sanity job,
  - leaderboard UI v2 artifact flow.
- Added deterministic `analyze` CLI subcommand: `python -m mentor_worker_benchmark analyze --results <path> --out <json>`
  - Computes multi-replicate means, 95% bootstrap CIs, lift CI, and paired bootstrap significance.
  - Emits explicit provenance (`analysis_version`, `ci_method`, `bootstrap_samples`, `bootstrap_seed`).
- Added support for optional `results.replicates` payloads to represent seeded reruns of the same benchmark config.
- Submission export now always bundles `analysis.json` alongside `results.json`, `environment.json`, and `submission_manifest.json`.
- Submission verification now:
  - Requires a valid `analysis.json` when `results.replicates` contains multiple replicates.
  - Deterministically backfills analysis during verify when a single-replicate archive omits `analysis.json`.
- Community leaderboard normalization now surfaces CI/significance fields per submission:
  - `baseline_mean`, `baseline_ci_low`, `baseline_ci_high`
  - `mentored_mean`, `mentored_ci_low`, `mentored_ci_high`
  - `lift_mean`, `lift_ci_low`, `lift_ci_high`, `lift_significant`
- Legacy `best_worker` Baseline/Mentored/Lift fields are preserved and now mapped to analysis means for backward compatibility.
- Docs leaderboard UI now shows CI tooltips on Baseline/Mentored/Lift and a `sig` marker when lift CI excludes zero.
- Added deterministic analysis and submission verification tests, including a two-replicate fixture.
- Bumped package version to `0.3.0`.
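
The multi-replicate analysis described in this release (replicate means, 95% bootstrap CIs, lift CI, and a significance flag when the lift CI excludes zero) can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the percentile method, the default sample count, and the default seed are all assumptions chosen for illustration. Only the output field names (`lift_mean`, `lift_ci_low`, `lift_ci_high`, `lift_significant`, `ci_method`, `bootstrap_samples`, `bootstrap_seed`) come from the entries above.

```python
import random
from statistics import mean


def bootstrap_ci(values, samples=2000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the analysis deterministic
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(samples))
    low = means[int((alpha / 2) * samples)]
    high = means[min(samples - 1, int((1 - alpha / 2) * samples))]
    return mean(values), low, high


def paired_lift(baseline, mentored, samples=2000, seed=0):
    """Paired bootstrap over per-replicate (mentored - baseline) differences."""
    if len(baseline) != len(mentored):
        raise ValueError("baseline and mentored need one score per replicate")
    diffs = [m - b for b, m in zip(baseline, mentored)]
    lift_mean, low, high = bootstrap_ci(diffs, samples=samples, seed=seed)
    return {
        "lift_mean": lift_mean,
        "lift_ci_low": low,
        "lift_ci_high": high,
        # Flagged only when the whole interval lies on one side of zero.
        "lift_significant": low > 0 or high < 0,
        # Provenance values below are illustrative assumptions.
        "ci_method": "percentile_bootstrap",
        "bootstrap_samples": samples,
        "bootstrap_seed": seed,
    }


if __name__ == "__main__":
    # Made-up pass rates for three seeded replicates, for illustration only.
    print(paired_lift([0.40, 0.42, 0.41], [0.50, 0.53, 0.52]))
```

Because the random generator is seeded explicitly, rerunning the sketch on the same replicates yields the same interval, which is the property the `bootstrap_seed` provenance field is meant to make auditable.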
- Refined leaderboard UI v2 in `docs/index.html` generation: single-table flow with role/pack/suite/search/sort controls, highlight cards, and per-row commit copy action.
- Kept docs generation deterministic with embedded summary JSON and retained fast UI-facing generator test coverage.
- Added fresh dated community submissions and refreshed normalized leaderboard artifacts (`leaderboard/summary.json`, `docs/leaderboard.md`, `docs/index.html`).
- Fixed README/doc inaccuracies: `task_pack_v2` quick split is `30` tasks, and CLI suite examples now include `dev10`.
- Clarified official baseline policy:
  - headline official numbers come from `dev`/`dev50`/`test`
  - official `dev10`/`quick` runs are sanity checks only.
- Improved community leaderboard normalization for legacy bundles missing newer summary fields.
  - Backfills `total_passes`, per-mode pass counts, model-call errors/timeouts from raw `results.runs` when available.
  - Emits explicit `metrics_source` metadata in normalized submission JSON.
  - Adds official-role labeling (`headline` vs `sanity`) and updates docs rendering accordingly.
- Added regression tests for leaderboard legacy backfill and official-role classification.
- Bumped package version to `0.2.1`.
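
The legacy backfill and official-role labeling from this release can be sketched as follows. Everything about the data layout here is an assumption: the `summary` key, the per-run `mode` and `passed` keys, the `passes_by_mode` field, and the `metrics_source` values are hypothetical stand-ins, not the project's schema. The suite-to-role mapping follows the baseline policy stated above (`dev`/`dev50`/`test` are headline, `dev10`/`quick` are sanity).

```python
HEADLINE_SUITES = {"dev", "dev50", "test"}
SANITY_SUITES = {"dev10", "quick"}


def official_role(suite):
    """Label an official run as 'headline' or 'sanity'; None if unrecognized."""
    if suite in HEADLINE_SUITES:
        return "headline"
    if suite in SANITY_SUITES:
        return "sanity"
    return None


def backfill_summary(results):
    """Fill in missing pass counts from raw runs (hypothetical schema)."""
    summary = dict(results.get("summary", {}))
    runs = results.get("runs") or []
    if "total_passes" in summary:
        # Newer bundles already carry the summary fields.
        summary["metrics_source"] = "summary"
    elif runs:
        # Legacy bundle: recompute counts from the raw per-run records.
        passes_by_mode = {}
        for run in runs:
            mode = run.get("mode", "unknown")
            passes_by_mode[mode] = passes_by_mode.get(mode, 0) + int(bool(run.get("passed")))
        summary["passes_by_mode"] = passes_by_mode
        summary["total_passes"] = sum(passes_by_mode.values())
        summary["metrics_source"] = "runs"
    else:
        # Nothing to backfill from; record that explicitly.
        summary["metrics_source"] = "missing"
    return summary


if __name__ == "__main__":
    legacy = {
        "runs": [
            {"mode": "worker_only", "passed": True},
            {"mode": "mentor_worker", "passed": False},
            {"mode": "mentor_worker", "passed": True},
        ]
    }
    print(backfill_summary(legacy))
    print(official_role("dev50"), official_role("quick"))
```

Recording where each number came from (summary, raw runs, or missing) lets a reader of the normalized JSON distinguish reported metrics from recomputed ones.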
- Local Ollama integration for mentor/worker chat loops and setup checks.
- Objective coding benchmark harness with patch application and pytest scoring.
- `task_pack_v1` deterministic corpus (300 tasks) with `train`/`dev`/`test` and `quick` split.
- Benchmark run modes:
  - `worker_only`
  - `mentor_worker`
  - `mentor_only_suggestion_noise`
  - `stronger_worker`
  - `mentor_swap`
- Reproducibility mode (`--repro`) with fixed generation settings and deterministic ordering.
- Mentor constraint enforcement with violation detection, blocking, and logging.
- Patch safety checks and isolated test execution with per-task timeout.
- Result artifacts:
  - `results/results.json`
  - `results/leaderboard.md`
  - `results/schema.md`
- CLI commands:
  - `setup`
  - `run`
  - `sanity`
  - `leaderboard`
  - `compare`
- CI workflow for tests, task-pack metadata/schema validation, and sanity subset checks.
- Leaderboard publishing utility (`scripts/publish_leaderboard.py`) that also writes `docs/index.html` for GitHub Pages.
- Project quality and community files:
  - `CONTRIBUTING.md`
  - `CODE_OF_CONDUCT.md`
  - GitHub release notes template config.