This list summarizes remaining work to reach a clean, shippable state after recent infra and docs improvements.
- Run a full end-to-end suite on the target dataset (MCQ full, choices-only, Cloze per `CLOZE_MODE`, robustness tasks).
- Populate `docs/results/report.md` using `scripts/fill_report.py` and verify all placeholders are filled.
- Generate and check figures under `artifacts/figs/` (paraphrase consistency, perturbation fragility, etc.).
- Validate release artifacts via `bash scripts/validate_release.sh` (no raw text or per-item exploit labels).
- Optionally publish the logs viewer bundle and confirm it renders locally.
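
The placeholder check in the second item could be automated with something like the sketch below. It assumes placeholders use a `{{NAME}}` syntax, which is a guess and not confirmed by this list; `PLACEHOLDER_RE`, `find_unfilled_placeholders`, and `check_report` are hypothetical names, so adjust the pattern to whatever the report template actually uses.

```python
import re
from pathlib import Path

# Assumption: unfilled placeholders look like {{NAME}}. Change this pattern
# to match the real template syntax used by the report.
PLACEHOLDER_RE = re.compile(r"\{\{\s*[\w.-]+\s*\}\}")


def find_unfilled_placeholders(text):
    """Return (line_number, token) pairs for every placeholder left in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER_RE.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


def check_report(path="docs/results/report.md"):
    """Read the report from disk and list any placeholders still present."""
    return find_unfilled_placeholders(Path(path).read_text(encoding="utf-8"))
```

An empty list from `check_report()` would indicate that no tokens matching the assumed pattern remain.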
- Run the full `pytest` suite and ensure it is green (HF cloze smoke remains guarded by `RUN_HF_CLOZE_SMOKE=0`).
- Optional: add a small assertion that `heuristics_summary` keys are always present in `summary.json` (defense-in-depth).
- Optional: add an integration smoke test that checks the final summary from `scripts/run_evalset.sh` includes the configured `k`.
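
One possible shape for the optional `heuristics_summary` assertion is sketched below. It assumes `summary.json` holds a JSON object with a top-level `heuristics_summary` object; the helper names and the `required_keys` parameter are hypothetical, and the actual sub-keys to require are not specified in this list.

```python
import json
from pathlib import Path


def missing_heuristics_keys(summary, required_keys=()):
    """Return dotted names of missing entries; an empty list means all present.

    Assumes `summary` is the parsed top-level JSON object (a dict).
    """
    section = summary.get("heuristics_summary")
    if not isinstance(section, dict):
        # The whole section is absent or has the wrong type.
        return ["heuristics_summary"]
    return [f"heuristics_summary.{key}" for key in required_keys if key not in section]


def assert_heuristics_summary(path, required_keys=()):
    """Load a summary file and fail loudly if expected keys are absent."""
    summary = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_heuristics_keys(summary, required_keys)
    assert not missing, f"{path}: missing {missing}"
```

Keeping the key list as a parameter avoids hard-coding names here; a test could pass in whichever keys the pipeline is expected to emit.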
- Unify provider-prefix normalization across all Inspect calls in `scripts/run_all.sh` (benign pairs path follows the same normalization as others).
- Consider echoing per-phase elapsed times in `scripts/run_all.sh` for observability (consistency with `run_evalset.sh`).
- Review internal links after the restructure; fix any missed references (anchors and paths).
- Keep cost/throughput heuristic single-sourced via `robustcbrn/utils/cost.py`; avoid re-stating numbers elsewhere.
- Optionally add a minimal “Docs Index” section at the top of `docs/README.md` with the new structure outline.
- Follow `docs/safety/release-checklist.md` to finalize the public artifact set.
- Tag a release once report, figures, and artifacts are validated.