Turn claims into verdicts grounded in measurement.
pip install crucible-bench
python examples/demo.py
Open the visual cleanroom verdict surface at examples/crucible-demo.html.
Claims are cheap until a decision depends on them. crucible makes a thesis stand next to the measurement that could break it, and turns uncertainty into a verdict you can re-check.
Use it on a claim that needs to survive review, sponsor a domain oracle, or fund the cleanroom review path for harder evaluations.
- Release: crucible-bench 1.1.0; command crucible; Python 3.11+; zero third-party runtime dependencies in core.
- Operator surface: crucible status --json, crucible doctor --json, crucible demo --json, and crucible mcp expose the Project Telos action envelope, the primary workflow commands, integration surfaces, and native MCP tools: crucible.status, crucible.doctor, crucible.assess, and crucible.recheck.
- Current floor: 1.1.0 is the operator floor: one-command runs, cleanroom review packets, oracle replay templates, registry rechecks, and the native MCP bridge over the measurement -> verdict spine.
- Enterprise readiness: docs/ENTERPRISE-READINESS.md records the large-context, action-receipt, readability, and host-integration contract for unattended agent workflows.
Ideas are cheap to assert and expensive to check. A claim gets repeated until it sounds true. A correction arrives quietly and never catches up. A theory's standing becomes a vibe rather than a record, and the loudest version wins. crucible is the organ that holds an idea to account.
It is the cognition counterpart to Gather. Where Gather brings evidence in and records how it was obtained (the afferent organ), crucible tests a thesis against that evidence and emits a verdict you can re-check (the efferent organ). You register a thesis as a set of claims, and for each claim the observation that would refute it. crucible steelmans the claims (proposing the test that would settle each), measures them against a substrate oracle, and writes a verdict per claim: MATCH, DRIFT, or UNVERIFIABLE. The verdict is grounded in the measurement, not in a judge's opinion, and it recomputes from the record, so a confident assertion has no effect on the rechecked result.
- Register a thesis with its claims and, per claim, its falsification condition.
- Steelman: independent adversaries propose the strongest refutation of each claim. They propose what to test; they do not decide.
- Measure: bind each claim to a substrate and a metric, and record the deviation from what the claim predicts.
- Refine the weakest axis: strengthen the substrate, sharpen the measurement, or amend the thesis, then re-iterate.
- Witness: a re-checkable verdict per claim (MATCH / DRIFT / UNVERIFIABLE), sealed so a reader can re-hash the stored record and catch inconsistent tampering. This is not an authorship signature.
The continuous part is the loop: substrates, measurements, and theses all improve across rounds, and the witnessed verdicts track which moved.
1.0.0 delivered the flagship floor: the full first loop plus drift tracking, Markdown assessment reports, publication-gated export, registry operations, optional subprocess-backed seam adapters, Telos witnessed-artifact interop, Gather/index protocol interop, measurement recheck descriptors, batch assessment/report bundles, and clean verifier practice. The 1.1.0 branch adds operator run, oracle recheck, and cleanroom review commands over that spine. With it you can:
- Register a thesis and steelman it (adversaries propose the test).
- Measure each claim against a substrate oracle.
- Refine across substrate rounds toward a cohesively verified thesis.
- Witness a re-derivable verdict per claim.
- Compare assessment rounds to see what held, moved, improved, or regressed.
- Inspect a growing registry by status, scope, and latest verified verdict.
- Plug configured oracle commands into the steelman and measure seams.
- Consume telos.witnessed-artifact/v1 envelopes by re-running their named verifiers.
- Use sealed Gather digests as evidence.
- Replay index verification records against supplied graph packs.
- Persist optional measurement replay descriptors for oracle-level checks.
- Run a manifest of thesis jobs into one registry.
- Render witnessed assessments as readable Markdown reports.
- Run the whole steelman -> measure -> assess -> recheck path as one cleanroom review packet.
A fenced thesis can be assessed locally, but the export edge refuses it by default.
A claim's standing is a verdict grounded in a measurement, not a judge's say-so. Steelman adversaries propose; the measurement decides. The decision is a pure function of the recorded measurement, with no model in the verdict step, so the verdict recomputes from the stored record and a fluent assertion has no effect on the rechecked result. UNVERIFIABLE is fail-closed: an axis that cannot be measured is never read as holding.
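As an illustration of what "a pure function of the recorded measurement" can look like, here is a minimal sketch. It is not crucible's verdict_for: the Measurement shape, the sketch_verdict name, and the inclusive tolerance bound are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional

MATCH, DRIFT, UNVERIFIABLE = "MATCH", "DRIFT", "UNVERIFIABLE"


@dataclass(frozen=True)
class Measurement:
    # Hypothetical record shape for this sketch only.
    deviation: Optional[float]
    tolerance: Optional[float]


def sketch_verdict(measurement: Optional[Measurement]) -> str:
    """Decide from the recorded numbers alone: no I/O, no model call."""
    if measurement is None:
        return UNVERIFIABLE
    dev, tol = measurement.deviation, measurement.tolerance
    # Fail closed: a missing or non-finite number is never read as holding.
    if dev is None or tol is None:
        return UNVERIFIABLE
    if not (math.isfinite(dev) and math.isfinite(tol)):
        return UNVERIFIABLE
    # Assumption: a deviation exactly at the tolerance still counts as MATCH.
    return MATCH if abs(dev) <= tol else DRIFT
```

Because the function reads nothing but its argument, calling it again on the stored record always gives the same answer, which is the property a recheck relies on.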
- A receipt on every claim. Each claim carries a sha256 of its content, so a tampered claim is caught by re-hashing.
- A grounded verdict, not a judgment call. verdict_for(claim, measurement) is pure: a measurement within tolerance is MATCH, outside is DRIFT, absent or unmeasurable is UNVERIFIABLE.
- A witnessed assessment out. An assessment folds its verdicts into one re-checkable seal that a downstream organ consumes.
- A clean verifier boundary. A verifier gets the original spec and the artifact. It does not need the worker's context, reasoning trace, or intermediate steps. If success cannot be evaluated from that minimal state, the spec is not checkable yet. crucible run --bundle makes that boundary concrete with a packet-level review note, and crucible review BUNDLE validates the packet before handoff.
- Stands alone, serves the constellation. crucible runs on its own with zero third-party dependencies and Null seams, and it composes with the other Telos organs (Gather's evidence, index's maps) as a peer through clean protocol contracts. Compose, do not absorb.
- Publication-gated. Theses and verdicts carry a disposition; fenced material is refused at the export edge by default. This is a mechanical disposition and marker guard, not semantic content classification. This public repository carries only self-contained, publishable examples.
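The "receipt on every claim" idea from the list above can be sketched with the standard library. This is an illustration only: the canonical JSON encoding, the claim fields, and the function names are assumptions, not crucible's actual hashing scheme.

```python
import hashlib
import json


def claim_receipt(claim: dict) -> str:
    """Hex sha256 over a canonical JSON encoding of the claim (assumed encoding)."""
    canonical = json.dumps(
        claim, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(claim: dict, stored_receipt: str) -> bool:
    """Re-hash the claim and compare it with the receipt stored alongside it."""
    return claim_receipt(claim) == stored_receipt


# Hypothetical claim content, for illustration.
claim = {
    "text": "binary search returns the index of the target",
    "falsified_by": "a sorted input where the returned index is wrong",
}
receipt = claim_receipt(claim)
tampered = dict(claim, text="binary search is always fastest")
```

Sorting keys and fixing separators keeps the hash independent of dictionary ordering, so only a change in content changes the receipt.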
When published:
pip install crucible-bench
The distribution is crucible-bench; it installs the crucible command and the crucible package
(import crucible). The core is pure standard library. From a clone:
pip install -e ".[dev]"
From a clone, run several thesis assessments into one registry, with optional report files:
crucible batch examples/batch-binary-search.json --registry .crucible-registry --reports reports
A job names a thesis plus exactly one measurement source:
{
"jobs": [
{
"id": "binary-search-manual",
"thesis": "thesis-binary-search.json",
"measurements": "measurements-binary-search.json"
},
{
"id": "binary-search-substrate",
"thesis": "thesis-binary-search.json",
"substrate": "substrate-binary-search.json"
}
]
}
For an operator session, run ties the loop together and records the witnessed assessment into a
registry before reporting the disk recheck:
crucible run examples/thesis-binary-search.json \
--measurements examples/measurements-binary-search.json \
--registry .crucible-registry \
--bundle reports/binary-search-run \
--json
The JSON run record includes thesis metadata, steelman refutations, the witnessed assessment, the
derived verdict rows, disk recheck status, and verifier packet artifact names. --bundle DIR creates
DIR/spec.json, DIR/run.json, DIR/report.md, and DIR/review.md with exclusive writes. Inside
the packet, artifact references stay packet-relative (. plus file names), and review re-checks
that path contract before handoff, so the verifier artifact
does not depend on the operator's local workspace path. The packet gives a verifier only the
original spec and artifact. Use --substrate instead of
--measurements to run through the table oracle in the same session shape.
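For readers unfamiliar with exclusive writes, the sketch below shows the general technique in Python: opening with mode "x" fails instead of overwriting. The write_packet helper and its name check are hypothetical and only illustrate the behavior described above; they are not crucible's implementation.

```python
import tempfile
from pathlib import Path


def write_packet(bundle_dir: Path, artifacts: dict[str, str]) -> list[str]:
    """Write each artifact once; refuse to overwrite or to leave the packet."""
    bundle_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in artifacts.items():
        # Accept bare file names only: no directories, no "..", no absolute paths.
        if not name or name in {".", ".."} or Path(name).name != name:
            raise ValueError(f"not a packet-relative file name: {name!r}")
        # Mode "x" raises FileExistsError rather than replacing an existing file.
        with open(bundle_dir / name, "x", encoding="utf-8") as handle:
            handle.write(text)
        written.append(name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    packet = Path(tmp) / "binary-search-run"
    first = write_packet(packet, {"spec.json": "{}", "run.json": "{}"})
    try:
        write_packet(packet, {"spec.json": "{}"})
        overwrote = True
    except FileExistsError:
        overwrote = False
```

A second run into the same directory fails loudly, so an earlier packet cannot be silently replaced.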
Before handing the packet to a verifier, validate the cleanroom boundary:
crucible review reports/binary-search-run --json
The review check fails closed if the bundle is missing required files, carries extra context such as
notes or chat logs, omits the cleanroom verifier boundary, has a spec.json that no longer
matches the run record, has a report.md that does not render from run.json, has failed
embedded run integrity checks, rewrites run.json artifact paths away from packet-relative names,
or has review.md instructions that diverge from the cleanroom
verifier boundary.
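Two of those conditions, missing required files and extra context files, reduce to a set comparison on the bundle directory. The sketch below covers only that part, under the assumption that the four files named for --bundle are the complete allowed set; packet_file_problems is a hypothetical helper, not the code behind crucible review.

```python
import tempfile
from pathlib import Path

# Assumption: exactly the four files that --bundle is described as writing.
REQUIRED_FILES = frozenset({"spec.json", "run.json", "report.md", "review.md"})


def packet_file_problems(bundle_dir: Path) -> list[str]:
    """Return findings for the file-set check; an empty list means it passed."""
    if not bundle_dir.is_dir():
        return [f"not a directory: {bundle_dir}"]
    present = {entry.name for entry in bundle_dir.iterdir()}
    problems = [
        f"missing required file: {name}" for name in sorted(REQUIRED_FILES - present)
    ]
    problems += [
        f"unexpected extra entry: {name}" for name in sorted(present - REQUIRED_FILES)
    ]
    return problems


with tempfile.TemporaryDirectory() as tmp:
    packet = Path(tmp)
    for name in ("spec.json", "run.json", "report.md"):
        (packet / name).write_text("{}", encoding="utf-8")
    (packet / "notes.txt").write_text("operator scratch notes", encoding="utf-8")
    findings = packet_file_problems(packet)
```

Any finding at all should be treated as a failure, which keeps the check fail-closed; the remaining conditions (spec and run record agreement, report re-rendering, integrity checks) need the file contents and are not modeled here.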
Descriptor-bearing measurements can be inspected from the registry:
crucible recheck .crucible-registry --json
To hand the work to a verifier or oracle wrapper, write a replay pack template:
crucible recheck .crucible-registry --template replay-template.json
The template contains claim context, the original recheck descriptor, the sealed measurement row to
reproduce, and blank measurement fields for the verifier to fill. The assessment block binds a
returned pack to the thesis id, assessment seal, and measurement seal. A verifier or oracle wrapper
can then return a replay pack with the original descriptor and the reproduced measurement row:
{
"replays": [
{
"recheck": {"oracle": "telos:conservation", "verifier": "conservation"},
"measurement": {
"claim_id": "claim-id",
"claim_sha256": "claim-sha256",
"deviation": 0.0,
"tolerance": 0.1,
"method": "telos:conservation",
"measured_at": 1000.0,
"evidence": ["verifier reproduced certificate"]
}
}
]
}
Run the replay check with:
crucible recheck .crucible-registry --pack replay.json --json
The replay pack does not decide the verdict. If it includes an assessment block, that block must
match the selected assessment before measurement replay starts. The pack only proves whether the
sealed descriptor-bearing measurement rows can be reproduced; the verdict still follows from the
stored measurement through verdict_for.
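One way to picture "reproduce the sealed row without deciding the verdict" is a hash comparison between the stored measurement row and the replayed one. The sketch below assumes the seal is a sha256 over canonical JSON and that reproduction means exact row equality; crucible's real sealing and comparison rules may differ, and row_seal and replay_reproduces are hypothetical names.

```python
import hashlib
import json


def row_seal(row: dict) -> str:
    """Assumed seal: sha256 over a canonical JSON encoding of the row."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_reproduces(stored_row: dict, stored_seal: str, replay: dict) -> bool:
    """Report only whether the replayed row matches the sealed one."""
    if row_seal(stored_row) != stored_seal:
        return False  # the stored record is itself inconsistent: fail closed
    return row_seal(replay["measurement"]) == stored_seal


# Measurement row taken from the replay pack example above.
stored_row = {
    "claim_id": "claim-id",
    "claim_sha256": "claim-sha256",
    "deviation": 0.0,
    "tolerance": 0.1,
    "method": "telos:conservation",
    "measured_at": 1000.0,
    "evidence": ["verifier reproduced certificate"],
}
stored_seal = row_seal(stored_row)
good_replay = {"measurement": dict(stored_row)}
bad_replay = {"measurement": dict(stored_row, deviation=0.2)}
```

The function returns a boolean about reproduction and nothing else, so there is still only one path from measurement to verdict.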
Crucible is at its 1.1 operator floor: the core loop is stable, the public CLI is covered, and the release branch has the one-command run, cleanroom review, oracle replay, registry recheck, and native MCP surfaces needed by the Project Telos five-flagship room. Development continues by adding sharper substrates and oracle edges without weakening the measurement -> verdict spine.
Shipped:
- The verdict spine: a pure verdict_for returning MATCH / DRIFT / UNVERIFIABLE from a measurement, with no model in the verdict step and UNVERIFIABLE fail-closed.
- A content-hash receipt on every claim, and a thesis seal that binds the claims, the title, and the disposition (so the publication gate can trust the label).
- A witnessed assessment that persists its verdicts and measurements, so verify_assessment recomputes the seals from the stored data and recheck_assessment re-derives each verdict from the thesis and the measurements: a verdict, margin, and grounds cannot be asserted, they must follow from the record. Summary counts are re-derived from verdict rows as part of verification, and the thesis disposition is carried in the assessment and verdict rows.
- A content-addressed registry that re-verifies stored claims (MATCH / MISSING / CORRUPT), checks thesis seals (catching a swapped claim a body check would miss), rejects duplicate thesis ids with different seals, refuses symlinked storage paths, and refuses to load a tampered thesis.
- The steelman seam: independent adversaries propose the strongest refutation of each claim and the test that would settle it (they propose; the measurement decides). The Null default surfaces the claim's own falsification and invents nothing; custom edges plug in through the same API shape.
- The measure seam: a sound oracle that decides a claim against a substrate. The TableMeasure computes each claim's deviation from a predicted value over a provided substrate (offline, no model); the NullMeasure default measures nothing (UNVERIFIABLE). The Telos verifier or a proof oracle for abstract math plugs in through the same shape, so the verdict stays grounded, never asserted.
- Measurement rechecks: assessment rows persist and seal measured_at, evidence, and optional recheck descriptors. recheck_measurements lets a caller provide oracle replayers that reproduce stored measurement inputs from those descriptors.
- Oracle replay CLI: crucible recheck REGISTRY [--template FILE] [--pack FILE] lists descriptor-bearing measurement rows, writes replay pack templates for clean verifier handoff, and validates finished oracle replay packs against the sealed measurement rows without creating a second verdict path.
- The refine loop: grade each claim's measured margin, compute harmonic-mean cohesion, reflect the weakest claim, and re-measure across substrate rounds until the thesis is cohesively verified or the budget is spent honestly. The loop reports the weakest claim instead of pretending a short thesis held.
- Drift tracking across witnessed assessments: drift_track(previous, current) and crucible drift REGISTRY compare the latest two rounds and classify each claim as held, moved, improved, or regressed from the recorded margins.
- Assessment reports: render_assessment_report and crucible report REGISTRY render a deterministic Markdown artifact with counts, seals, integrity checks, verdict dispositions, measurement evidence, and recheck descriptors.
- Batch assessment: crucible batch MANIFEST --registry DIR [--reports DIR] consumes a manifest of thesis jobs, records each assessment into one registry, and optionally writes one Markdown report per job. Manifest paths stay inside the manifest bundle, path-like missing refs fail closed, and reports use unique index-prefixed names with exclusive writes.
- Operator runs: crucible run THESIS --registry DIR (--measurements FILE | --substrate FILE) runs the null steelman, measurement, witnessed assessment, disk recheck, and optional Markdown/JSON artifact writes as one scannable session. --bundle DIR writes spec.json, run.json, report.md, and review.md as a self-contained cleanroom review packet with packet-relative artifact references.
- Cleanroom bundle review: crucible review BUNDLE validates that a review packet contains only the allowed spec/artifact files, carries the verifier boundary, has matching spec.json and run-record thesis metadata, has packet-relative run.json artifact paths, has passing embedded run integrity checks, has a report.md artifact that re-renders from run.json, and keeps review.md pinned to the cleanroom verifier instructions before verifier handoff.
- Publication-gated export: gate_check, export_guard, export_thesis, and crucible export THESIS refuse fenced material and explicit restricted markers before emitting a public thesis contract.
- Registry operations: registry_stats, search_theses, prune_objects, and crucible registry stats|search|prune summarize the corpus, recall theses by scope/status/latest verdict, and prune orphan claim bodies only when explicitly applied after registry path guards pass.
- Optional subprocess edges: SubprocessSteelman and SubprocessMeasure run configured commands through bounded JSON stdin/stdout, reject shell strings, enforce timeouts, and stamp claim identity locally. By default they pass only a minimal environment, discard stderr, and actively terminate children whose stdout exceeds the configured response bound. The default seams remain Null and the verdict step still has no model in it.
- Telos artifact interop: TelosMeasure consumes telos.witnessed-artifact/v1 envelopes through a caller-provided verifier registry. The carried certificate is not trusted; the named verifier is re-run, mapped into the normal Measurement -> verdict_for spine, and stored with a telos:<verifier> replay descriptor.
- Gather/index interop: GatherDigestMeasure consumes sealed Gather digests and checks that a claim's expected evidence receipt exists; IndexMeasure consumes index.verification/1 records and replays their structural claims against supplied graph packs. Both map into the same normal Measurement -> verdict_for spine.
- Readiness coverage: the bundled examples run through the public CLI under test, help output covers the shipped command surface, and docs/RELEASE-READINESS.md records the 1.0 gate checklist, including the spec-plus-artifact-only verifier rule.
- The crucible CLI: register, assess, steelman, measure, run, recheck, review, registry list|verify|stats|search|prune, refine, drift, report, batch, export, verdicts [--verify].
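The refine loop entry above mentions harmonic-mean cohesion. The sketch below shows why a harmonic mean suits that role: it is dominated by the smallest grade, so one weak claim cannot be averaged away. The grading formula and all function names here are assumptions for illustration; the list above does not specify how crucible grades a margin.

```python
def grade(deviation: float, tolerance: float) -> float:
    """Assumed grading: 1.0 at zero deviation, 0.0 at or beyond the tolerance."""
    if tolerance <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(deviation) / tolerance)


def cohesion(grades: list[float]) -> float:
    """Harmonic mean of the grades; any zero grade pulls the result to zero."""
    if not grades or any(g <= 0 for g in grades):
        return 0.0
    return len(grades) / sum(1.0 / g for g in grades)


def weakest(grades_by_claim: dict[str, float]) -> str:
    """Name the claim a refine round would target next."""
    return min(grades_by_claim, key=lambda claim_id: grades_by_claim[claim_id])


# Hypothetical claim ids and measurements.
grades_by_claim = {
    "claim-a": grade(0.0, 0.1),
    "claim-b": grade(0.05, 0.1),
}
score = cohesion(list(grades_by_claim.values()))
target = weakest(grades_by_claim)
```

With grades of 1.0 and 0.5, the arithmetic mean would be 0.75 while the harmonic mean is about 0.67, so the weaker claim weighs more heavily on the thesis score.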
crucible is fair-source: the code is open to read, run, and build on, with commercial use reserved so the project can fund its own development. Copyright stays with the author. See LICENSE for the exact terms.
