Commit 5b48611

docs: update documentation for SN benchmark and lifecycle features
Update AGENTS.md with sn mint (renamed from sn build), sn reset, sn clear, --reset-to flag, benchmark cache reporting, and 5-dimensional scoring. Add SN module paths to project-dev skill, LLM proxy note to service-ops skill, and SN key files table to engineer agent. Delete superseded plans 16, 17, 18. Mark Plan 19 complete.
1 parent 646e921 commit 5b48611

8 files changed

Lines changed: 65 additions & 597 deletions

.github/agents/engineer.agent.md

Lines changed: 10 additions & 0 deletions
@@ -84,6 +84,16 @@ When modifying LinkML schemas:
 - Documentation and prompt template updates
 - Refactors where the before/after is well-defined
 
+### Commonly-Modified Areas
+
+| Path | Purpose |
+|------|---------|
+| `imas_codex/sn/` | Standard name pipeline (mint, benchmark, graph ops) |
+| `tests/sn/` | SN test suite (mostly mock-based, no Neo4j required) |
+| `imas_codex/llm/prompts/sn/` | LLM prompt templates for SN |
+| `imas_codex/sn/benchmark_reference.py` | Gold reference set for benchmark scoring |
+| `imas_codex/sn/benchmark_calibration.yaml` | Calibration dataset for reviewer consistency |
+
 ## When to Escalate
 
 If a task requires:

.github/skills/project-dev/SKILL.md

Lines changed: 14 additions & 0 deletions
@@ -84,6 +84,10 @@ git push origin main
 | `@pytest.mark.integration` | Full integration tests |
 | `@pytest.mark.unit` | Fast unit tests |
 
+SN tests live in `tests/sn/` and run with `uv run pytest tests/sn/ -v`. They do not require
+Neo4j unless marked `@pytest.mark.graph` — the rest use mocks. Benchmark tests validate prompt
+parity with the mint pipeline, calibration dataset integrity, and reference set coverage.
+
 ## Project Structure
 
 | Directory | Purpose |
@@ -95,7 +99,9 @@ git push origin main
 | `imas_codex/tools/` | MCP tool implementations |
 | `imas_codex/remote/` | Remote execution (SSH, scripts) |
 | `imas_codex/llm/` | LLM integration and prompt templates |
+| `imas_codex/sn/` | Standard name pipeline (mint, benchmark, graph ops) |
 | `tests/` | Test suite (mirrors source structure) |
+| `tests/sn/` | Standard name test suite (mostly mock-based) |
 | `plans/features/` | Active feature plans |
 | `agents/` | Agent documentation and schema reference |
 
@@ -107,3 +113,11 @@ git push origin main
 - **Model selection**: Use `get_model(section)` from `imas_codex.settings`
 - **Facility config**: Use `get_facility(facility)` — never hardcode facility values
 - **Remote execution**: Use `run_python_script()` from `imas_codex.remote.executor`
+
+### SN Key Files
+
+| File | Purpose |
+|------|---------|
+| `imas_codex/sn/benchmark_reference.py` | Gold reference set (52 entries across 8 IDSs) |
+| `imas_codex/sn/benchmark_calibration.yaml` | Known-quality examples for reviewer consistency |
+| `imas_codex/llm/prompts/sn/` | LLM prompt templates for mint, review, and benchmark |

.github/skills/service-ops/SKILL.md

Lines changed: 5 additions & 0 deletions
@@ -75,6 +75,11 @@ uv run imas-codex llm spend # Cost tracking
 uv run imas-codex llm logs # View logs
 ```
 
+`sn mint` and `sn benchmark` require the LLM proxy to be running. Model names must use the
+`openrouter/` prefix (e.g. `openrouter/anthropic/claude-sonnet-4-5`) to preserve
+`cache_control` blocks — prompt caching is handled provider-side by OpenRouter, not by this
+codebase.
+
 ## SSH Tunnels
 
 ```bash

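The `openrouter/` prefix rule added to the service-ops skill could be checked with a small guard before a model name is passed to the proxy. This is a minimal sketch under stated assumptions: `require_openrouter_prefix` is a hypothetical helper, not a function from `imas_codex`, and the error wording is invented for illustration.

```python
OPENROUTER_PREFIX = "openrouter/"


def require_openrouter_prefix(model: str) -> str:
    """Return the model name unchanged if it carries the openrouter/ prefix.

    Hypothetical guard: raises ValueError otherwise, so a misnamed model
    fails early instead of silently running without the prefix.
    """
    if not model.startswith(OPENROUTER_PREFIX):
        raise ValueError(
            f"model {model!r} must start with {OPENROUTER_PREFIX!r}"
        )
    return model


# Example from the note above: this name passes the check.
checked = require_openrouter_prefix("openrouter/anthropic/claude-sonnet-4-5")
```

Raising instead of silently prepending the prefix is a design choice of this sketch; it keeps the model name in configuration as the single source of truth.
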
AGENTS.md

Lines changed: 32 additions & 2 deletions
@@ -712,23 +712,53 @@ Azure Web App has continuous deployment enabled on ACR. When a new image appears
 
 | Command | Purpose | Key Options |
 |---------|---------|-------------|
-| `sn build` | Generate standard names from DD paths or facility signals via LLM pipeline | `--source {dd,signals}`, `--ids`, `--domain`, `--facility`, `--cost-limit`, `--dry-run`, `--force`, `--skip-review` |
+| `sn mint` | Generate standard names from DD paths or facility signals via LLM pipeline | `--source {dd,signals}`, `--ids`, `--domain`, `--facility`, `--cost-limit`, `--dry-run`, `--force`, `--skip-review`, `--reset-to` |
 | `sn publish` | Export validated StandardName nodes to YAML catalog files | `--output-dir`, `--ids`, `--domain`, `--group-by {ids,domain,confidence}`, `--confidence-min`, `--catalog-dir`, `--create-pr` |
 | `sn import` | Import reviewed YAML catalog entries back into graph | `--catalog-dir` (required), `--tags`, `--dry-run`, `--check` |
 | `sn status` | Show standard name statistics from graph ||
+| `sn reset` | Reset standard names for re-processing | `--status` (required), `--to`, `--source`, `--ids`, `--dry-run` |
+| `sn clear` | Delete standard names from the graph (relationship-first safety model) | `--status`, `--all`, `--source`, `--ids`, `--include-accepted`, `--dry-run` |
 | `sn benchmark` | Benchmark LLM models on standard name generation quality | `--models`, `--source`, `--reviewer-model` |
 
+### Benchmark
+
+`sn benchmark` uses the same prompt pipeline as `sn mint` (system/user message split via
+`build_compose_context()`). Output table includes a **Cache %** column showing the prompt-cache
+hit rate per model (provider-side via OpenRouter — not something we implement). Scoring is
+**5-dimensional**: accuracy, completeness, physics_correctness, naming_convention, and
+overall, evaluated by a reviewer LLM against a gold reference set (`benchmark_reference.py`,
+52 entries across 8 IDSs). The calibration dataset (`benchmark_calibration.yaml`) provides
+known-quality examples for reviewer consistency checks.
+
 ### StandardName Lifecycle
 
 ```
 drafted → published → accepted
 ↘ rejected
 ```
 
-- **drafted**: Generated by `sn build` (LLM pipeline)
+- **drafted**: Generated by `sn mint` (LLM pipeline)
 - **published**: Exported by `sn publish` to YAML catalog for human review
 - **accepted**: Imported by `sn import` from reviewed catalog (catalog-authoritative)
 
+### Reset and Clear Semantics
+
+**`sn reset`** — Re-processes existing nodes without deleting them. Clears transient fields
+(embedding, model, confidence, generated_at) and removes HAS_STANDARD_NAME and CANONICAL_UNITS
+relationships. Optionally changes `review_status` via `--to <status>`. Default (no `--to`) leaves
+status unchanged, only clears fields.
+
+**`sn clear`** — Deletes StandardName nodes. Uses a relationship-first safety model: HAS_STANDARD_NAME
+edges are removed before deleting nodes, and scoped deletes only remove orphaned nodes. Requires
+either `--status <value>` or `--all`.
+
+**Safety guard:** Both commands require `--include-accepted` to touch names with `review_status=accepted`.
+Accepted names are catalog-authoritative and should rarely be deleted from the graph.
+
+**`sn mint --reset-to`** — Runs a `sn reset` before minting, scoped to the same `--ids`/`--source`
+filter. Accepts `extracted` or `drafted` as the target status. Useful for a clean re-run on a
+specific IDS without touching the rest of the graph.
+
 ### Write Semantics
 
 Two distinct write paths with different semantics:
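
The reset semantics added to AGENTS.md in this hunk can be modelled on in-memory objects. The sketch below is an illustration, not the project's implementation: the real command operates on the graph, whereas `Node` and `reset_nodes` are hypothetical names, the example node names are invented, and the reading of the safety guard (accepted nodes are skipped unless `include_accepted` is set) is an assumption drawn from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

# Field and relationship names are taken from the AGENTS.md text above.
TRANSIENT_FIELDS = ("embedding", "model", "confidence", "generated_at")
RESET_RELATIONSHIPS = {"HAS_STANDARD_NAME", "CANONICAL_UNITS"}


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a StandardName graph node."""
    name: str
    review_status: str
    props: dict = field(default_factory=dict)
    relationships: set = field(default_factory=set)


def reset_nodes(nodes, status: str, to: Optional[str] = None,
                include_accepted: bool = False, dry_run: bool = False):
    """Reset nodes whose review_status equals `status`; return their names."""
    touched = []
    for node in nodes:
        if node.review_status != status:
            continue
        # Safety guard: accepted names are skipped unless explicitly included.
        if node.review_status == "accepted" and not include_accepted:
            continue
        touched.append(node.name)
        if dry_run:
            continue  # report only, change nothing
        for key in TRANSIENT_FIELDS:
            node.props.pop(key, None)
        node.relationships -= RESET_RELATIONSHIPS
        if to is not None:  # without `to`, the status is left unchanged
            node.review_status = to
    return touched


nodes = [
    Node("electron_temperature", "drafted",
         {"model": "m", "confidence": 0.9, "description": "d"},
         {"HAS_STANDARD_NAME"}),
    Node("plasma_current", "accepted", {"confidence": 1.0}, {"CANONICAL_UNITS"}),
]
reset_nodes(nodes, status="drafted", to="extracted")
```

After the call, the drafted node keeps only its non-transient `description` property, loses its `HAS_STANDARD_NAME` edge, and moves to `extracted`; the accepted node is untouched.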

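The benchmark paragraph in the AGENTS.md hunk names five score dimensions and a **Cache %** column. A rough sketch of how such results might be aggregated follows. Everything beyond the five dimension names is an assumption: `ReviewScore`, `mean_scores`, and `cache_hit_percent` are hypothetical, the 0 to 1 score scale is invented, and the cache formula (cached prompt tokens over total prompt tokens) is one plausible definition, not a confirmed one.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class ReviewScore:
    """One reviewer verdict; the five dimension names come from AGENTS.md."""
    accuracy: float
    completeness: float
    physics_correctness: float
    naming_convention: float
    overall: float


def mean_scores(scores):
    """Average each dimension over a non-empty list of ReviewScore objects."""
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(ReviewScore)
    }


def cache_hit_percent(cached_tokens: int, prompt_tokens: int) -> float:
    """Assumed definition: share of prompt tokens served from the cache."""
    if prompt_tokens == 0:
        return 0.0
    return 100.0 * cached_tokens / prompt_tokens


summary = mean_scores([
    ReviewScore(1.0, 1.0, 1.0, 1.0, 1.0),
    ReviewScore(0.0, 1.0, 0.5, 1.0, 0.0),
])
```

`mean_scores` raises `statistics.StatisticsError` on an empty list, which in this sketch is left to the caller to handle.
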
plans/features/standard-names/00-implementation-order.md

Lines changed: 4 additions & 4 deletions
@@ -48,10 +48,10 @@ All status values use past tense: drafted, published, accepted, rejected, skippe
 | 13 | publish-pipeline | Lossless YAML export, batched PRs | 📋 Ready | 11 (all) | 12 (feedback loop) |
 | 14 | mcp-tools-benchmark | SN search/fetch/list MCP tools + benchmark quality | ✅ Done | 11 (embedding) | 19 |
 | 15 | import-physics-domain | Import physics_domain from catalog | ✅ Done | 12 ||
-| ~~16~~ | ~~benchmark-parity~~ | ~~Superseded by Plan 19~~ | 🔀 Merged |||
-| ~~17~~ | ~~sn-lifecycle-management~~ | ~~Superseded by Plan 19~~ | 🔀 Merged |||
-| ~~18~~ | ~~benchmark-calibration~~ | ~~Superseded by Plan 19~~ | 🔀 Merged |||
-| 19 | benchmark-and-lifecycle | Benchmark parity, lifecycle mgmt, calibration, model selection | 📋 Ready | 14 ||
+| ~~16~~ | ~~benchmark-parity~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted |||
+| ~~17~~ | ~~sn-lifecycle-management~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted |||
+| ~~18~~ | ~~benchmark-calibration~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted |||
+| 19 | benchmark-and-lifecycle | Benchmark parity, lifecycle mgmt, calibration, model selection | ✅ Complete | 14 ||
 
 ## Deployment Waves
 

plans/features/standard-names/16-benchmark-parity.md

Lines changed: 0 additions & 152 deletions
This file was deleted.
