diff --git a/.github/agents/engineer.agent.md b/.github/agents/engineer.agent.md
Lines changed: 10 additions & 0 deletions
diff --git a/.github/skills/project-dev/SKILL.md b/.github/skills/project-dev/SKILL.md
Lines changed: 14 additions & 0 deletions
diff --git a/.github/skills/service-ops/SKILL.md b/.github/skills/service-ops/SKILL.md
Lines changed: 5 additions & 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 32 additions & 2 deletions
diff --git a/plans/features/standard-names/00-implementation-order.md b/plans/features/standard-names/00-implementation-order.md
Lines changed: 4 additions & 4 deletions
diff --git a/plans/features/standard-names/16-benchmark-parity.md b/plans/features/standard-names/16-benchmark-parity.md
Lines changed: 0 additions & 152 deletions
@@ -84,6 +84,16 @@ When modifying LinkML schemas:
 - Documentation and prompt template updates
 - Refactors where the before/after is well-defined
 
+### Commonly-Modified Areas
+
+| Path | Purpose |
+|------|---------|
+| `imas_codex/sn/` | Standard name pipeline (mint, benchmark, graph ops) |
+| `tests/sn/` | SN test suite (mostly mock-based, no Neo4j required) |
+| `imas_codex/llm/prompts/sn/` | LLM prompt templates for SN |
+| `imas_codex/sn/benchmark_reference.py` | Gold reference set for benchmark scoring |
+| `imas_codex/sn/benchmark_calibration.yaml` | Calibration dataset for reviewer consistency |
+
 ## When to Escalate
 
 If a task requires:
 
@@ -84,6 +84,10 @@ git push origin main
 | `@pytest.mark.integration` | Full integration tests |
 | `@pytest.mark.unit` | Fast unit tests |
 
+SN tests live in `tests/sn/` and run with `uv run pytest tests/sn/ -v`. They do not require
+Neo4j unless marked `@pytest.mark.graph` — the rest use mocks. Benchmark tests validate prompt
+parity with the mint pipeline, calibration dataset integrity, and reference set coverage.
+
 ## Project Structure
 
 | Directory | Purpose |
@@ -95,7 +99,9 @@ git push origin main
 | `imas_codex/tools/` | MCP tool implementations |
 | `imas_codex/remote/` | Remote execution (SSH, scripts) |
 | `imas_codex/llm/` | LLM integration and prompt templates |
+| `imas_codex/sn/` | Standard name pipeline (mint, benchmark, graph ops) |
 | `tests/` | Test suite (mirrors source structure) |
+| `tests/sn/` | Standard name test suite (mostly mock-based) |
 | `plans/features/` | Active feature plans |
 | `agents/` | Agent documentation and schema reference |
 
@@ -107,3 +113,11 @@ git push origin main
 - **Model selection**: Use `get_model(section)` from `imas_codex.settings`
 - **Facility config**: Use `get_facility(facility)` — never hardcode facility values
 - **Remote execution**: Use `run_python_script()` from `imas_codex.remote.executor`
+
+### SN Key Files
+
+| File | Purpose |
+|------|---------|
+| `imas_codex/sn/benchmark_reference.py` | Gold reference set (52 entries across 8 IDSs) |
+| `imas_codex/sn/benchmark_calibration.yaml` | Known-quality examples for reviewer consistency |
+| `imas_codex/llm/prompts/sn/` | LLM prompt templates for mint, review, and benchmark |
@@ -75,6 +75,11 @@ uv run imas-codex llm spend           # Cost tracking
 uv run imas-codex llm logs            # View logs
 ```
 
+`sn mint` and `sn benchmark` require the LLM proxy to be running. Model names must use the
+`openrouter/` prefix (e.g. `openrouter/anthropic/claude-sonnet-4-5`) to preserve
+`cache_control` blocks — prompt caching is handled provider-side by OpenRouter, not by this
+codebase.
+
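Review note: the prefix rule in this hunk could be enforced with a small guard. The sketch below is only an illustration; `require_openrouter_prefix` is a hypothetical helper, not a function in `imas_codex`, and it checks the model string only (it does not talk to the proxy or inspect caching).

```python
OPENROUTER_PREFIX = "openrouter/"


def require_openrouter_prefix(model: str) -> str:
    """Return the model name unchanged if it starts with 'openrouter/'.

    Hypothetical helper for illustration only; raises ValueError otherwise.
    """
    if not model.startswith(OPENROUTER_PREFIX):
        raise ValueError(
            f"model {model!r} must start with {OPENROUTER_PREFIX!r}"
        )
    return model
```

Calling it with `openrouter/anthropic/claude-sonnet-4-5` returns the name as-is, while a bare `anthropic/claude-sonnet-4-5` raises `ValueError`.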
 ## SSH Tunnels
 
 ```bash
 
@@ -712,23 +712,53 @@ Azure Web App has continuous deployment enabled on ACR. When a new image appears
 
 | Command | Purpose | Key Options |
 |---------|---------|-------------|
-| `sn build` | Generate standard names from DD paths or facility signals via LLM pipeline | `--source {dd,signals}`, `--ids`, `--domain`, `--facility`, `--cost-limit`, `--dry-run`, `--force`, `--skip-review` |
+| `sn mint` | Generate standard names from DD paths or facility signals via LLM pipeline | `--source {dd,signals}`, `--ids`, `--domain`, `--facility`, `--cost-limit`, `--dry-run`, `--force`, `--skip-review`, `--reset-to` |
 | `sn publish` | Export validated StandardName nodes to YAML catalog files | `--output-dir`, `--ids`, `--domain`, `--group-by {ids,domain,confidence}`, `--confidence-min`, `--catalog-dir`, `--create-pr` |
 | `sn import` | Import reviewed YAML catalog entries back into graph | `--catalog-dir` (required), `--tags`, `--dry-run`, `--check` |
 | `sn status` | Show standard name statistics from graph | — |
+| `sn reset` | Reset standard names for re-processing | `--status` (required), `--to`, `--source`, `--ids`, `--dry-run` |
+| `sn clear` | Delete standard names from the graph (relationship-first safety model) | `--status`, `--all`, `--source`, `--ids`, `--include-accepted`, `--dry-run` |
 | `sn benchmark` | Benchmark LLM models on standard name generation quality | `--models`, `--source`, `--reviewer-model` |
 
+### Benchmark
+
+`sn benchmark` uses the same prompt pipeline as `sn mint` (system/user message split via
+`build_compose_context()`). The output table includes a **Cache %** column showing the
+prompt-cache hit rate per model; caching is provider-side (OpenRouter), not implemented here.
+Scoring is **5-dimensional**: accuracy, completeness, physics_correctness, naming_convention,
+and overall, evaluated by a reviewer LLM against a gold reference set (`benchmark_reference.py`,
+52 entries across 8 IDSs). The calibration dataset (`benchmark_calibration.yaml`) provides
+known-quality examples for reviewer consistency checks.
+
 ### StandardName Lifecycle
 
 ```
 drafted → published → accepted
                     ↘ rejected
 ```
 
-- **drafted**: Generated by `sn build` (LLM pipeline)
+- **drafted**: Generated by `sn mint` (LLM pipeline)
 - **published**: Exported by `sn publish` to YAML catalog for human review
 - **accepted**: Imported by `sn import` from reviewed catalog (catalog-authoritative)
 
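Review note: the lifecycle diagram above can be read as a small transition table. The sketch below is an assumption drawn from the diagram alone (forward transitions only); `TRANSITIONS` and `can_transition` are hypothetical names, and backward moves performed by reset-style commands are deliberately not modelled.

```python
# Forward transitions read off the lifecycle diagram. This mapping is an
# illustrative assumption, not the project's actual implementation.
TRANSITIONS: dict[str, set[str]] = {
    "drafted": {"published"},
    "published": {"accepted", "rejected"},
    "accepted": set(),
    "rejected": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if `target` directly follows `current` in the diagram."""
    return target in TRANSITIONS.get(current, set())
```

For example, `can_transition("drafted", "published")` is true, while `can_transition("drafted", "accepted")` is false because a name has to be published for review first.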
+### Reset and Clear Semantics
+
+**`sn reset`**: Prepares existing nodes for re-processing without deleting them. Clears transient fields
+(embedding, model, confidence, generated_at) and removes HAS_STANDARD_NAME and CANONICAL_UNITS
+relationships. Optionally changes `review_status` via `--to <status>`. Without `--to`, the status
+is left unchanged and only the fields are cleared.
+
+**`sn clear`**: Deletes StandardName nodes. Uses a relationship-first safety model: HAS_STANDARD_NAME
+edges are removed before nodes are deleted, and scoped deletes only remove orphaned nodes. Requires
+either `--status <value>` or `--all`.
+
+**Safety guard:** Both commands require `--include-accepted` to touch names with `review_status=accepted`.
+Accepted names are catalog-authoritative and should rarely be deleted from the graph.
+
+**`sn mint --reset-to`**: Runs `sn reset` before minting, scoped to the same `--ids`/`--source`
+filter. Accepts `extracted` or `drafted` as the target status. Useful for a clean re-run on a
+specific IDS without touching the rest of the graph.
+
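Review note: one way to read the relationship-first model described for `sn clear` is sketched below on an in-memory stand-in for the graph. Everything here is hypothetical (`ToyGraph`, `clear_names`, the `source_ids` scope); the real command operates on Neo4j, and this is only an interpretation of "edges first, then orphaned nodes", with the accepted-name guard included.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    # Standard name -> review_status
    names: dict[str, str] = field(default_factory=dict)
    # (source_id, name) pairs standing in for HAS_STANDARD_NAME edges
    edges: set[tuple[str, str]] = field(default_factory=set)


def clear_names(
    graph: ToyGraph,
    status: str,
    source_ids: set[str] | None = None,
    include_accepted: bool = False,
) -> list[str]:
    """Delete names with `status`, removing edges before nodes."""
    if status == "accepted" and not include_accepted:
        raise ValueError("accepted names require include_accepted=True")

    candidates = {name for name, s in graph.names.items() if s == status}

    # Step 1: drop edges to candidate names, limited to the scope if given.
    graph.edges = {
        (src, name)
        for src, name in graph.edges
        if not (name in candidates and (source_ids is None or src in source_ids))
    }

    # Step 2: delete only candidates that no longer have any edge.
    still_linked = {name for _, name in graph.edges}
    deleted = sorted(candidates - still_linked)
    for name in deleted:
        del graph.names[name]
    return deleted


graph = ToyGraph(
    names={"a": "drafted", "b": "drafted", "c": "accepted"},
    edges={("ids1", "a"), ("ids2", "a"), ("ids1", "b"), ("ids1", "c")},
)
deleted = clear_names(graph, "drafted", source_ids={"ids1"})
print(deleted)  # ['b']
```

In this toy run, name `a` survives the scoped clear because it is still linked from `ids2`, whereas `b` becomes orphaned and is deleted; the accepted name `c` is never a candidate.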
 ### Write Semantics
 
 Two distinct write paths with different semantics:
 
@@ -48,10 +48,10 @@ All status values use past tense: drafted, published, accepted, rejected, skippe
 | 13 | publish-pipeline | Lossless YAML export, batched PRs | 📋 Ready | 11 (all) | 12 (feedback loop) |
 | 14 | mcp-tools-benchmark | SN search/fetch/list MCP tools + benchmark quality | ✅ Done | 11 (embedding) | 19 |
 | 15 | import-physics-domain | Import physics_domain from catalog | ✅ Done | 12 | — |
-| ~~16~~ | ~~benchmark-parity~~ | ~~Superseded by Plan 19~~ | 🔀 Merged | — | — |
-| ~~17~~ | ~~sn-lifecycle-management~~ | ~~Superseded by Plan 19~~ | 🔀 Merged | — | — |
-| ~~18~~ | ~~benchmark-calibration~~ | ~~Superseded by Plan 19~~ | 🔀 Merged | — | — |
-| 19 | benchmark-and-lifecycle | Benchmark parity, lifecycle mgmt, calibration, model selection | 📋 Ready | 14 | — |
+| ~~16~~ | ~~benchmark-parity~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted | — | — |
+| ~~17~~ | ~~sn-lifecycle-management~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted | — | — |
+| ~~18~~ | ~~benchmark-calibration~~ | ~~Superseded by Plan 19~~ | 🗑️ Deleted | — | — |
+| 19 | benchmark-and-lifecycle | Benchmark parity, lifecycle mgmt, calibration, model selection | ✅ Complete | 14 | — |
 
 ## Deployment Waves