diff --git a/benchmarks/generation/README.md b/benchmarks/generation/README.md
new file mode 100644
index 00000000..f691a437
--- /dev/null
+++ b/benchmarks/generation/README.md
@@ -0,0 +1,180 @@
+## Token-Oriented Object Notation vs JSON: a benchmark of plain and constrained decoding generation
+
+[Token-Oriented Object Notation](https://github.com/toon-format) is a compact, human-readable encoding of the JSON data model that minimizes tokens and makes structure easy for models to follow. It's intended for LLM input as a drop-in, lossless representation of your existing JSON.
+
+While TOON is primarily designed for input, its token efficiency makes it a candidate for LLM output in specific high-volume scenarios. This benchmark compares three generation strategies across 21 models.
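+
+For illustration, here is the benchmark's `users` payload in JSON and in TOON's tabular array form (the same encoding the TOON prompts below teach the models):
+
+```json
+{"users": [
+  {"id": 1, "name": "Alice", "role": "admin"},
+  {"id": 2, "name": "Bob", "role": "staff"},
+  {"id": 3, "name": "Eve", "role": "guest"}
+]}
+```
+
+```toon
+users[3]{id,name,role}:
+  1,Alice,admin
+  2,Bob,staff
+  3,Eve,guest
+```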
+
+### Benchmark design
+
+**Gold standard:** Created from Pydantic models and serialized to `*.gold.json` (canonical JSON) and `*.gold.toon` (via `@toon-format/cli`).
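+
+"Canonical JSON" here means a byte-stable serialization, so decoded model outputs can be compared against the gold files verbatim. A minimal sketch (the exact serialization options used by `generate.py` are an assumption):
+
+```python
+import json
+
+def to_canonical_json(obj) -> str:
+    # Sorted keys and fixed separators make the encoding deterministic,
+    # so two semantically equal objects serialize to identical bytes.
+    return json.dumps(obj, sort_keys=True, separators=(",", ":"))
+```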
+
+**Test cases:**
+1. **users**: Simple tabular structure.
+2. **order**: Nested structure with array.
+3. **company**: Department and employee hierarchy (deep nesting).
+4. **invoice**: Items and totals.
+
+**Test tracks:**
+* **JSON track (J):** Plain JSON generation with Pydantic validation.
+* **JSON-SO track (JSO):** Structured output (`response_format="json_object"`) with constrained decoding. The inference engine compiles constraints (schema/grammar) into a state machine (e.g., xgrammar) to mask illegal tokens during generation, enforcing valid syntax.
+* **TOON track (T):** TOON output followed by CLI decoding. Prompts used **universal examples** (not custom-tailored to the specific schema) to ensure a fair comparison with JSON.
+
+**Sampling & evaluation:**
+* **Parameters:** Temperature 0 for deterministic output.
+* **Runs:** 10 iterations per test case per model (21 models via [Nebius API](https://tokenfactory.nebius.com/)).
+* **Process:**
+ 1. Model generates output (J, JSO, or T).
+ 2. (TOON only) CLI decodes to JSON. CLI errors trigger a **repair cycle**.
+ 3. Validation via Pydantic and data canonicalization.
+ 4. Comparison with Gold Standard.
+ 5. **Repair cycle:** If validation/comparison fails, the previous output and error text are inserted into the prompt (up to 3 attempts).
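+
+The process above can be sketched as follows (simplified from `eval.py`; the `parse` and `validate` hooks stand in for `json.loads`/the TOON CLI decode and Pydantic validation):
+
+```python
+MAX_ATTEMPTS = 3  # one-shot attempt plus up to two repairs
+
+def evaluate(generate, parse, validate, gold):
+    """Return the attempt number that succeeded, or None."""
+    feedback = ""
+    for attempt in range(1, MAX_ATTEMPTS + 1):
+        out = generate(feedback)
+        try:
+            obj = parse(out)        # json.loads, or TOON CLI decode
+            validate(obj)           # Pydantic schema validation
+            if obj == gold:         # comparison after canonicalization
+                return attempt
+            err = "Structure valid but values differ from expected gold."
+        except Exception as e:
+            err = str(e)
+        # Repair cycle: previous output and error text go back into the prompt.
+        feedback = f"\nPrevious output:\n{out}\nError:\n{err}"
+    return None
+```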
+
+### Key findings
+
+* **Aligned data ("sweet spot"):** TOON performs best on flat tabular and uniformly nested structures (e.g., the users and order cases), reaching **90.5%** 1-shot accuracy on tabular data while offering significant token savings.
+* **Prompt tax:** Unlike JSON, which is native to model training, TOON requires instructional prompting. For short outputs, this overhead reduces efficiency; for larger outputs (batches/logs), the syntax savings amortize the cost.
+* **Structured output trade-off:** Constrained decoding (JSO) acts as a safety net for smaller models (preventing syntax errors) but was found to degrade reasoning/accuracy in some larger models ("structured output paradox").
+
+### Results by data topology
+
+Performance varies significantly based on how well the data aligns with TOON's design (e.g., uniform arrays vs. deep recursive nesting).
+
+| Case | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **users** | 94.8% | 94.8% | 1078 | 92.9% | **100%** | 556 | **90.5%** | 90.5% | 840 |
+| **order** | 81.9% | 81.9% | 1746 | 78.6% | 83.3% | 1255 | 74.3% | 78.6% | 1585 |
+| **company** | 18.6% | 43.8% | 3575 | **21.9%** | 48.1% | 2592 | 0.0% | **48.6%** | 2567 |
+| **invoice** | 90.0% | 90.0% | 1723 | 87.6% | **95.2%** | 1349 | 0.0% | 52.4% | 3626 |
+
+### Full results by model
+
+The following table compares **1-shot accuracy (1-S)**, **final accuracy (Fin)** after repair loops, and the total **token budget (Tok)** consumed across all attempts.
+
+| Model | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **NousResearch/Hermes-4-405B** | 92.5% | 92.5% | 3252 | 35.0% | **100%** | 4759 | 50.0% | 60.0% | 4671 |
+| **NousResearch/Hermes-4-70B** | 75.0% | 75.0% | 4414 | 37.5% | 75.0% | 5594 | 50.0% | 50.0% | 4738 |
+| **PrimeIntellect/INTELLECT-3** | 72.5% | 75.0% | 10682 | 72.5% | 77.5% | 10103 | 40.0% | 65.0% | 13315 |
+| **Qwen/Qwen2.5-Coder-7B-fast** | 0.0% | 0.0% | 37705 | 75.0% | 75.0% | 4440 | 27.5% | 27.5% | 32715 |
+| **Qwen/Qwen3-235B-A22B-Inst** | **100%** | **100%** | 2772 | **100%** | **100%** | 2772 | 50.0% | **100%** | 4715 |
+| **Qwen/Qwen3-235B-A22B-Thk** | 82.5% | 82.5% | 11425 | 87.5% | 97.5% | 7899 | 50.0% | 97.5% | 17457 |
+| **Qwen/Qwen3-30B-A3B-Inst** | 75.0% | 75.0% | 4436 | 75.0% | 75.0% | 4436 | 50.0% | 70.0% | 5505 |
+| **Qwen/Qwen3-32B** | 75.0% | 77.5% | 10196 | 75.0% | 75.0% | 4120 | 47.5% | 80.0% | 9101 |
+| **Qwen/Qwen3-Coder-30B-A3B** | 75.0% | 75.0% | 4206 | 75.0% | 75.0% | 4206 | 50.0% | **100%** | 4719 |
+| **Qwen/Qwen3-Coder-480B** | 75.0% | 75.0% | 4462 | 75.0% | 75.0% | 4447 | 50.0% | 75.0% | 4515 |
+| **deepseek-ai/DeepSeek-R1** | 55.0% | 70.0% | 13811 | 65.0% | 80.0% | 4149 | 25.0% | 50.0% | 19047 |
+| **deepseek-ai/DeepSeek-V3-fast** | 75.0% | **100%** | 3600 | 75.0% | **100%** | 3584 | 25.0% | 80.0% | 4734 |
+| **google/gemma-2-2b-it** | 75.0% | **100%** | 4721 | 77.5% | **100%** | 4566 | 0.0% | 0.0% | 5955 |
+| **google/gemma-2-9b-it-fast** | 75.0% | 75.0% | 6086 | 75.0% | 75.0% | 6056 | 50.0% | 75.0% | 5419 |
+| **meta-llama/Llama-3.3-70B** | 75.0% | 75.0% | 4551 | 75.0% | 75.0% | 4447 | 50.0% | 50.0% | 5148 |
+| **meta-llama/Llama-3.1-8B** | 72.5% | 72.5% | 7235 | 75.0% | 75.0% | 6941 | 22.5% | 25.0% | 4915 |
+| **moonshotai/Kimi-K2-Instruct** | 50.0% | 75.0% | 4284 | 50.0% | 75.0% | 4283 | 50.0% | **100%** | 3937 |
+| **nvidia/Llama-3_1-Nemotron** | 75.0% | 75.0% | 4426 | 50.0% | 50.0% | 5714 | 50.0% | 82.5% | 4368 |
+| **openai/gpt-oss-120b** | **97.5%** | **100%** | 3685 | **100%** | **100%** | 3545 | 50.0% | 87.5% | 8223 |
+| **openai/gpt-oss-20b** | 50.0% | 72.5% | 14943 | 50.0% | 67.5% | 15601 | 50.0% | 90.0% | 9678 |
+| **zai-org/GLM-4.5** | 75.0% | 87.5% | 9677 | 75.0% | 92.5% | 9135 | 27.5% | 52.5% | 8110 |
+
+### Observations
+
+**1. The "Structured Output Paradox"**
+Constrained decoding is not always superior. For `Hermes-4-405B`, applying constraints dropped 1-shot accuracy from **92.5%** (Plain JSON) to **35.0%** (Structured Output). This suggests that for some high-reasoning models, forcing specific grammar paths can actively interfere with the model's logic capabilities.
+
+**2. Guardrails for smaller models**
+Conversely, for smaller models like `Qwen/Qwen2.5-Coder-7B-fast`, structured output is essential. It raised performance from a catastrophic **0%** (Plain JSON) to a viable **75%**.
+
+**3. TOON repair potential**
+While TOON often has lower initial 1-shot accuracy due to the novelty of the format, several models (`Qwen/Qwen3-Coder-30B`, `Kimi-K2-Instruct`, `Qwen/Qwen3-235B`) achieved **100% final accuracy** after repair loops. This indicates that while the format may be unfamiliar initially, the error messages provided by the TOON CLI are highly effective for self-correction.
+
+**4. Token efficiency scaling**
+In cases like `Qwen3-235B-A22B-Inst`, TOON consumed significantly more tokens (~4700) than JSON (~2700). This confirms the "prompt tax" hypothesis: for short tasks, the instructional overhead outweighs the syntax savings. TOON becomes efficient primarily in high-volume generation where the output length justifies the system prompt.
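+
+A rough break-even sketch (all per-row figures below are illustrative assumptions, not benchmark measurements): TOON pays a fixed instruction overhead once, then saves tokens on every row, so it wins once the output is long enough.
+
+```python
+# Illustrative "prompt tax" break-even; numbers are assumptions for the sketch.
+toon_instruction_overhead = 400  # extra prompt tokens for TOON rules + example
+json_tokens_per_row = 25         # output tokens per record as JSON
+toon_tokens_per_row = 12         # output tokens per record as TOON
+
+savings_per_row = json_tokens_per_row - toon_tokens_per_row
+break_even_rows = toon_instruction_overhead / savings_per_row  # ~31 rows
+```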
+
+### Analysis & recommendations
+
+1. **Aligned data streams:** Use TOON generation for **SQL dumps, logs, and transactional documents**. The token savings on high-volume, uniform data outweigh the prompt overhead.
+2. **Avoid deep nesting:** For deeply nested or recursive state trees (like DOMs), stick to **JSON** or **JSO**. TOON's indentation tracking is less robust for these structures in one-shot generation.
+3. **Repair loops:** TOON generation benefits disproportionately from repair loops (feeding errors back to context), often correcting format issues that initial constrained decoding cannot fix.
+
+
+### Installation & Usage
+
+This repository contains two main scripts:
+
+- **`generate.py`** – builds the gold-standard reference outputs used for evaluation.
+- **`eval.py`** – runs the full benchmark across all models and decoding strategies.
+
+Before running the benchmark, install dependencies and create the gold files.
+
+### **1. Install Python dependencies**
+
+```bash
+pip install -r requirements.txt
+```
+
+### **2. Install the TOON CLI (required for encoding/decoding)**
+
+```bash
+npm install -g @toon-format/cli
+```
+
+Alternatively, you can rely on `npx` without a global install.
+
+### **3. Generate the gold reference outputs**
+
+This step must be run **once**, or whenever you modify the schemas in `generate.py`:
+
+```bash
+python generate.py
+```
+
+This will create:
+
+```
+gold/users.gold.json gold/users.gold.toon
+gold/order.gold.json gold/order.gold.toon
+gold/company.gold.json gold/company.gold.toon
+gold/invoice.gold.json gold/invoice.gold.toon
+```
+
+### **4. Set your model API key**
+
+The benchmark uses the Nebius Token Factory API. Set:
+
+```bash
+export LLM_API_KEY="your_nebius_api_key"
+```
+
+### **5. Run the full benchmark**
+
+```bash
+python eval.py
+```
+
+This will:
+
+- Run all test cases for 21 models
+- Perform JSON, JSON-SO, and TOON generation
+- Decode TOON outputs via CLI
+- Validate against the gold standard
+- Apply repair loops
+- Write per-run statistics to:
+
+```
+eval_runs.csv
+```
+
+### **Repository structure**
+
+```
+├── generate.py # Defines schemas, builds gold objects, writes gold/*.json + *.toon
+├── eval.py # Full benchmark runner
+├── gold/ # Auto-generated canonical reference data
+│   ├── *.gold.json
+│   └── *.gold.toon
+├── requirements.txt
+└── README.md
+```
+
+
diff --git a/benchmarks/generation/eval.py b/benchmarks/generation/eval.py
new file mode 100644
index 00000000..53d65a24
--- /dev/null
+++ b/benchmarks/generation/eval.py
@@ -0,0 +1,876 @@
+# eval.py
+import json
+import os
+import re
+import csv
+import subprocess
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Tuple, Type
+
+from pydantic import BaseModel, TypeAdapter
+from openai import OpenAI
+from openai import APIError, InternalServerError, RateLimitError
+
+# --- Import Pydantic models from your generate.py ---
+from generate import (
+ UserRow, Order,
+ Company, Invoice,
+)
+
+# =========================================
+# Config: models + runs + output CSV
+# =========================================
+MODELS = [
+    'deepseek-ai/DeepSeek-V3-0324-fast',
+    'openai/gpt-oss-120b',
+    'moonshotai/Kimi-K2-Instruct',
+    'Qwen/Qwen3-Coder-480B-A35B-Instruct',
+    'NousResearch/Hermes-4-405B',
+    'NousResearch/Hermes-4-70B',
+    'openai/gpt-oss-20b',
+    'zai-org/GLM-4.5',
+    'deepseek-ai/DeepSeek-R1-0528',
+    'PrimeIntellect/INTELLECT-3',
+    'Qwen/Qwen3-235B-A22B-Thinking-2507',
+    'Qwen/Qwen3-235B-A22B-Instruct-2507',
+    'Qwen/Qwen3-30B-A3B-Instruct-2507',
+    'Qwen/Qwen3-Coder-30B-A3B-Instruct',
+    'Qwen/Qwen3-32B',
+    'nvidia/Llama-3_1-Nemotron-Ultra-253B-v1',
+    'meta-llama/Llama-3.3-70B-Instruct',
+    'meta-llama/Meta-Llama-3.1-8B-Instruct',
+    'Qwen/Qwen2.5-Coder-7B-fast',
+    'google/gemma-2-2b-it',
+    'google/gemma-2-9b-it-fast',
+]
+RUNS_PER_MODEL = 10
+CSV_PATH = Path("eval_runs.csv")
+
+# =========================================
+# LLM client
+# =========================================
+LLM_API_KEY = os.environ.get("LLM_API_KEY")
+if not LLM_API_KEY:
+ raise RuntimeError("Missing LLM_API_KEY environment variable")
+
+client = OpenAI(
+ base_url="https://api.studio.nebius.com/v1/",
+ api_key=LLM_API_KEY,
+)
+
+SYSTEM_PROMPT = (
+ "You are a data-formatting model. "
+ "Follow instructions exactly. When asked for JSON, you must return JSON that conforms to the provided JSON Schema. "
+ "No extra text. When asked for TOON, return only a ```toon fenced block."
+)
+
+# =========================================
+# Retry wrapper for API calls
+# =========================================
+def retry_on_error(func, max_retries=5, initial_delay=2.0):
+ """Retry a function with exponential backoff on API errors."""
+ for attempt in range(max_retries):
+ try:
+ return func()
+ except (InternalServerError, APIError, RateLimitError) as e:
+ if attempt == max_retries - 1:
+ print(f"Failed after {max_retries} attempts: {e}")
+ raise
+
+ delay = initial_delay * (2 ** attempt)
+ print(f"API error (attempt {attempt + 1}/{max_retries}): {e}")
+ print(f"Retrying in {delay:.1f} seconds...")
+ time.sleep(delay)
+        except Exception:
+ # Don't retry on other exceptions (validation errors, etc.)
+ raise
+
+# =========================================
+# Structured JSON call (json_schema)
+# =========================================
+def llm_call_json_structured(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]:
+ """Return (json_text, prompt_tokens, completion_tokens) with JSON object output."""
+ print(f"Calling {model} json_structured")
+ # Add schema to prompt for guidance
+ schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}"
+
+ def _call():
+ resp = client.chat.completions.create(
+ model=model,
+ max_tokens=10000,
+ temperature=0.0,
+ top_p=1.0,
+ extra_body={"top_k": 50},
+ response_format={
+ "type": "json_object",
+ },
+ messages=[
+ {"role": "system", "content": SYSTEM_PROMPT},
+ {"role": "user", "content": schema_prompt},
+ ],
+ )
+
+ msg = resp.choices[0].message
+
+ # Handle refusal
+ if msg.refusal:
+ raise ValueError(f"Model refused: {msg.refusal}")
+
+ text = (msg.content or "").strip()
+ usage = getattr(resp, "usage", None)
+ p = getattr(usage, "prompt_tokens", 0) if usage else 0
+ c = getattr(usage, "completion_tokens", 0) if usage else 0
+
+ return text, p, c
+
+ return retry_on_error(_call)
+
+# =========================================
+# Plain JSON call (no response_format)
+# =========================================
+def llm_call_json_plain(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]:
+ """Return (json_text, prompt_tokens, completion_tokens) with plain text completion."""
+ print(f"Calling {model} json_plain")
+ # Add schema to prompt for guidance
+ schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}"
+
+ def _call():
+ resp = client.chat.completions.create(
+ model=model,
+ max_tokens=10000,
+ temperature=0.0,
+ top_p=1.0,
+ extra_body={"top_k": 50},
+ messages=[
+ {"role": "system", "content": SYSTEM_PROMPT},
+ {"role": "user", "content": schema_prompt},
+ ],
+ )
+
+ text = resp.choices[0].message.content or ""
+        # Strip <think>...</think> reasoning blocks emitted by thinking models
+        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
+ # Remove markdown code fences if present
+ text = re.sub(r"```(?:json)?\s*(.*?)```", r"\1", text, flags=re.DOTALL).strip()
+
+ usage = getattr(resp, "usage", None)
+ p = getattr(usage, "prompt_tokens", 0) if usage else 0
+ c = getattr(usage, "completion_tokens", 0) if usage else 0
+
+ return text, p, c
+
+ return retry_on_error(_call)
+
+# =========================================
+# Plain call (for TOON generation)
+# =========================================
+def llm_call_plain(model: str, prompt: str) -> Tuple[str, int, int]:
+ print(f"Calling {model} plain")
+
+ def _call():
+ resp = client.chat.completions.create(
+ model=model,
+ max_tokens=10000,
+ temperature=0.0,
+ top_p=1.0,
+ extra_body={"top_k": 50},
+ messages=[
+ {"role": "system", "content": SYSTEM_PROMPT},
+ {"role": "user", "content": prompt},
+ ],
+ )
+ text = resp.choices[0].message.content or ""
+        # Strip <think>...</think> reasoning blocks emitted by thinking models
+        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
+ usage = getattr(resp, "usage", None)
+ p = getattr(usage, "prompt_tokens", 0) if usage else 0
+ c = getattr(usage, "completion_tokens", 0) if usage else 0
+ return text, p, c
+
+ return retry_on_error(_call)
+
+# =========================================
+# Paths
+# =========================================
+GOLD = Path("gold")
+USERS_JSON = GOLD / "users.gold.json"
+ORDER_JSON = GOLD / "order.gold.json"
+COMPANY_JSON = GOLD / "company.gold.json"
+INVOICE_JSON = GOLD / "invoice.gold.json"
+
+# =========================================
+# Canonicalization (stable compare)
+# =========================================
+def sort_users_by_id(obj: Dict[str, Any]) -> Dict[str, Any]:
+ if "users" in obj and isinstance(obj["users"], list):
+ obj["users"] = sorted(obj["users"], key=lambda r: r.get("id"))
+ return obj
+
+def sort_order_items(obj: Dict[str, Any]) -> Dict[str, Any]:
+ if "items" in obj and isinstance(obj["items"], list):
+ obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku"))
+ return obj
+
+def sort_company(obj: Dict[str, Any]) -> Dict[str, Any]:
+ if "departments" in obj and isinstance(obj["departments"], list):
+ obj["departments"] = sorted(obj["departments"], key=lambda d: d.get("code"))
+ for d in obj["departments"]:
+ if isinstance(d, dict) and "employees" in d and isinstance(d["employees"], list):
+ d["employees"] = sorted(d["employees"], key=lambda e: e.get("id"))
+ return obj
+
+def sort_invoice(obj: Dict[str, Any]) -> Dict[str, Any]:
+ if "items" in obj and isinstance(obj["items"], list):
+ obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku"))
+ return obj
+
+def canonical_json(obj: Any, case: str) -> Any:
+ if case == "users": return sort_users_by_id(obj)
+ if case == "order": return sort_order_items(obj)
+ if case == "company": return sort_company(obj)
+ if case == "invoice": return sort_invoice(obj)
+ return obj
+
+# =========================================
+# Pydantic validation (+ shape normalization)
+# =========================================
+class UsersPayload(BaseModel):
+ users: List[UserRow]
+
+def validate_users_json(data: Any) -> List[UserRow]:
+ if not isinstance(data, dict) or "users" not in data:
+ raise ValueError("Expected object with key 'users'")
+ adapter = TypeAdapter(List[UserRow])
+ return adapter.validate_python(data["users"])
+
+def normalize_by_key(data: Any, key: str) -> Any:
+ if isinstance(data, dict) and key in data and isinstance(data[key], dict):
+ return data[key]
+ return data
+
+def validate_order_json(data: Any) -> Order:
+ data = normalize_by_key(data, "order") # TOON may wrap
+ adapter = TypeAdapter(Order)
+ return adapter.validate_python(data)
+
+def validate_company_json(data: Any) -> Company:
+ data = normalize_by_key(data, "company")
+ adapter = TypeAdapter(Company)
+ return adapter.validate_python(data)
+
+def validate_invoice_json(data: Any) -> Invoice:
+ data = normalize_by_key(data, "invoice")
+ adapter = TypeAdapter(Invoice)
+ return adapter.validate_python(data)
+
+# =========================================
+# TOON decode via official CLI
+# =========================================
+def extract_toon_payload(toon_text: str) -> str:
+ m = re.search(r"```toon\s*(.*?)```", toon_text, flags=re.DOTALL | re.IGNORECASE)
+ return m.group(1).strip() if m else toon_text.strip()
+
+def decode_toon_to_json(toon_text: str) -> Any:
+ payload = extract_toon_payload(toon_text)
+ proc = subprocess.run(
+ ["npx", "@toon-format/cli", "--decode"],
+ input=payload.encode("utf-8"),
+ capture_output=True,
+ check=True,
+ )
+ return json.loads(proc.stdout.decode("utf-8"))
+
+# =========================================
+# Prompts — JSON (structured) / TOON
+# =========================================
+def make_json_prompt_users() -> str:
+ return (
+ "Create a user directory with three users:\n"
+ "- User 1: Alice, who is an admin\n"
+ "- User 2: Bob, who is a staff member\n"
+ "- User 3: Eve, who is a guest\n\n"
+ "Return the data as JSON with a 'users' array containing objects with id, name, and role fields."
+ )
+
+def make_json_prompt_order() -> str:
+ return (
+ "Create an order record:\n"
+ "- Order ID: 101\n"
+ "- Customer: Ada (ID: 9)\n"
+ "- Items:\n"
+ " * Product A1: quantity 2, price $9.99 each\n"
+ " * Product B2: quantity 1, price $14.50 each\n\n"
+ "Return as JSON with fields for id, customer (with id and name), and items array (with sku, qty, price)."
+ )
+
+def make_json_prompt_company() -> str:
+ return (
+ "Create a company organization structure:\n"
+ "- Company: Acme (ID: 1)\n"
+ "- Engineering Department (code: ENG):\n"
+ " * Alice (ID: 1) - engineer\n"
+ " * Bob (ID: 2) - manager\n"
+ "- Operations Department (code: OPS):\n"
+ " * Eve (ID: 3) - analyst\n\n"
+ "Return as JSON with company info and nested departments array, each containing employees."
+ )
+
+def make_json_prompt_invoice() -> str:
+ return (
+ "Create an invoice:\n"
+ "- Invoice number: INV-2025-001\n"
+ "- Currency: USD\n"
+ "- Customer: Ada (ID: 9)\n"
+ "- Line items:\n"
+ " * A1: quantity 2 @ $9.99 each = $19.98\n"
+ " * B2: quantity 1 @ $14.50 each = $14.50\n"
+ "- Subtotal: $34.48\n"
+ "- Tax: $6.90\n"
+ "- Grand total: $41.38\n"
+ "- Notes: Thank you for your business.\n\n"
+ "Return as JSON with all invoice details including items array and totals breakdown."
+ )
+
+# =========================================
+# Improved TOON Prompts (short nested examples + same tasks)
+# =========================================
+
+def make_toon_prompt_users() -> str:
+ return (
+ "You are to produce output STRICTLY in TOON format.\n\n"
+ "TOON RULES:\n"
+ "- Use 2-space indentation\n"
+ "- Scalars: fieldName: value\n"
+ "- Objects: fieldName: then nested fields indented\n"
+ "- Arrays of objects:\n"
+ " arrayName[N]:\n"
+ " - field1: value1\n"
+ " field2: value2\n"
+ "- Tabular arrays (for simple data):\n"
+ " arrayName[N]{field1,field2}:\n"
+ " val1,val2\n"
+ " val3,val4\n"
+ "- [N] MUST equal actual row/item count\n"
+ "- Output ONLY a ```toon code block\n\n"
+ "Reference example:\n"
+ "```toon\n"
+ "id: 100\n"
+ "type: Sample\n"
+ "metadata:\n"
+ " version: 1\n"
+ " author: Alex\n"
+ "sections[2]:\n"
+ " - code: A\n"
+ " title: Introduction\n"
+ " items[2]{id,value}:\n"
+ " 1,First\n"
+ " 2,Second\n"
+ " - code: B\n"
+ " title: Details\n"
+ " items[1]{id,value}:\n"
+ " 3,Third\n"
+ "summary:\n"
+ " total: 3\n"
+ " status: complete\n"
+ "```\n\n"
+ "TASK:\n"
+ "Create an array named users with fields id, name, and role.\n"
+ "User data:\n"
+ "- id=1, name=Alice, role=admin\n"
+ "- id=2, name=Bob, role=staff\n"
+ "- id=3, name=Eve, role=guest\n\n"
+ "Output only the TOON code block.\n"
+ )
+
+def make_toon_prompt_order() -> str:
+ return (
+ "You are to produce output STRICTLY in TOON format.\n\n"
+ "TOON RULES:\n"
+ "- Use 2-space indentation\n"
+ "- Scalars: fieldName: value\n"
+ "- Objects: fieldName: then nested fields indented\n"
+ "- Arrays of objects:\n"
+ " arrayName[N]:\n"
+ " - field1: value1\n"
+ " field2: value2\n"
+ "- Tabular arrays (for simple data):\n"
+ " arrayName[N]{field1,field2}:\n"
+ " val1,val2\n"
+ " val3,val4\n"
+ "- [N] MUST equal actual row/item count\n"
+ "- Output ONLY a ```toon code block\n\n"
+ "Reference example:\n"
+ "```toon\n"
+ "id: 100\n"
+ "type: Sample\n"
+ "metadata:\n"
+ " version: 1\n"
+ " author: Alex\n"
+ "sections[2]:\n"
+ " - code: A\n"
+ " title: Introduction\n"
+ " items[2]{id,value}:\n"
+ " 1,First\n"
+ " 2,Second\n"
+ " - code: B\n"
+ " title: Details\n"
+ " items[1]{id,value}:\n"
+ " 3,Third\n"
+ "summary:\n"
+ " total: 3\n"
+ " status: complete\n"
+ "```\n\n"
+ "TASK:\n"
+ "Create an order record with fields: id, customer (with id and name), "
+ "and items array (with sku, qty, price).\n"
+ "- Order ID: 101\n"
+ "- Customer: Ada (ID: 9)\n"
+ "- Items:\n"
+ " * Product A1: quantity 2, price $9.99 each\n"
+ " * Product B2: quantity 1, price $14.50 each\n"
+ )
+
+
+def make_toon_prompt_company() -> str:
+ return (
+ "You are to produce output STRICTLY in TOON format.\n\n"
+ "TOON RULES:\n"
+ "- Use 2-space indentation\n"
+ "- Scalars: fieldName: value\n"
+ "- Objects: fieldName: then nested fields indented\n"
+ "- Arrays of objects:\n"
+ " arrayName[N]:\n"
+ " - field1: value1\n"
+ " field2: value2\n"
+ "- Tabular arrays (for simple data):\n"
+ " arrayName[N]{field1,field2}:\n"
+ " val1,val2\n"
+ " val3,val4\n"
+ "- [N] MUST equal actual row/item count\n"
+ "- Output ONLY a ```toon code block\n\n"
+ "Reference example:\n"
+ "```toon\n"
+ "id: 100\n"
+ "type: Sample\n"
+ "metadata:\n"
+ " version: 1\n"
+ " author: Alex\n"
+ "sections[2]:\n"
+ " - code: A\n"
+ " title: Introduction\n"
+ " items[2]{id,value}:\n"
+ " 1,First\n"
+ " 2,Second\n"
+ " - code: B\n"
+ " title: Details\n"
+ " items[1]{id,value}:\n"
+ " 3,Third\n"
+ "summary:\n"
+ " total: 3\n"
+ " status: complete\n"
+ "```\n\n"
+ "TASK:\n"
+ "Create a company organization structure with company info and nested departments array, each containing employees:\n"
+ "- Company: Acme (ID: 1)\n"
+ "- Engineering Department (code: ENG):\n"
+ " * Alice (ID: 1) - engineer\n"
+ " * Bob (ID: 2) - manager\n"
+ "- Operations Department (code: OPS):\n"
+ " * Eve (ID: 3) - analyst\n\n"
+ )
+
+
+def make_toon_prompt_invoice() -> str:
+ return (
+ "You are to produce output STRICTLY in TOON format.\n\n"
+ "TOON RULES:\n"
+ "- Use 2-space indentation\n"
+ "- Scalars: fieldName: value\n"
+ "- Objects: fieldName: then nested fields indented\n"
+ "- Arrays of objects:\n"
+ " arrayName[N]:\n"
+ " - field1: value1\n"
+ " field2: value2\n"
+ "- Tabular arrays (for simple data):\n"
+ " arrayName[N]{field1,field2}:\n"
+ " val1,val2\n"
+ " val3,val4\n"
+ "- [N] MUST equal actual row/item count\n"
+ "- Output ONLY a ```toon code block\n\n"
+ "Reference example:\n"
+ "```toon\n"
+ "id: 100\n"
+ "type: Sample\n"
+ "metadata:\n"
+ " version: 1\n"
+ " author: Alex\n"
+ "sections[2]:\n"
+ " - code: A\n"
+ " title: Introduction\n"
+ " items[2]{id,value}:\n"
+ " 1,First\n"
+ " 2,Second\n"
+ " - code: B\n"
+ " title: Details\n"
+ " items[1]{id,value}:\n"
+ " 3,Third\n"
+ "summary:\n"
+ " total: 3\n"
+ " status: complete\n"
+ "```\n\n"
+ "TASK:\n"
+ "Create an invoice with all invoice details including items array and totals breakdown:\n"
+ "- Invoice number: INV-2025-001\n"
+ "- Currency: USD\n"
+ "- Customer: Ada (ID: 9)\n"
+ "- Line items:\n"
+ " * A1: quantity 2 @ $9.99 each = $19.98\n"
+ " * B2: quantity 1 @ $14.50 each = $14.50\n"
+ "- Subtotal: $34.48\n"
+ "- Tax: $6.90\n"
+ "- Grand total: $41.38\n"
+ "- Notes: Thank you for your business.\n"
+ )
+# =========================================
+# Repair prompts
+# =========================================
+def make_json_repair_prompt(prev_output: str, error_msg: str) -> str:
+ return (
+ "Your previous JSON did not validate against the schema. "
+ "Return ONLY valid JSON (no prose, no fences) that matches the schema and the target values.\n"
+ f"Validation error:\n{error_msg}\n\n"
+ "Previous output:\n"
+ f"{prev_output}\n"
+ )
+
+def make_toon_repair_prompt(prev_output: str, error_msg: str) -> str:
+ return (
+ "Your previous TOON was invalid. Return ONLY a ```toon fenced block.\n"
+ "- Use 2-space indentation; no trailing spaces.\n"
+ "- Ensure headers/fieldsets and [N] match row counts.\n"
+ f"Validation/decoding error:\n{error_msg}\n\n"
+ "Previous output:\n"
+ f"{prev_output}\n"
+ )
+
+# =========================================
+# Core evaluation (one-shot + up to 2 repairs)
+# =========================================
+MAX_ATTEMPTS = 3
+
+def eval_json_track(
+ model: str,
+ make_prompt_fn,
+ schema_model: Type[BaseModel],
+ validate_fn,
+ gold_obj,
+ canon_case: str,
+):
+ tokens_p = tokens_c = 0
+ prompt = make_prompt_fn()
+ out, p, c = llm_call_json_structured(model, prompt, schema_model); tokens_p += p; tokens_c += c
+ try:
+ parsed = json.loads(out)
+ # print(f"JSON SO parsed: {parsed}")
+ validate_fn(parsed) # Pydantic
+ parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+ one_shot_ok = final_ok = (parsed == gold_obj)
+ if final_ok:
+ return dict(one_shot_ok=True, final_ok=True, attempts_used=1,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+ except Exception as e:
+ err = str(e); one_shot_ok = False; prev = out
+ else:
+ err = "Structure valid but values differ from expected gold."; prev = out
+
+ for i in range(1, MAX_ATTEMPTS):
+ repair_prompt = make_json_repair_prompt(prev, err)
+ out, p, c = llm_call_json_structured(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c
+ try:
+ parsed = json.loads(out)
+ validate_fn(parsed)
+ parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+ final_ok = (parsed == gold_obj)
+ if final_ok:
+ return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+ else:
+ err = "Structure valid but values differ from expected gold."
+ prev = out
+ except Exception as e:
+ err = str(e); prev = out
+ continue
+
+ return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+
+def eval_json_plain_track(
+ model: str,
+ make_prompt_fn,
+ schema_model: Type[BaseModel],
+ validate_fn,
+ gold_obj,
+ canon_case: str,
+):
+ """Evaluate JSON generation without response_format (plain completion)."""
+ tokens_p = tokens_c = 0
+ prompt = make_prompt_fn()
+ out, p, c = llm_call_json_plain(model, prompt, schema_model); tokens_p += p; tokens_c += c
+ try:
+ parsed = json.loads(out)
+ # print(f"JSON plain parsed: {parsed}")
+ validate_fn(parsed) # Pydantic
+ parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+ one_shot_ok = final_ok = (parsed == gold_obj)
+ if final_ok:
+ return dict(one_shot_ok=True, final_ok=True, attempts_used=1,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+ except Exception as e:
+ err = str(e); one_shot_ok = False; prev = out
+ else:
+ err = "Structure valid but values differ from expected gold."; prev = out
+
+ for i in range(1, MAX_ATTEMPTS):
+ repair_prompt = make_json_repair_prompt(prev, err)
+ out, p, c = llm_call_json_plain(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c
+ try:
+ parsed = json.loads(out)
+ validate_fn(parsed)
+ parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+ final_ok = (parsed == gold_obj)
+ if final_ok:
+ return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+ else:
+ err = "Structure valid but values differ from expected gold."
+ prev = out
+ except Exception as e:
+ err = str(e); prev = out
+ continue
+
+ return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS,
+ tokens_prompt=tokens_p, tokens_completion=tokens_c)
+
+def eval_toon_track(model: str, make_prompt_fn, validate_fn, gold_obj, canon_case: str):
+    """Generate TOON, decode via the CLI, validate, and run the repair cycle on failure."""
+    tokens_p = tokens_c = 0
+    prompt = make_prompt_fn()
+    out, p, c = llm_call_plain(model, prompt)
+    tokens_p += p
+    tokens_c += c
+    try:
+        decoded = decode_toon_to_json(out)
+        validate_fn(decoded)
+        decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case)
+        one_shot_ok = final_ok = (decoded == gold_obj)
+        if final_ok:
+            return dict(one_shot_ok=True, final_ok=True, attempts_used=1,
+                        tokens_prompt=tokens_p, tokens_completion=tokens_c)
+    except Exception as e:
+        err = str(e)
+        one_shot_ok = False
+        prev = out
+    else:
+        err = "Structure valid but values differ from expected gold."
+        prev = out
+
+    for i in range(1, MAX_ATTEMPTS):
+        repair_prompt = make_toon_repair_prompt(prev, err)
+        out, p, c = llm_call_plain(model, repair_prompt)
+        tokens_p += p
+        tokens_c += c
+        try:
+            decoded = decode_toon_to_json(out)
+            validate_fn(decoded)
+            decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case)
+            final_ok = (decoded == gold_obj)
+            if final_ok:
+                return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i + 1,
+                            tokens_prompt=tokens_p, tokens_completion=tokens_c)
+            err = "Structure valid but values differ from expected gold."
+            prev = out
+        except Exception as e:
+            err = str(e)
+            prev = out
+
+    return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS,
+                tokens_prompt=tokens_p, tokens_completion=tokens_c)
+
+# =========================================
+# Case runners aggregating metrics
+# =========================================
+# (gold path, JSON prompt builder, TOON prompt builder, Pydantic schema, validator)
+CASE_CONFIGS = {
+    "users": (USERS_JSON, make_json_prompt_users, make_toon_prompt_users, UsersPayload, validate_users_json),
+    "order": (ORDER_JSON, make_json_prompt_order, make_toon_prompt_order, Order, validate_order_json),
+    "company": (COMPANY_JSON, make_json_prompt_company, make_toon_prompt_company, Company, validate_company_json),
+    "invoice": (INVOICE_JSON, make_json_prompt_invoice, make_toon_prompt_invoice, Invoice, validate_invoice_json),
+}
+
+def run_case(model: str, case: str):
+    """Run all three tracks for one test case and flatten the metrics into keyed fields."""
+    gold_path, json_prompt_fn, toon_prompt_fn, schema_model, validate_fn = CASE_CONFIGS[case]
+    gold = canonical_json(json.loads(gold_path.read_text(encoding="utf-8")), case)
+    metrics = {
+        "json": eval_json_track(model, json_prompt_fn, schema_model, validate_fn, gold, case),
+        "json_plain": eval_json_plain_track(model, json_prompt_fn, schema_model, validate_fn, gold, case),
+        "toon": eval_toon_track(model, toon_prompt_fn, validate_fn, gold, case),
+    }
+    results = {}
+    for fmt, m in metrics.items():
+        results[f"{case}_{fmt}_one_shot"] = m["one_shot_ok"]
+        results[f"{case}_{fmt}_final"] = m["final_ok"]
+        results[f"{case}_{fmt}_attempts"] = m["attempts_used"]
+        results[f"{case}_{fmt}_tokens_prompt"] = m["tokens_prompt"]
+        results[f"{case}_{fmt}_tokens_completion"] = m["tokens_completion"]
+    return results
+
+def run_case_users(model: str):
+    return run_case(model, "users")
+
+def run_case_order(model: str):
+    return run_case(model, "order")
+
+def run_case_company(model: str):
+    return run_case(model, "company")
+
+def run_case_invoice(model: str):
+    return run_case(model, "invoice")
+
+# =========================================
+# Summary helpers
+# =========================================
+def summarize_formats(results: Dict[str, Any]) -> Dict[str, Any]:
+    cases = ["users", "order", "company", "invoice"]
+    fmts = ["json", "json_plain", "toon"]
+    n = len(cases)
+    summary = {}
+    for fmt in fmts:
+        one_shot_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_one_shot"))
+        final_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_final"))
+        prompt_tokens = sum(results.get(f"{case}_{fmt}_tokens_prompt", 0) for case in cases)
+        comp_tokens = sum(results.get(f"{case}_{fmt}_tokens_completion", 0) for case in cases)
+        summary[f"{fmt}_one_shot_accuracy"] = one_shot_hits / n if n else 0.0
+        summary[f"{fmt}_final_accuracy"] = final_hits / n if n else 0.0
+        summary[f"{fmt}_prompt_tokens"] = prompt_tokens
+        summary[f"{fmt}_completion_tokens"] = comp_tokens
+        summary[f"{fmt}_total_tokens"] = prompt_tokens + comp_tokens
+    # Cross-format totals must be computed after the loop, once every
+    # per-format entry exists (doing this inside the loop raises KeyError).
+    summary["overall_prompt_tokens"] = sum(summary[f"{fmt}_prompt_tokens"] for fmt in fmts)
+    summary["overall_completion_tokens"] = sum(summary[f"{fmt}_completion_tokens"] for fmt in fmts)
+    summary["overall_total_tokens"] = sum(summary[f"{fmt}_total_tokens"] for fmt in fmts)
+    return summary
+
+def flatten_for_csv(model: str, run_idx: int, results: Dict[str, Any]) -> Dict[str, Any]:
+ row = {"model": model, "run": run_idx}
+ for case in ["users", "order", "company", "invoice"]:
+ for fmt in ["json", "json_plain", "toon"]:
+ row[f"{case}_{fmt}_one_shot"] = results.get(f"{case}_{fmt}_one_shot", False)
+ row[f"{case}_{fmt}_final"] = results.get(f"{case}_{fmt}_final", False)
+ row[f"{case}_{fmt}_attempts"] = results.get(f"{case}_{fmt}_attempts", 0)
+ row[f"{case}_{fmt}_prompt_tokens"] = results.get(f"{case}_{fmt}_tokens_prompt", 0)
+ row[f"{case}_{fmt}_completion_tokens"] = results.get(f"{case}_{fmt}_tokens_completion", 0)
+    summary = summarize_formats(results)
+    # Copy per-format and overall summary fields programmatically instead of
+    # listing all eighteen keys by hand.
+    for fmt in ["json", "json_plain", "toon"]:
+        for key in ["one_shot_accuracy", "final_accuracy",
+                    "prompt_tokens", "completion_tokens", "total_tokens"]:
+            row[f"{fmt}_{key}"] = summary[f"{fmt}_{key}"]
+    for key in ["overall_prompt_tokens", "overall_completion_tokens", "overall_total_tokens"]:
+        row[key] = summary[key]
+ return row
+
+# =========================================
+# Main (iterate models × runs, write CSV)
+# =========================================
+if __name__ == "__main__":
+ header_fields = ["model", "run"]
+ for case in ["users", "order", "company", "invoice"]:
+ for fmt in ["json", "json_plain", "toon"]:
+ header_fields += [
+ f"{case}_{fmt}_one_shot",
+ f"{case}_{fmt}_final",
+ f"{case}_{fmt}_attempts",
+ f"{case}_{fmt}_prompt_tokens",
+ f"{case}_{fmt}_completion_tokens",
+ ]
+ header_fields += [
+ "json_one_shot_accuracy","json_final_accuracy",
+ "json_prompt_tokens","json_completion_tokens","json_total_tokens",
+ "json_plain_one_shot_accuracy","json_plain_final_accuracy",
+ "json_plain_prompt_tokens","json_plain_completion_tokens","json_plain_total_tokens",
+ "toon_one_shot_accuracy","toon_final_accuracy",
+ "toon_prompt_tokens","toon_completion_tokens","toon_total_tokens",
+ "overall_prompt_tokens","overall_completion_tokens","overall_total_tokens",
+ ]
+
+ write_header = not CSV_PATH.exists()
+ with CSV_PATH.open("a", newline="", encoding="utf-8") as f:
+ writer = csv.DictWriter(f, fieldnames=header_fields)
+ if write_header:
+ writer.writeheader()
+
+ for model in MODELS:
+ print(f"Processing {model}...")
+ for run_idx in range(1, RUNS_PER_MODEL + 1):
+ print(f"Run {run_idx}...")
+ results: Dict[str, Any] = {}
+ results.update(run_case_users(model))
+ print("Users done")
+ results.update(run_case_order(model))
+ print("Order done")
+ results.update(run_case_company(model))
+ print("Company done")
+ results.update(run_case_invoice(model))
+ print("Invoice done")
+ row = flatten_for_csv(model, run_idx, results)
+ writer.writerow(row)
+
+ print(f"Wrote per-run stats to {CSV_PATH.resolve()}")
\ No newline at end of file
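The `eval_*_track` functions above all share one repair-cycle shape: generate once, validate against the gold object, and on failure feed the previous output plus the error message back to the model for up to `MAX_ATTEMPTS` total tries. A minimal standalone sketch of that loop, with toy `generate`/`validate` stand-ins in place of the benchmark's actual LLM calls and prompts:

```python
MAX_ATTEMPTS = 3

def run_with_repair(generate, validate, gold):
    """Generic generate -> validate -> repair loop, mirroring the track evaluators."""
    out = generate(prev=None, err=None)
    attempts = 1
    one_shot_ok = validate(out) and out == gold
    ok = one_shot_ok
    prev, err = out, None if ok else "mismatch"
    while not ok and attempts < MAX_ATTEMPTS:
        out = generate(prev=prev, err=err)  # repair prompt would embed prev + err
        attempts += 1
        ok = validate(out) and out == gold
        prev = out
    return {"one_shot_ok": one_shot_ok, "final_ok": ok, "attempts_used": attempts}

# Toy stand-ins: the first attempt is malformed JSON, the "repair" succeeds.
outputs = iter(['{"id": 1', '{"id": 1}'])

def toy_generate(prev, err):
    return next(outputs)

def toy_validate(s):
    import json
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

result = run_with_repair(toy_generate, toy_validate, '{"id": 1}')
# result == {"one_shot_ok": False, "final_ok": True, "attempts_used": 2}
```

The evaluators additionally accumulate prompt/completion tokens across attempts, so a format that needs repairs pays for them in the token totals as well as in accuracy.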
diff --git a/benchmarks/generation/generate.py b/benchmarks/generation/generate.py
new file mode 100644
index 00000000..8646c552
--- /dev/null
+++ b/benchmarks/generation/generate.py
@@ -0,0 +1,168 @@
+# generate.py
+from typing import List, Literal, Optional
+import json
+import subprocess
+from pathlib import Path
+from pydantic import BaseModel, ConfigDict, Field
+
+# ---------- Pydantic models (simple) ----------
+class UserRow(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ id: int
+ name: str = Field(min_length=1)
+ role: Literal['admin', 'staff', 'guest']
+
+class Customer(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ id: int
+ name: str
+
+class OrderItem(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ sku: str
+ qty: int
+ price: float
+
+class Order(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ id: int
+ customer: Customer
+ items: List[OrderItem]
+
+
+# ---------- Pydantic models (more complex #1: company) ----------
+class Employee(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ id: int
+ name: str
+ title: Literal['engineer', 'manager', 'analyst']
+
+class Department(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ code: str
+ name: str
+ employees: List[Employee]
+
+class Company(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ id: int
+ name: str
+ departments: List[Department]
+
+
+# ---------- Pydantic models (more complex #2: invoice) ----------
+class InvoiceLine(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ sku: str
+ qty: int
+ unit_price: float
+ line_total: float # keep explicit to avoid computed logic here
+
+class Totals(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ subtotal: float
+ tax: float
+ grand_total: float
+
+class Invoice(BaseModel):
+ model_config = ConfigDict(extra='forbid', strict=True)
+ number: str
+ currency: Literal['USD', 'EUR', 'SAR']
+ customer: Customer
+ items: List[InvoiceLine]
+ totals: Totals
+ notes: Optional[str] = None
+
+
+# ---------- Create gold Python objects ----------
+# 1) Tabular users
+users = [
+ UserRow(id=1, name="Alice", role="admin"),
+ UserRow(id=2, name="Bob", role="staff"),
+ UserRow(id=3, name="Eve", role="guest"),
+]
+users_gold = {"users": [u.model_dump() for u in users]}
+
+# 2) Nested order
+order_gold = Order(
+ id=101,
+ customer=Customer(id=9, name="Ada"),
+ items=[
+ OrderItem(sku="A1", qty=2, price=9.99),
+ OrderItem(sku="B2", qty=1, price=14.50),
+ ],
+).model_dump()
+
+# 3) More complex: company with nested tabular arrays
+company_gold = Company(
+ id=1,
+ name="Acme",
+ departments=[
+ Department(
+ code="ENG",
+ name="Engineering",
+ employees=[
+ Employee(id=1, name="Alice", title="engineer"),
+ Employee(id=2, name="Bob", title="manager"),
+ ],
+ ),
+ Department(
+ code="OPS",
+ name="Operations",
+ employees=[
+ Employee(id=3, name="Eve", title="analyst"),
+ ],
+ ),
+ ],
+).model_dump()
+
+# 4) More complex: invoice with nested objects + tabular line items
+invoice_gold = Invoice(
+ number="INV-2025-001",
+ currency="USD",
+ customer=Customer(id=9, name="Ada"),
+ items=[
+ InvoiceLine(sku="A1", qty=2, unit_price=9.99, line_total=19.98),
+ InvoiceLine(sku="B2", qty=1, unit_price=14.50, line_total=14.50),
+ ],
+ totals=Totals(subtotal=34.48, tax=6.90, grand_total=41.38),
+ notes="Thank you for your business.",
+).model_dump()
+
+
+# ---------- Write gold JSON to disk ----------
+outdir = Path("gold")
+outdir.mkdir(exist_ok=True)
+
+def write_json(path: Path, obj) -> None:
+ path.write_text(json.dumps(obj, ensure_ascii=False, separators=(",", ":")), encoding="utf-8")
+
+users_json_path = outdir / "users.gold.json"
+order_json_path = outdir / "order.gold.json"
+company_json_path = outdir / "company.gold.json"
+invoice_json_path = outdir / "invoice.gold.json"
+
+write_json(users_json_path, users_gold)
+write_json(order_json_path, order_gold)
+write_json(company_json_path, company_gold)
+write_json(invoice_json_path, invoice_gold)
+
+
+# ---------- Use TOON CLI via npx to encode JSON -> TOON ----------
+def encode_to_toon(json_path: Path, toon_path: Path) -> None:
+ subprocess.run(
+ ["npx", "@toon-format/cli", str(json_path), "-o", str(toon_path)],
+ check=True,
+ )
+
+encode_to_toon(users_json_path, outdir / "users.gold.toon")
+encode_to_toon(order_json_path, outdir / "order.gold.toon")
+encode_to_toon(company_json_path, outdir / "company.gold.toon")
+encode_to_toon(invoice_json_path, outdir / "invoice.gold.toon")
+
+print("Wrote:")
+for p in [users_json_path, outdir / "users.gold.toon",
+ order_json_path, outdir / "order.gold.toon",
+ company_json_path, outdir / "company.gold.toon",
+ invoice_json_path, outdir / "invoice.gold.toon"]:
+ print(f" {p}")
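For the simplest tabular case, the TOON shape the CLI emits can be reproduced in a few lines. A minimal sketch for uniform-keyed rows only (`encode_tabular` is illustrative, not the real encoder; the benchmark always encodes through `npx @toon-format/cli`):

```python
def encode_tabular(key, rows):
    # Header is name[count]{field,field,...}: followed by one comma-joined line per row.
    fields = list(rows[0].keys())
    lines = [f"{key}[{len(rows)}]{{{','.join(fields)}}}:"]
    for row in rows:
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)

users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "staff"},
    {"id": 3, "name": "Eve", "role": "guest"},
]
print(encode_tabular("users", users))
# users[3]{id,name,role}:
#   1,Alice,admin
#   2,Bob,staff
#   3,Eve,guest
```

The output matches `users.gold.toon` below: the declared row count and shared field header are what let a model emit each record as a single short line instead of a repeated JSON object.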
diff --git a/benchmarks/generation/gold/company.gold.json b/benchmarks/generation/gold/company.gold.json
new file mode 100644
index 00000000..77238a84
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.json
@@ -0,0 +1 @@
+{"id":1,"name":"Acme","departments":[{"code":"ENG","name":"Engineering","employees":[{"id":1,"name":"Alice","title":"engineer"},{"id":2,"name":"Bob","title":"manager"}]},{"code":"OPS","name":"Operations","employees":[{"id":3,"name":"Eve","title":"analyst"}]}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/company.gold.toon b/benchmarks/generation/gold/company.gold.toon
new file mode 100644
index 00000000..43b015bb
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.toon
@@ -0,0 +1,12 @@
+id: 1
+name: Acme
+departments[2]:
+ - code: ENG
+ name: Engineering
+ employees[2]{id,name,title}:
+ 1,Alice,engineer
+ 2,Bob,manager
+ - code: OPS
+ name: Operations
+ employees[1]{id,name,title}:
+ 3,Eve,analyst
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.json b/benchmarks/generation/gold/invoice.gold.json
new file mode 100644
index 00000000..6b3ce1d0
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.json
@@ -0,0 +1 @@
+{"number":"INV-2025-001","currency":"USD","customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"unit_price":9.99,"line_total":19.98},{"sku":"B2","qty":1,"unit_price":14.5,"line_total":14.5}],"totals":{"subtotal":34.48,"tax":6.9,"grand_total":41.38},"notes":"Thank you for your business."}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.toon b/benchmarks/generation/gold/invoice.gold.toon
new file mode 100644
index 00000000..d1e93c91
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.toon
@@ -0,0 +1,13 @@
+number: INV-2025-001
+currency: USD
+customer:
+ id: 9
+ name: Ada
+items[2]{sku,qty,unit_price,line_total}:
+ A1,2,9.99,19.98
+ B2,1,14.5,14.5
+totals:
+ subtotal: 34.48
+ tax: 6.9
+ grand_total: 41.38
+notes: Thank you for your business.
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.json b/benchmarks/generation/gold/order.gold.json
new file mode 100644
index 00000000..c3c2a604
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.json
@@ -0,0 +1 @@
+{"id":101,"customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"price":9.99},{"sku":"B2","qty":1,"price":14.5}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.toon b/benchmarks/generation/gold/order.gold.toon
new file mode 100644
index 00000000..ebc448fc
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.toon
@@ -0,0 +1,7 @@
+id: 101
+customer:
+ id: 9
+ name: Ada
+items[2]{sku,qty,price}:
+ A1,2,9.99
+ B2,1,14.5
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.json b/benchmarks/generation/gold/users.gold.json
new file mode 100644
index 00000000..ac535be2
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.json
@@ -0,0 +1 @@
+{"users":[{"id":1,"name":"Alice","role":"admin"},{"id":2,"name":"Bob","role":"staff"},{"id":3,"name":"Eve","role":"guest"}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.toon b/benchmarks/generation/gold/users.gold.toon
new file mode 100644
index 00000000..f19feaad
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.toon
@@ -0,0 +1,4 @@
+users[3]{id,name,role}:
+ 1,Alice,admin
+ 2,Bob,staff
+ 3,Eve,guest
\ No newline at end of file
diff --git a/benchmarks/generation/requirements.txt b/benchmarks/generation/requirements.txt
new file mode 100644
index 00000000..4505df10
--- /dev/null
+++ b/benchmarks/generation/requirements.txt
@@ -0,0 +1,3 @@
+pydantic>=2.5.0
+openai>=1.0.0
+pandas>=2.0.0