diff --git a/benchmarks/generation/README.md b/benchmarks/generation/README.md
new file mode 100644
index 00000000..f691a437
--- /dev/null
+++ b/benchmarks/generation/README.md
@@ -0,0 +1,180 @@
+---
+## Token-Oriented Object Notation vs JSON: a benchmark of plain and constrained decoding generation
+
+[Token-Oriented Object Notation](https://github.com/toon-format) is a compact, human-readable encoding of the JSON data model that minimizes tokens and makes structure easy for models to follow. It is intended for LLM input as a drop-in, lossless representation of your existing JSON.
+
+While TOON is primarily designed for input, its token efficiency makes it a candidate for LLM output in specific high-volume scenarios. This benchmark compares three generation strategies across 21 models.
+
+### Benchmark design
+
+**Gold standard:** Created from Pydantic models and serialized to `*.gold.json` (canonical JSON) and `*.gold.toon` (via `@toon-format/cli`).
+
+**Test cases:**
+1. **users**: Simple tabular structure.
+2. **order**: Nested structure with an array.
+3. **company**: Department and employee hierarchy (deep nesting).
+4. **invoice**: Items and totals.
+
+**Test tracks:**
+* **JSON track (J):** Plain JSON generation with Pydantic validation.
+* **JSON-SO track (JSO):** Structured output (`response_format="json_object"`) with constrained decoding. The inference engine compiles the JSON grammar into a state machine (e.g., xgrammar) that masks illegal tokens during generation, enforcing valid syntax; the target schema is supplied in the prompt.
+* **TOON track (T):** TOON output followed by CLI decoding. Prompts used **universal examples** (not custom-tailored to the specific schema) to ensure a fair comparison with JSON.
+
+**Sampling & evaluation:**
+* **Parameters:** Temperature 0 for (near-)deterministic output.
+* **Runs:** 10 iterations per test case per model (21 models via [Nebius API](https://tokenfactory.nebius.com/)).
+* **Process:**
+  1. Model generates output (J, JSO, or T).
+  2. 
(TOON only) CLI decodes to JSON. CLI errors trigger a **repair cycle**.
+  3. Validation via Pydantic and data canonicalization.
+  4. Comparison with the gold standard.
+  5. **Repair cycle:** If validation or comparison fails, the previous output and the error text are inserted into the prompt (up to 3 attempts total).
+
+### Key findings
+
+* **Aligned data ("sweet spot"):** TOON excels on tabular and uniformly nested structures (e.g., invoices, orders), reaching **90.5%** 1-shot accuracy on the users case while offering significant token savings.
+* **Prompt tax:** Unlike JSON, which is native to model training, TOON requires instructional prompting. For short outputs this overhead reduces efficiency; for larger outputs (batches/logs), the syntax savings amortize the cost.
+* **Structured output trade-off:** Constrained decoding (JSO) acts as a safety net for smaller models (preventing syntax errors) but was found to degrade reasoning/accuracy in some larger models (the "structured output paradox").
+
+### Results by data topology
+
+Performance varies significantly with how well the data aligns with TOON's design (e.g., uniform arrays vs. deep recursive nesting).
+
+| Case | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **users** | 94.8% | 94.8% | 1078 | 92.9% | **100%** | 556 | **90.5%** | 90.5% | 840 |
+| **order** | 81.9% | 81.9% | 1746 | 78.6% | 83.3% | 1255 | 74.3% | 78.6% | 1585 |
+| **company** | 18.6% | 43.8% | 3575 | **21.9%** | **48.1%** | 2592 | 0.0% | 48.6% | 2567 |
+| **invoice** | 90.0% | 90.0% | 1723 | 87.6% | **95.2%** | 1349 | 0.0% | 52.4% | 3626 |
+
+### Full results by model
+
+The following table compares **1-shot accuracy (1-S)**, **final accuracy (Fin)** after repair loops, and the total **token budget (Tok)** required for successful generation.
+
+| Model | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **NousResearch/Hermes-4-405B** | 92.5% | 92.5% | 3252 | 35.0% | **100%** | 4759 | 50.0% | 60.0% | 4671 |
+| **NousResearch/Hermes-4-70B** | 75.0% | 75.0% | 4414 | 37.5% | 75.0% | 5594 | 50.0% | 50.0% | 4738 |
+| **PrimeIntellect/INTELLECT-3** | 72.5% | 75.0% | 10682 | 72.5% | 77.5% | 10103 | 40.0% | 65.0% | 13315 |
+| **Qwen/Qwen2.5-Coder-7B-fast** | 0.0% | 0.0% | 37705 | 75.0% | 75.0% | 4440 | 27.5% | 27.5% | 32715 |
+| **Qwen/Qwen3-235B-A22B-Inst** | **100%** | **100%** | 2772 | **100%** | **100%** | 2772 | 50.0% | **100%** | 4715 |
+| **Qwen/Qwen3-235B-A22B-Thk** | 82.5% | 82.5% | 11425 | 87.5% | 97.5% | 7899 | 50.0% | 97.5% | 17457 |
+| **Qwen/Qwen3-30B-A3B-Inst** | 75.0% | 75.0% | 4436 | 75.0% | 75.0% | 4436 | 50.0% | 70.0% | 5505 |
+| **Qwen/Qwen3-32B** | 75.0% | 77.5% | 10196 | 75.0% | 75.0% | 4120 | 47.5% | 80.0% | 9101 |
+| **Qwen/Qwen3-Coder-30B-A3B** | 75.0% | 75.0% | 4206 | 75.0% | 75.0% | 4206 | 50.0% | **100%** | 4719 |
+| **Qwen/Qwen3-Coder-480B** | 75.0% | 75.0% | 4462 | 75.0% | 75.0% | 4447 | 50.0% | 75.0% | 4515 |
+| **deepseek-ai/DeepSeek-R1** | 55.0% | 70.0% | 13811 | 65.0% | 80.0% | 4149 | 25.0% | 50.0% | 19047 |
+| **deepseek-ai/DeepSeek-V3-fast** | 75.0% | **100%** | 3600 | 75.0% | **100%** | 3584 | 25.0% | 80.0% | 4734 |
+| **google/gemma-2-2b-it** | 75.0% | **100%** | 4721 | 77.5% | **100%** | 4566 | 0.0% | 0.0% | 5955 |
+| **google/gemma-2-9b-it-fast** | 75.0% | 75.0% | 6086 | 75.0% | 75.0% | 6056 | 50.0% | 75.0% | 5419 |
+| **meta-llama/Llama-3.3-70B** | 75.0% | 75.0% | 4551 | 75.0% | 75.0% | 4447 | 50.0% | 50.0% | 5148 |
+| **meta-llama/Llama-3.1-8B** | 72.5% | 72.5% | 7235 | 75.0% | 75.0% | 6941 | 22.5% | 25.0% | 4915 |
+| **moonshotai/Kimi-K2-Instruct** | 50.0% | 75.0% | 4284 | 50.0% | 75.0% | 4283 | 50.0% | **100%** | 3937 |
+| 
**nvidia/Llama-3_1-Nemotron** | 75.0% | 75.0% | 4426 | 50.0% | 50.0% | 5714 | 50.0% | 82.5% | 4368 |
+| **openai/gpt-oss-120b** | **97.5%** | **100%** | 3685 | **100%** | **100%** | 3545 | 50.0% | 87.5% | 8223 |
+| **openai/gpt-oss-20b** | 50.0% | 72.5% | 14943 | 50.0% | 67.5% | 15601 | 50.0% | 90.0% | 9678 |
+| **zai-org/GLM-4.5** | 75.0% | 87.5% | 9677 | 75.0% | 92.5% | 9135 | 27.5% | 52.5% | 8110 |
+
+### Observations
+
+**1. The "Structured Output Paradox"**
+Constrained decoding is not always superior. For `Hermes-4-405B`, applying constraints dropped 1-shot accuracy from **92.5%** (plain JSON) to **35.0%** (structured output). This suggests that for some high-reasoning models, forcing specific grammar paths can actively interfere with the model's reasoning.
+
+**2. Guardrails for smaller models**
+Conversely, for smaller models like `Qwen/Qwen2.5-Coder-7B-fast`, structured output is essential: it raised performance from a catastrophic **0%** (plain JSON) to a viable **75%**.
+
+**3. TOON repair potential**
+While TOON often has lower initial 1-shot accuracy due to the novelty of the format, several models (`Qwen/Qwen3-Coder-30B`, `Kimi-K2-Instruct`, `Qwen/Qwen3-235B`) achieved **100% final accuracy** after repair loops. This indicates that while the format may be unfamiliar at first, the error messages produced by the TOON CLI are highly effective for self-correction.
+
+**4. Token efficiency scaling**
+In cases like `Qwen3-235B-A22B-Inst`, TOON consumed significantly more tokens (~4700) than JSON (~2700). This supports the "prompt tax" hypothesis: for short tasks, the instructional overhead outweighs the syntax savings. TOON becomes efficient primarily in high-volume generation, where the output length justifies the system prompt.
+
+### Analysis & recommendations
+
+1. **Aligned data streams:** Use TOON generation for **SQL dumps, logs, and transactional documents**. The token savings on high-volume, uniform data outweigh the prompt overhead.
+2. **Avoid deep nesting:** For deeply nested or recursive state trees (like DOMs), stick to **JSON** or **JSO**. TOON's indentation tracking is less robust for these structures in one-shot generation.
+3. **Repair loops:** TOON generation benefits disproportionately from repair loops (feeding errors back into context), often correcting format issues that a single generation pass cannot.
+
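The aligned-data recommendation above is easy to see mechanically: a uniform TOON array decodes with one header parse plus one split per row, and the declared `[N]` count gives the decoder a precise error to report when rows are missing. The sketch below is a toy decoder for the tabular subset only (hypothetical code, not the official `@toon-format/cli`, which also infers value types; here values stay strings):

```python
import re

def decode_tabular(toon: str) -> dict:
    """Decode a single tabular TOON block: `name[N]{f1,f2}:` plus N comma-separated rows."""
    lines = [ln for ln in toon.strip().splitlines() if ln.strip()]
    m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:$", lines[0].strip())
    if not m:
        raise ValueError("not a tabular TOON header")
    name, count, fields = m.group(1), int(m.group(2)), m.group(3).split(",")
    rows = [dict(zip(fields, ln.strip().split(","))) for ln in lines[1:]]
    if len(rows) != count:
        # This is the kind of precise error the repair loops feed back to the model.
        raise ValueError(f"[N]={count} but found {len(rows)} rows")
    return {name: rows}

sample = """users[3]{id,name,role}:
  1,Alice,admin
  2,Bob,staff
  3,Eve,guest"""
print(decode_tabular(sample)["users"][0])  # first decoded row as a dict
```

Deeply nested or recursive data has no such cheap row-wise decode path, which is consistent with the company case results above.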
+Installation & Usage (click to expand) + +
+ +This repository contains two main scripts: + +- **`generate.py`** – builds the gold-standard reference outputs used for evaluation. +- **`eval.py`** – runs the full benchmark across all models and decoding strategies. + +Before running the benchmark, install dependencies and create the gold files. + +### **1. Install Python dependencies** + +```bash +pip install -r requirements.txt +``` + +### **2. Install the TOON CLI (required for encoding/decoding)** + +```bash +npm install -g @toon-format/cli +``` + +Alternatively, you can rely on `npx` without a global install. + +### **3. Generate the gold reference outputs** + +This step must be run **once**, or whenever you modify the schemas in `generate.py`: + +```bash +python generate.py +``` + +This will create: + +``` +gold/users.gold.json gold/users.gold.toon +gold/order.gold.json gold/order.gold.toon +gold/company.gold.json gold/company.gold.toon +gold/invoice.gold.json gold/invoice.gold.toon +``` + +### **4. Set your model API key** + +The benchmark uses the Nebius Token Factory API. Set: + +```bash +export LLM_API_KEY="your_nebius_api_key" +``` + +### **5. Run the full benchmark** + +```bash +python eval.py +``` + +This will: + +- Run all test cases for 21 models +- Perform JSON, JSON-SO, and TOON generation +- Decode TOON outputs via CLI +- Validate against the gold standard +- Apply repair loops +- Write per-run statistics to: + +``` +eval_runs.csv +``` + +### **Repository structure** + +``` +├── generate.py # Defines schemas, builds gold objects, writes gold/*.json + *.toon +├── eval.py # Full benchmark runner +├── gold/ # Auto-generated canonical reference data +│ ├── *.gold.json +│ ├── *.gold.toon +├── requirements.txt +└── README.md +``` + +
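Internally, every track in `eval.py` shares the same try/validate/repair shape. The following is a condensed sketch of that cycle (hypothetical helper names; the real tracks differ only in how they generate and parse output):

```python
MAX_ATTEMPTS = 3  # one shot plus up to two repairs, matching the benchmark

def run_with_repairs(generate, parse, matches_gold, make_repair_prompt, first_prompt):
    """Generic repair cycle: generate, parse/validate, compare, then feed errors back."""
    prompt = first_prompt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        output = generate(prompt)              # model call (J, JSO, or TOON track)
        try:
            parsed = parse(output)             # json.loads / TOON-CLI decode + Pydantic
            if matches_gold(parsed):
                return {"final_ok": True, "attempts_used": attempt}
            error = "Structure valid but values differ from expected gold."
        except Exception as exc:               # syntax or validation failure
            error = str(exc)
        prompt = make_repair_prompt(output, error)  # error text goes back into the prompt
    return {"final_ok": False, "attempts_used": MAX_ATTEMPTS}
```

A run counts as "1-shot" when the first attempt already matches the gold object; "final" accuracy also counts successes on repair attempts.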
diff --git a/benchmarks/generation/eval.py b/benchmarks/generation/eval.py
new file mode 100644
index 00000000..53d65a24
--- /dev/null
+++ b/benchmarks/generation/eval.py
@@ -0,0 +1,876 @@
+# eval.py
+import json
+import os
+import re
+import csv
+import subprocess
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Tuple, Type
+
+from pydantic import BaseModel, TypeAdapter
+from openai import OpenAI
+from openai import APIError, InternalServerError, RateLimitError
+
+# --- Import Pydantic models from generate.py ---
+from generate import (
+    UserRow, Order,
+    Company, Invoice,
+)
+
+# =========================================
+# Config: models + runs + output CSV
+# =========================================
+MODELS = [
+    'deepseek-ai/DeepSeek-V3-0324-fast',
+    'openai/gpt-oss-120b',
+    'moonshotai/Kimi-K2-Instruct',
+    'Qwen/Qwen3-Coder-480B-A35B-Instruct',
+    'NousResearch/Hermes-4-405B',
+    'NousResearch/Hermes-4-70B',
+    'openai/gpt-oss-20b',
+    'zai-org/GLM-4.5',
+    'deepseek-ai/DeepSeek-R1-0528',
+    'PrimeIntellect/INTELLECT-3',
+    'Qwen/Qwen3-235B-A22B-Thinking-2507',
+    'Qwen/Qwen3-235B-A22B-Instruct-2507',
+    'Qwen/Qwen3-30B-A3B-Instruct-2507',
+    'Qwen/Qwen3-Coder-30B-A3B-Instruct',
+    'Qwen/Qwen3-32B',
+    'nvidia/Llama-3_1-Nemotron-Ultra-253B-v1',
+    'meta-llama/Llama-3.3-70B-Instruct',
+    'meta-llama/Meta-Llama-3.1-8B-Instruct',
+    'Qwen/Qwen2.5-Coder-7B-fast',
+    'google/gemma-2-2b-it',
+    'google/gemma-2-9b-it-fast',
+]
+RUNS_PER_MODEL = 10
+CSV_PATH = Path("eval_runs.csv")
+
+# =========================================
+# LLM client
+# =========================================
+LLM_API_KEY = os.environ.get("LLM_API_KEY")
+if not LLM_API_KEY:
+    raise RuntimeError("Missing LLM_API_KEY environment variable")
+
+client = OpenAI(
+    base_url="https://api.studio.nebius.com/v1/",
+    api_key=LLM_API_KEY,
+)
+
+SYSTEM_PROMPT = (
+    "You are a data-formatting model. "
+    "Follow instructions exactly. 
When asked for JSON, you must return JSON that conforms to the provided JSON Schema. " + "No extra text. When asked for TOON, return only a ```toon fenced block." +) + +# ========================================= +# Retry wrapper for API calls +# ========================================= +def retry_on_error(func, max_retries=5, initial_delay=2.0): + """Retry a function with exponential backoff on API errors.""" + for attempt in range(max_retries): + try: + return func() + except (InternalServerError, APIError, RateLimitError) as e: + if attempt == max_retries - 1: + print(f"Failed after {max_retries} attempts: {e}") + raise + + delay = initial_delay * (2 ** attempt) + print(f"API error (attempt {attempt + 1}/{max_retries}): {e}") + print(f"Retrying in {delay:.1f} seconds...") + time.sleep(delay) + except Exception as e: + # Don't retry on other exceptions (validation errors, etc.) + raise + +# ========================================= +# Structured JSON call (json_schema) +# ========================================= +def llm_call_json_structured(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]: + """Return (json_text, prompt_tokens, completion_tokens) with JSON object output.""" + print(f"Calling {model} json_structured") + # Add schema to prompt for guidance + schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}" + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + response_format={ + "type": "json_object", + }, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": schema_prompt}, + ], + ) + + msg = resp.choices[0].message + + # Handle refusal + if msg.refusal: + raise ValueError(f"Model refused: {msg.refusal}") + + text = (msg.content or "").strip() + usage = getattr(resp, "usage", None) + p = getattr(usage, 
"prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Plain JSON call (no response_format) +# ========================================= +def llm_call_json_plain(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]: + """Return (json_text, prompt_tokens, completion_tokens) with plain text completion.""" + print(f"Calling {model} json_plain") + # Add schema to prompt for guidance + schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}" + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": schema_prompt}, + ], + ) + + text = resp.choices[0].message.content or "" + # Remove think tags + text = re.sub(r".*?", "", text, flags=re.DOTALL).strip() + # Remove markdown code fences if present + text = re.sub(r"```(?:json)?\s*(.*?)```", r"\1", text, flags=re.DOTALL).strip() + + usage = getattr(resp, "usage", None) + p = getattr(usage, "prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Plain call (for TOON generation) +# ========================================= +def llm_call_plain(model: str, prompt: str) -> Tuple[str, int, int]: + print(f"Calling {model} plain") + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": prompt}, + ], + ) + text = resp.choices[0].message.content or "" + text = re.sub(r".*?", "", text, 
flags=re.DOTALL).strip() + usage = getattr(resp, "usage", None) + p = getattr(usage, "prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Paths +# ========================================= +GOLD = Path("gold") +USERS_JSON = GOLD / "users.gold.json" +ORDER_JSON = GOLD / "order.gold.json" +COMPANY_JSON = GOLD / "company.gold.json" +INVOICE_JSON = GOLD / "invoice.gold.json" + +# ========================================= +# Canonicalization (stable compare) +# ========================================= +def sort_users_by_id(obj: Dict[str, Any]) -> Dict[str, Any]: + if "users" in obj and isinstance(obj["users"], list): + obj["users"] = sorted(obj["users"], key=lambda r: r.get("id")) + return obj + +def sort_order_items(obj: Dict[str, Any]) -> Dict[str, Any]: + if "items" in obj and isinstance(obj["items"], list): + obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku")) + return obj + +def sort_company(obj: Dict[str, Any]) -> Dict[str, Any]: + if "departments" in obj and isinstance(obj["departments"], list): + obj["departments"] = sorted(obj["departments"], key=lambda d: d.get("code")) + for d in obj["departments"]: + if isinstance(d, dict) and "employees" in d and isinstance(d["employees"], list): + d["employees"] = sorted(d["employees"], key=lambda e: e.get("id")) + return obj + +def sort_invoice(obj: Dict[str, Any]) -> Dict[str, Any]: + if "items" in obj and isinstance(obj["items"], list): + obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku")) + return obj + +def canonical_json(obj: Any, case: str) -> Any: + if case == "users": return sort_users_by_id(obj) + if case == "order": return sort_order_items(obj) + if case == "company": return sort_company(obj) + if case == "invoice": return sort_invoice(obj) + return obj + +# ========================================= +# Pydantic validation (+ shape 
normalization) +# ========================================= +class UsersPayload(BaseModel): + users: List[UserRow] + +def validate_users_json(data: Any) -> List[UserRow]: + if not isinstance(data, dict) or "users" not in data: + raise ValueError("Expected object with key 'users'") + adapter = TypeAdapter(List[UserRow]) + return adapter.validate_python(data["users"]) + +def normalize_by_key(data: Any, key: str) -> Any: + if isinstance(data, dict) and key in data and isinstance(data[key], dict): + return data[key] + return data + +def validate_order_json(data: Any) -> Order: + data = normalize_by_key(data, "order") # TOON may wrap + adapter = TypeAdapter(Order) + return adapter.validate_python(data) + +def validate_company_json(data: Any) -> Company: + data = normalize_by_key(data, "company") + adapter = TypeAdapter(Company) + return adapter.validate_python(data) + +def validate_invoice_json(data: Any) -> Invoice: + data = normalize_by_key(data, "invoice") + adapter = TypeAdapter(Invoice) + return adapter.validate_python(data) + +# ========================================= +# TOON decode via official CLI +# ========================================= +def extract_toon_payload(toon_text: str) -> str: + m = re.search(r"```toon\s*(.*?)```", toon_text, flags=re.DOTALL | re.IGNORECASE) + return m.group(1).strip() if m else toon_text.strip() + +def decode_toon_to_json(toon_text: str) -> Any: + payload = extract_toon_payload(toon_text) + proc = subprocess.run( + ["npx", "@toon-format/cli", "--decode"], + input=payload.encode("utf-8"), + capture_output=True, + check=True, + ) + return json.loads(proc.stdout.decode("utf-8")) + +# ========================================= +# Prompts — JSON (structured) / TOON +# ========================================= +def make_json_prompt_users() -> str: + return ( + "Create a user directory with three users:\n" + "- User 1: Alice, who is an admin\n" + "- User 2: Bob, who is a staff member\n" + "- User 3: Eve, who is a guest\n\n" + "Return 
the data as JSON with a 'users' array containing objects with id, name, and role fields." + ) + +def make_json_prompt_order() -> str: + return ( + "Create an order record:\n" + "- Order ID: 101\n" + "- Customer: Ada (ID: 9)\n" + "- Items:\n" + " * Product A1: quantity 2, price $9.99 each\n" + " * Product B2: quantity 1, price $14.50 each\n\n" + "Return as JSON with fields for id, customer (with id and name), and items array (with sku, qty, price)." + ) + +def make_json_prompt_company() -> str: + return ( + "Create a company organization structure:\n" + "- Company: Acme (ID: 1)\n" + "- Engineering Department (code: ENG):\n" + " * Alice (ID: 1) - engineer\n" + " * Bob (ID: 2) - manager\n" + "- Operations Department (code: OPS):\n" + " * Eve (ID: 3) - analyst\n\n" + "Return as JSON with company info and nested departments array, each containing employees." + ) + +def make_json_prompt_invoice() -> str: + return ( + "Create an invoice:\n" + "- Invoice number: INV-2025-001\n" + "- Currency: USD\n" + "- Customer: Ada (ID: 9)\n" + "- Line items:\n" + " * A1: quantity 2 @ $9.99 each = $19.98\n" + " * B2: quantity 1 @ $14.50 each = $14.50\n" + "- Subtotal: $34.48\n" + "- Tax: $6.90\n" + "- Grand total: $41.38\n" + "- Notes: Thank you for your business.\n\n" + "Return as JSON with all invoice details including items array and totals breakdown." 
+ ) + +# ========================================= +# Improved TOON Prompts (short nested examples + same tasks) +# ========================================= + + + + + + +def make_toon_prompt_users() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an array named users with fields id, name, and role.\n" + "User data:\n" + "- id=1, name=Alice, role=admin\n" + "- id=2, name=Bob, role=staff\n" + "- id=3, name=Eve, role=guest\n\n" + "Output only the TOON code block.\n" + ) + + + + +def make_toon_prompt_order() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " 
version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an order record with fields: id, customer (with id and name), " + "and items array (with sku, qty, price).\n" + "- Order ID: 101\n" + "- Customer: Ada (ID: 9)\n" + "- Items:\n" + " * Product A1: quantity 2, price $9.99 each\n" + " * Product B2: quantity 1, price $14.50 each\n" + ) + + +def make_toon_prompt_company() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create a company organization structure with company info and nested departments array, each containing employees:\n" + "- Company: Acme (ID: 1)\n" + "- Engineering Department (code: ENG):\n" + " * Alice (ID: 1) - engineer\n" + " * Bob (ID: 2) - manager\n" + "- Operations Department (code: OPS):\n" + " * Eve (ID: 3) - analyst\n\n" + ) + + +def make_toon_prompt_invoice() -> str: + return ( + "You are to produce output STRICTLY in 
TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an invoice with all invoice details including items array and totals breakdown:\n" + "- Invoice number: INV-2025-001\n" + "- Currency: USD\n" + "- Customer: Ada (ID: 9)\n" + "- Line items:\n" + " * A1: quantity 2 @ $9.99 each = $19.98\n" + " * B2: quantity 1 @ $14.50 each = $14.50\n" + "- Subtotal: $34.48\n" + "- Tax: $6.90\n" + "- Grand total: $41.38\n" + "- Notes: Thank you for your business.\n" + ) +# ========================================= +# Repair prompts +# ========================================= +def make_json_repair_prompt(prev_output: str, error_msg: str) -> str: + return ( + "Your previous JSON did not validate against the schema. " + "Return ONLY valid JSON (no prose, no fences) that matches the schema and the target values.\n" + f"Validation error:\n{error_msg}\n\n" + "Previous output:\n" + f"{prev_output}\n" + ) + +def make_toon_repair_prompt(prev_output: str, error_msg: str) -> str: + return ( + "Your previous TOON was invalid. 
Return ONLY a ```toon fenced block.\n"
+        "- Use 2-space indentation; no trailing spaces.\n"
+        "- Ensure headers/fieldsets and [N] match row counts.\n"
+        f"Validation/decoding error:\n{error_msg}\n\n"
+        "Previous output:\n"
+        f"{prev_output}\n"
+    )
+
+# =========================================
+# Core evaluation (one shot + up to 2 repairs)
+# =========================================
+MAX_ATTEMPTS = 3
+
+def eval_json_track(
+    model: str,
+    make_prompt_fn,
+    schema_model: Type[BaseModel],
+    validate_fn,
+    gold_obj,
+    canon_case: str,
+):
+    tokens_p = tokens_c = 0
+    prompt = make_prompt_fn()
+    out, p, c = llm_call_json_structured(model, prompt, schema_model); tokens_p += p; tokens_c += c
+    try:
+        parsed = json.loads(out)
+        # print(f"JSON SO parsed: {parsed}")
+        validate_fn(parsed)  # Pydantic
+        parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+        one_shot_ok = final_ok = (parsed == gold_obj)
+        if final_ok:
+            return dict(one_shot_ok=True, final_ok=True, attempts_used=1,
+                        tokens_prompt=tokens_p, tokens_completion=tokens_c)
+    except Exception as e:
+        err = str(e); one_shot_ok = False; prev = out
+    else:
+        err = "Structure valid but values differ from expected gold."; prev = out
+
+    for i in range(1, MAX_ATTEMPTS):
+        repair_prompt = make_json_repair_prompt(prev, err)
+        out, p, c = llm_call_json_structured(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c
+        try:
+            parsed = json.loads(out)
+            validate_fn(parsed)
+            parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+            final_ok = (parsed == gold_obj)
+            if final_ok:
+                return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1,
+                            tokens_prompt=tokens_p, tokens_completion=tokens_c)
+            else:
+                err = "Structure valid but values differ from expected gold."
+ prev = out + except Exception as e: + err = str(e); prev = out + continue + + return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + +def eval_json_plain_track( + model: str, + make_prompt_fn, + schema_model: Type[BaseModel], + validate_fn, + gold_obj, + canon_case: str, +): + """Evaluate JSON generation without response_format (plain completion).""" + tokens_p = tokens_c = 0 + prompt = make_prompt_fn() + out, p, c = llm_call_json_plain(model, prompt, schema_model); tokens_p += p; tokens_c += c + try: + parsed = json.loads(out) + # print(f"JSON plain parsed: {parsed}") + validate_fn(parsed) # Pydantic + parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case) + one_shot_ok = final_ok = (parsed == gold_obj) + if final_ok: + return dict(one_shot_ok=True, final_ok=True, attempts_used=1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + except Exception as e: + err = str(e); one_shot_ok = False; prev = out + else: + err = "Structure valid but values differ from expected gold."; prev = out + + for i in range(1, MAX_ATTEMPTS): + repair_prompt = make_json_repair_prompt(prev, err) + out, p, c = llm_call_json_plain(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c + try: + parsed = json.loads(out) + validate_fn(parsed) + parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case) + final_ok = (parsed == gold_obj) + if final_ok: + return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + else: + err = "Structure valid but values differ from expected gold." 
+ prev = out + except Exception as e: + err = str(e); prev = out + continue + + return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + +def eval_toon_track(model: str, make_prompt_fn, validate_fn, gold_obj, canon_case: str): + tokens_p = tokens_c = 0 + prompt = make_prompt_fn() + out, p, c = llm_call_plain(model, prompt); tokens_p += p; tokens_c += c + try: + decoded = decode_toon_to_json(out) + # print(f"TOON decoded: {decoded}") + validate_fn(decoded) + decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case) + one_shot_ok = final_ok = (decoded == gold_obj) + if final_ok: + return dict(one_shot_ok=True, final_ok=True, attempts_used=1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + except Exception as e: + err = str(e); one_shot_ok = False; prev = out + else: + err = "Structure valid but values differ from expected gold."; prev = out + + for i in range(1, MAX_ATTEMPTS): + repair_prompt = make_toon_repair_prompt(prev, err) + out, p, c = llm_call_plain(model, repair_prompt); tokens_p += p; tokens_c += c + try: + decoded = decode_toon_to_json(out) + validate_fn(decoded) + decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case) + final_ok = (decoded == gold_obj) + if final_ok: + return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + else: + err = "Structure valid but values differ from expected gold." 
+                prev = out
+        except Exception as e:
+            err = str(e); prev = out
+            continue
+
+    return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS,
+                tokens_prompt=tokens_p, tokens_completion=tokens_c)
+
+# =========================================
+# Case runners aggregating metrics
+# =========================================
+def run_case_users(model: str):
+    gold = json.loads(USERS_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "users")
+    jm = eval_json_track(model, make_json_prompt_users, UsersPayload, validate_users_json, gold, "users")
+    jpm = eval_json_plain_track(model, make_json_prompt_users, UsersPayload, validate_users_json, gold, "users")
+    tm = eval_toon_track(model, make_toon_prompt_users, validate_users_json, gold, "users")
+    return {
+        "users_json_one_shot": jm["one_shot_ok"], "users_json_final": jm["final_ok"],
+        "users_json_attempts": jm["attempts_used"],
+        "users_json_tokens_prompt": jm["tokens_prompt"], "users_json_tokens_completion": jm["tokens_completion"],
+        "users_json_plain_one_shot": jpm["one_shot_ok"], "users_json_plain_final": jpm["final_ok"],
+        "users_json_plain_attempts": jpm["attempts_used"],
+        "users_json_plain_tokens_prompt": jpm["tokens_prompt"], "users_json_plain_tokens_completion": jpm["tokens_completion"],
+        "users_toon_one_shot": tm["one_shot_ok"], "users_toon_final": tm["final_ok"],
+        "users_toon_attempts": tm["attempts_used"],
+        "users_toon_tokens_prompt": tm["tokens_prompt"], "users_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_order(model: str):
+    gold = json.loads(ORDER_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "order")
+    jm = eval_json_track(model, make_json_prompt_order, Order, validate_order_json, gold, "order")
+    jpm = eval_json_plain_track(model, make_json_prompt_order, Order, validate_order_json, gold, "order")
+    tm = eval_toon_track(model, make_toon_prompt_order, validate_order_json, gold, "order")
+    return {
+        "order_json_one_shot": jm["one_shot_ok"], "order_json_final": jm["final_ok"],
+        "order_json_attempts": jm["attempts_used"],
+        "order_json_tokens_prompt": jm["tokens_prompt"], "order_json_tokens_completion": jm["tokens_completion"],
+        "order_json_plain_one_shot": jpm["one_shot_ok"], "order_json_plain_final": jpm["final_ok"],
+        "order_json_plain_attempts": jpm["attempts_used"],
+        "order_json_plain_tokens_prompt": jpm["tokens_prompt"], "order_json_plain_tokens_completion": jpm["tokens_completion"],
+        "order_toon_one_shot": tm["one_shot_ok"], "order_toon_final": tm["final_ok"],
+        "order_toon_attempts": tm["attempts_used"],
+        "order_toon_tokens_prompt": tm["tokens_prompt"], "order_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_company(model: str):
+    gold = json.loads(COMPANY_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "company")
+    jm = eval_json_track(model, make_json_prompt_company, Company, validate_company_json, gold, "company")
+    jpm = eval_json_plain_track(model, make_json_prompt_company, Company, validate_company_json, gold, "company")
+    tm = eval_toon_track(model, make_toon_prompt_company, validate_company_json, gold, "company")
+    return {
+        "company_json_one_shot": jm["one_shot_ok"], "company_json_final": jm["final_ok"],
+        "company_json_attempts": jm["attempts_used"],
+        "company_json_tokens_prompt": jm["tokens_prompt"], "company_json_tokens_completion": jm["tokens_completion"],
+        "company_json_plain_one_shot": jpm["one_shot_ok"], "company_json_plain_final": jpm["final_ok"],
+        "company_json_plain_attempts": jpm["attempts_used"],
+        "company_json_plain_tokens_prompt": jpm["tokens_prompt"], "company_json_plain_tokens_completion": jpm["tokens_completion"],
+        "company_toon_one_shot": tm["one_shot_ok"], "company_toon_final": tm["final_ok"],
+        "company_toon_attempts": tm["attempts_used"],
+        "company_toon_tokens_prompt": tm["tokens_prompt"], "company_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_invoice(model: str):
+    gold = json.loads(INVOICE_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "invoice")
+    jm = eval_json_track(model, make_json_prompt_invoice, Invoice, validate_invoice_json, gold, "invoice")
+    jpm = eval_json_plain_track(model, make_json_prompt_invoice, Invoice, validate_invoice_json, gold, "invoice")
+    tm = eval_toon_track(model, make_toon_prompt_invoice, validate_invoice_json, gold, "invoice")
+    return {
+        "invoice_json_one_shot": jm["one_shot_ok"], "invoice_json_final": jm["final_ok"],
+        "invoice_json_attempts": jm["attempts_used"],
+        "invoice_json_tokens_prompt": jm["tokens_prompt"], "invoice_json_tokens_completion": jm["tokens_completion"],
+        "invoice_json_plain_one_shot": jpm["one_shot_ok"], "invoice_json_plain_final": jpm["final_ok"],
+        "invoice_json_plain_attempts": jpm["attempts_used"],
+        "invoice_json_plain_tokens_prompt": jpm["tokens_prompt"], "invoice_json_plain_tokens_completion": jpm["tokens_completion"],
+        "invoice_toon_one_shot": tm["one_shot_ok"], "invoice_toon_final": tm["final_ok"],
+        "invoice_toon_attempts": tm["attempts_used"],
+        "invoice_toon_tokens_prompt": tm["tokens_prompt"], "invoice_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+# =========================================
+# Summary helpers
+# =========================================
+def summarize_formats(results: Dict[str, Any]) -> Dict[str, Any]:
+    cases = ["users", "order", "company", "invoice"]
+    summary = {}
+    for fmt in ["json", "json_plain", "toon"]:
+        one_shot_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_one_shot"))
+        final_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_final"))
+        n = len(cases)
+        prompt_tokens = sum(results.get(f"{case}_{fmt}_tokens_prompt", 0) for case in cases)
+        comp_tokens = sum(results.get(f"{case}_{fmt}_tokens_completion", 0) for case in cases)
+        summary[f"{fmt}_one_shot_accuracy"] = one_shot_hits / n if n else 0.0
+        summary[f"{fmt}_final_accuracy"] = final_hits / n if n else 0.0
+        summary[f"{fmt}_prompt_tokens"] = prompt_tokens
+        summary[f"{fmt}_completion_tokens"] = comp_tokens
+        summary[f"{fmt}_total_tokens"] = prompt_tokens + comp_tokens
+    summary["overall_prompt_tokens"] = summary["json_prompt_tokens"] + summary["json_plain_prompt_tokens"] + summary["toon_prompt_tokens"]
+    summary["overall_completion_tokens"] = summary["json_completion_tokens"] + summary["json_plain_completion_tokens"] + summary["toon_completion_tokens"]
+    summary["overall_total_tokens"] = summary["json_total_tokens"] + summary["json_plain_total_tokens"] + summary["toon_total_tokens"]
+    return summary
+
+def flatten_for_csv(model: str, run_idx: int, results: Dict[str, Any]) -> Dict[str, Any]:
+    row = {"model": model, "run": run_idx}
+    for case in ["users", "order", "company", "invoice"]:
+        for fmt in ["json", "json_plain", "toon"]:
+            row[f"{case}_{fmt}_one_shot"] = results.get(f"{case}_{fmt}_one_shot", False)
+            row[f"{case}_{fmt}_final"] = results.get(f"{case}_{fmt}_final", False)
+            row[f"{case}_{fmt}_attempts"] = results.get(f"{case}_{fmt}_attempts", 0)
+            row[f"{case}_{fmt}_prompt_tokens"] = results.get(f"{case}_{fmt}_tokens_prompt", 0)
+            row[f"{case}_{fmt}_completion_tokens"] = results.get(f"{case}_{fmt}_tokens_completion", 0)
+    summary = summarize_formats(results)
+    row.update({
+        "json_one_shot_accuracy": summary["json_one_shot_accuracy"],
+        "json_final_accuracy": summary["json_final_accuracy"],
+        "json_prompt_tokens": summary["json_prompt_tokens"],
+        "json_completion_tokens": summary["json_completion_tokens"],
+        "json_total_tokens": summary["json_total_tokens"],
+        "json_plain_one_shot_accuracy": summary["json_plain_one_shot_accuracy"],
+        "json_plain_final_accuracy": summary["json_plain_final_accuracy"],
+        "json_plain_prompt_tokens": summary["json_plain_prompt_tokens"],
+        "json_plain_completion_tokens": summary["json_plain_completion_tokens"],
+        "json_plain_total_tokens": summary["json_plain_total_tokens"],
+        "toon_one_shot_accuracy": summary["toon_one_shot_accuracy"],
+        "toon_final_accuracy": summary["toon_final_accuracy"],
+        "toon_prompt_tokens": summary["toon_prompt_tokens"],
+        "toon_completion_tokens": summary["toon_completion_tokens"],
+        "toon_total_tokens": summary["toon_total_tokens"],
+        "overall_prompt_tokens": summary["overall_prompt_tokens"],
+        "overall_completion_tokens": summary["overall_completion_tokens"],
+        "overall_total_tokens": summary["overall_total_tokens"],
+    })
+    return row
+
+# =========================================
+# Main (iterate models × runs, write CSV)
+# =========================================
+if __name__ == "__main__":
+    header_fields = ["model", "run"]
+    for case in ["users", "order", "company", "invoice"]:
+        for fmt in ["json", "json_plain", "toon"]:
+            header_fields += [
+                f"{case}_{fmt}_one_shot",
+                f"{case}_{fmt}_final",
+                f"{case}_{fmt}_attempts",
+                f"{case}_{fmt}_prompt_tokens",
+                f"{case}_{fmt}_completion_tokens",
+            ]
+    header_fields += [
+        "json_one_shot_accuracy","json_final_accuracy",
+        "json_prompt_tokens","json_completion_tokens","json_total_tokens",
+        "json_plain_one_shot_accuracy","json_plain_final_accuracy",
+        "json_plain_prompt_tokens","json_plain_completion_tokens","json_plain_total_tokens",
+        "toon_one_shot_accuracy","toon_final_accuracy",
+        "toon_prompt_tokens","toon_completion_tokens","toon_total_tokens",
+        "overall_prompt_tokens","overall_completion_tokens","overall_total_tokens",
+    ]
+
+    write_header = not CSV_PATH.exists()
+    with CSV_PATH.open("a", newline="", encoding="utf-8") as f:
+        writer = csv.DictWriter(f, fieldnames=header_fields)
+        if write_header:
+            writer.writeheader()
+
+        for model in MODELS:
+            print(f"Processing {model}...")
+            for run_idx in range(1, RUNS_PER_MODEL + 1):
+                print(f"Run {run_idx}...")
+                results: Dict[str, Any] = {}
+                results.update(run_case_users(model))
+                print("Users done")
+                results.update(run_case_order(model))
+                print("Order done")
+                results.update(run_case_company(model))
+                print("Company done")
+                results.update(run_case_invoice(model))
+                print("Invoice done")
+                row = flatten_for_csv(model, run_idx, results)
+                writer.writerow(row)
+
+    print(f"Wrote per-run stats to {CSV_PATH.resolve()}")
\ No newline at end of file
diff --git a/benchmarks/generation/generate.py b/benchmarks/generation/generate.py
new file mode 100644
index 00000000..8646c552
--- /dev/null
+++ b/benchmarks/generation/generate.py
@@ -0,0 +1,168 @@
+# generate.py
+from typing import List, Literal, Optional
+import json
+import subprocess
+from pathlib import Path
+from pydantic import BaseModel, ConfigDict, Field
+
+# ---------- Pydantic models (simple) ----------
+class UserRow(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str = Field(min_length=1)
+    role: Literal['admin', 'staff', 'guest']
+
+class Customer(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+
+class OrderItem(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    sku: str
+    qty: int
+    price: float
+
+class Order(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    customer: Customer
+    items: List[OrderItem]
+
+
+# ---------- Pydantic models (more complex #1: company) ----------
+class Employee(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+    title: Literal['engineer', 'manager', 'analyst']
+
+class Department(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    code: str
+    name: str
+    employees: List[Employee]
+
+class Company(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+    departments: List[Department]
+
+
+# ---------- Pydantic models (more complex #2: invoice) ----------
+class InvoiceLine(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    sku: str
+    qty: int
+    unit_price: float
+    line_total: float  # keep explicit to avoid computed logic here
+
+class Totals(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    subtotal: float
+    tax: float
+    grand_total: float
+
+class Invoice(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    number: str
+    currency: Literal['USD', 'EUR', 'SAR']
+    customer: Customer
+    items: List[InvoiceLine]
+    totals: Totals
+    notes: Optional[str] = None
+
+
+# ---------- Create gold Python objects ----------
+# 1) Tabular users
+users = [
+    UserRow(id=1, name="Alice", role="admin"),
+    UserRow(id=2, name="Bob", role="staff"),
+    UserRow(id=3, name="Eve", role="guest"),
+]
+users_gold = {"users": [u.model_dump() for u in users]}
+
+# 2) Nested order
+order_gold = Order(
+    id=101,
+    customer=Customer(id=9, name="Ada"),
+    items=[
+        OrderItem(sku="A1", qty=2, price=9.99),
+        OrderItem(sku="B2", qty=1, price=14.50),
+    ],
+).model_dump()
+
+# 3) More complex: company with nested tabular arrays
+company_gold = Company(
+    id=1,
+    name="Acme",
+    departments=[
+        Department(
+            code="ENG",
+            name="Engineering",
+            employees=[
+                Employee(id=1, name="Alice", title="engineer"),
+                Employee(id=2, name="Bob", title="manager"),
+            ],
+        ),
+        Department(
+            code="OPS",
+            name="Operations",
+            employees=[
+                Employee(id=3, name="Eve", title="analyst"),
+            ],
+        ),
+    ],
+).model_dump()
+
+# 4) More complex: invoice with nested objects + tabular line items
+invoice_gold = Invoice(
+    number="INV-2025-001",
+    currency="USD",
+    customer=Customer(id=9, name="Ada"),
+    items=[
+        InvoiceLine(sku="A1", qty=2, unit_price=9.99, line_total=19.98),
+        InvoiceLine(sku="B2", qty=1, unit_price=14.50, line_total=14.50),
+    ],
+    totals=Totals(subtotal=34.48, tax=6.90, grand_total=41.38),
+    notes="Thank you for your business.",
+).model_dump()
+
+
+# ---------- Write gold JSON to disk ----------
+outdir = Path("gold")
+outdir.mkdir(exist_ok=True)
+
+def write_json(path: Path, obj) -> None:
+    path.write_text(json.dumps(obj, ensure_ascii=False, separators=(",", ":")), encoding="utf-8")
+
+users_json_path = outdir / "users.gold.json"
+order_json_path = outdir / "order.gold.json"
+company_json_path = outdir / "company.gold.json"
+invoice_json_path = outdir / "invoice.gold.json"
+
+write_json(users_json_path, users_gold)
+write_json(order_json_path, order_gold)
+write_json(company_json_path, company_gold)
+write_json(invoice_json_path, invoice_gold)
+
+
+# ---------- Use TOON CLI via npx to encode JSON -> TOON ----------
+def encode_to_toon(json_path: Path, toon_path: Path) -> None:
+    subprocess.run(
+        ["npx", "@toon-format/cli", str(json_path), "-o", str(toon_path)],
+        check=True,
+    )
+
+encode_to_toon(users_json_path, outdir / "users.gold.toon")
+encode_to_toon(order_json_path, outdir / "order.gold.toon")
+encode_to_toon(company_json_path, outdir / "company.gold.toon")
+encode_to_toon(invoice_json_path, outdir / "invoice.gold.toon")
+
+print("Wrote:")
+for p in [users_json_path, outdir / "users.gold.toon",
+          order_json_path, outdir / "order.gold.toon",
+          company_json_path, outdir / "company.gold.toon",
+          invoice_json_path, outdir / "invoice.gold.toon"]:
+    print(f"  {p}")
diff --git a/benchmarks/generation/gold/company.gold.json b/benchmarks/generation/gold/company.gold.json
new file mode 100644
index 00000000..77238a84
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.json
@@ -0,0 +1 @@
+{"id":1,"name":"Acme","departments":[{"code":"ENG","name":"Engineering","employees":[{"id":1,"name":"Alice","title":"engineer"},{"id":2,"name":"Bob","title":"manager"}]},{"code":"OPS","name":"Operations","employees":[{"id":3,"name":"Eve","title":"analyst"}]}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/company.gold.toon b/benchmarks/generation/gold/company.gold.toon
new file mode 100644
index 00000000..43b015bb
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.toon
@@ -0,0 +1,12 @@
+id: 1
+name: Acme
+departments[2]:
+  - code: ENG
+    name: Engineering
+    employees[2]{id,name,title}:
+      1,Alice,engineer
+      2,Bob,manager
+  - code: OPS
+    name: Operations
+    employees[1]{id,name,title}:
+      3,Eve,analyst
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.json b/benchmarks/generation/gold/invoice.gold.json
new file mode 100644
index 00000000..6b3ce1d0
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.json
@@ -0,0 +1 @@
+{"number":"INV-2025-001","currency":"USD","customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"unit_price":9.99,"line_total":19.98},{"sku":"B2","qty":1,"unit_price":14.5,"line_total":14.5}],"totals":{"subtotal":34.48,"tax":6.9,"grand_total":41.38},"notes":"Thank you for your business."}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.toon b/benchmarks/generation/gold/invoice.gold.toon
new file mode 100644
index 00000000..d1e93c91
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.toon
@@ -0,0 +1,13 @@
+number: INV-2025-001
+currency: USD
+customer:
+  id: 9
+  name: Ada
+items[2]{sku,qty,unit_price,line_total}:
+  A1,2,9.99,19.98
+  B2,1,14.5,14.5
+totals:
+  subtotal: 34.48
+  tax: 6.9
+  grand_total: 41.38
+notes: Thank you for your business.
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.json b/benchmarks/generation/gold/order.gold.json
new file mode 100644
index 00000000..c3c2a604
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.json
@@ -0,0 +1 @@
+{"id":101,"customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"price":9.99},{"sku":"B2","qty":1,"price":14.5}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.toon b/benchmarks/generation/gold/order.gold.toon
new file mode 100644
index 00000000..ebc448fc
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.toon
@@ -0,0 +1,7 @@
+id: 101
+customer:
+  id: 9
+  name: Ada
+items[2]{sku,qty,price}:
+  A1,2,9.99
+  B2,1,14.5
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.json b/benchmarks/generation/gold/users.gold.json
new file mode 100644
index 00000000..ac535be2
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.json
@@ -0,0 +1 @@
+{"users":[{"id":1,"name":"Alice","role":"admin"},{"id":2,"name":"Bob","role":"staff"},{"id":3,"name":"Eve","role":"guest"}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.toon b/benchmarks/generation/gold/users.gold.toon
new file mode 100644
index 00000000..f19feaad
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.toon
@@ -0,0 +1,4 @@
+users[3]{id,name,role}:
+  1,Alice,admin
+  2,Bob,staff
+  3,Eve,guest
\ No newline at end of file
diff --git a/benchmarks/generation/requirements.txt b/benchmarks/generation/requirements.txt
new file mode 100644
index 00000000..4505df10
--- /dev/null
+++ b/benchmarks/generation/requirements.txt
@@ -0,0 +1,3 @@
+pydantic>=2.5.0
+openai>=1.0.0
+pandas>=2.0.0