diff --git a/benchmarks/generation/README.md b/benchmarks/generation/README.md
new file mode 100644
index 00000000..f691a437
--- /dev/null
+++ b/benchmarks/generation/README.md
@@ -0,0 +1,180 @@
+---
+## Token-Oriented Object Notation vs JSON: a benchmark of plain and constrained decoding generation
+
+[Token-Oriented Object Notation](https://github.com/toon-format) is a compact, human-readable encoding of the JSON data model that minimizes tokens and makes structure easy for models to follow. It is intended for LLM input as a drop-in, lossless representation of your existing JSON.
+
+While TOON is primarily designed for input, its token efficiency makes it a candidate for LLM output in specific high-volume scenarios. This benchmark compares three generation strategies across 21 models.
+
+### Benchmark design
+
+**Gold standard:** Created from Pydantic models and serialized to `*.gold.json` (canonical JSON) and `*.gold.toon` (via `@toon-format/cli`).
+
+**Test cases:**
+1. **users**: Simple tabular structure.
+2. **order**: Nested structure with an array.
+3. **company**: Department and employee hierarchy (deep nesting).
+4. **invoice**: Items and totals.
+
+**Test tracks:**
+* **JSON track (J):** Plain JSON generation with Pydantic validation.
+* **JSON-SO track (JSO):** Structured output (`response_format="json_object"`) with constrained decoding. The inference engine compiles the JSON grammar into a state machine (e.g., xgrammar) that masks illegal tokens during generation, enforcing valid syntax; the target schema is supplied in the prompt.
+* **TOON track (T):** TOON output followed by CLI decoding. Prompts used **universal examples** (not custom-tailored to the specific schema) to ensure a fair comparison with JSON.
+
+**Sampling & evaluation:**
+* **Parameters:** Temperature 0 for (near-)deterministic output.
+* **Runs:** 10 iterations per test case per model (21 models via [Nebius API](https://tokenfactory.nebius.com/)).
+* **Process:**
+  1. Model generates output (J, JSO, or T).
+  2. 
(TOON only) CLI decodes to JSON. CLI errors trigger a **repair cycle**.
+  3. Validation via Pydantic and data canonicalization.
+  4. Comparison with the gold standard.
+  5. **Repair cycle:** If validation or comparison fails, the previous output and the error text are inserted into the prompt (up to 3 attempts total).
+
+### Key findings
+
+* **Aligned data ("sweet spot"):** TOON excels on tabular and uniformly nested structures (e.g., invoices, orders), reaching **90.5%** 1-shot accuracy on the users case while offering significant token savings.
+* **Prompt tax:** Unlike JSON, which is native to model training, TOON requires instructional prompting. For short outputs this overhead reduces efficiency; for larger outputs (batches/logs), the syntax savings amortize the cost.
+* **Structured output trade-off:** Constrained decoding (JSO) acts as a safety net for smaller models (preventing syntax errors) but was found to degrade reasoning/accuracy in some larger models (the "structured output paradox").
+
+### Results by data topology
+
+Performance varies significantly with how well the data aligns with TOON's design (e.g., uniform arrays vs. deep recursive nesting).
+
+| Case | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **users** | 94.8% | 94.8% | 1078 | 92.9% | **100%** | 556 | **90.5%** | 90.5% | 840 |
+| **order** | 81.9% | 81.9% | 1746 | 78.6% | 83.3% | 1255 | 74.3% | 78.6% | 1585 |
+| **company** | 18.6% | 43.8% | 3575 | **21.9%** | **48.1%** | 2592 | 0.0% | 48.6% | 2567 |
+| **invoice** | 90.0% | 90.0% | 1723 | 87.6% | **95.2%** | 1349 | 0.0% | 52.4% | 3626 |
+
+### Full results by model
+
+The following table compares **1-shot accuracy (1-S)**, **final accuracy (Fin)** after repair loops, and the total **token budget (Tok)** required for successful generation.
+
+| Model | J (1-S) | J (Fin) | J (Tok) | JSO (1-S) | JSO (Fin) | JSO (Tok) | T (1-S) | T (Fin) | T (Tok) |
+| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **NousResearch/Hermes-4-405B** | 92.5% | 92.5% | 3252 | 35.0% | **100%** | 4759 | 50.0% | 60.0% | 4671 |
+| **NousResearch/Hermes-4-70B** | 75.0% | 75.0% | 4414 | 37.5% | 75.0% | 5594 | 50.0% | 50.0% | 4738 |
+| **PrimeIntellect/INTELLECT-3** | 72.5% | 75.0% | 10682 | 72.5% | 77.5% | 10103 | 40.0% | 65.0% | 13315 |
+| **Qwen/Qwen2.5-Coder-7B-fast** | 0.0% | 0.0% | 37705 | 75.0% | 75.0% | 4440 | 27.5% | 27.5% | 32715 |
+| **Qwen/Qwen3-235B-A22B-Inst** | **100%** | **100%** | 2772 | **100%** | **100%** | 2772 | 50.0% | **100%** | 4715 |
+| **Qwen/Qwen3-235B-A22B-Thk** | 82.5% | 82.5% | 11425 | 87.5% | 97.5% | 7899 | 50.0% | 97.5% | 17457 |
+| **Qwen/Qwen3-30B-A3B-Inst** | 75.0% | 75.0% | 4436 | 75.0% | 75.0% | 4436 | 50.0% | 70.0% | 5505 |
+| **Qwen/Qwen3-32B** | 75.0% | 77.5% | 10196 | 75.0% | 75.0% | 4120 | 47.5% | 80.0% | 9101 |
+| **Qwen/Qwen3-Coder-30B-A3B** | 75.0% | 75.0% | 4206 | 75.0% | 75.0% | 4206 | 50.0% | **100%** | 4719 |
+| **Qwen/Qwen3-Coder-480B** | 75.0% | 75.0% | 4462 | 75.0% | 75.0% | 4447 | 50.0% | 75.0% | 4515 |
+| **deepseek-ai/DeepSeek-R1** | 55.0% | 70.0% | 13811 | 65.0% | 80.0% | 4149 | 25.0% | 50.0% | 19047 |
+| **deepseek-ai/DeepSeek-V3-fast** | 75.0% | **100%** | 3600 | 75.0% | **100%** | 3584 | 25.0% | 80.0% | 4734 |
+| **google/gemma-2-2b-it** | 75.0% | **100%** | 4721 | 77.5% | **100%** | 4566 | 0.0% | 0.0% | 5955 |
+| **google/gemma-2-9b-it-fast** | 75.0% | 75.0% | 6086 | 75.0% | 75.0% | 6056 | 50.0% | 75.0% | 5419 |
+| **meta-llama/Llama-3.3-70B** | 75.0% | 75.0% | 4551 | 75.0% | 75.0% | 4447 | 50.0% | 50.0% | 5148 |
+| **meta-llama/Llama-3.1-8B** | 72.5% | 72.5% | 7235 | 75.0% | 75.0% | 6941 | 22.5% | 25.0% | 4915 |
+| **moonshotai/Kimi-K2-Instruct** | 50.0% | 75.0% | 4284 | 50.0% | 75.0% | 4283 | 50.0% | **100%** | 3937 |
+| 
**nvidia/Llama-3_1-Nemotron** | 75.0% | 75.0% | 4426 | 50.0% | 50.0% | 5714 | 50.0% | 82.5% | 4368 |
+| **openai/gpt-oss-120b** | **97.5%** | **100%** | 3685 | **100%** | **100%** | 3545 | 50.0% | 87.5% | 8223 |
+| **openai/gpt-oss-20b** | 50.0% | 72.5% | 14943 | 50.0% | 67.5% | 15601 | 50.0% | 90.0% | 9678 |
+| **zai-org/GLM-4.5** | 75.0% | 87.5% | 9677 | 75.0% | 92.5% | 9135 | 27.5% | 52.5% | 8110 |
+
+### Observations
+
+**1. The "Structured Output Paradox"**
+Constrained decoding is not always superior. For `Hermes-4-405B`, applying constraints dropped 1-shot accuracy from **92.5%** (plain JSON) to **35.0%** (structured output). This suggests that for some high-reasoning models, forcing specific grammar paths can actively interfere with the model's reasoning.
+
+**2. Guardrails for smaller models**
+Conversely, for smaller models like `Qwen/Qwen2.5-Coder-7B-fast`, structured output is essential: it raised performance from a catastrophic **0%** (plain JSON) to a viable **75%**.
+
+**3. TOON repair potential**
+While TOON often has lower initial 1-shot accuracy due to the novelty of the format, several models (`Qwen/Qwen3-Coder-30B`, `Kimi-K2-Instruct`, `Qwen/Qwen3-235B`) achieved **100% final accuracy** after repair loops. This indicates that while the format may be unfamiliar at first, the error messages produced by the TOON CLI are highly effective for self-correction.
+
+**4. Token efficiency scaling**
+In cases like `Qwen3-235B-A22B-Inst`, TOON consumed significantly more tokens (~4700) than JSON (~2700). This supports the "prompt tax" hypothesis: for short tasks, the instructional overhead outweighs the syntax savings. TOON becomes efficient primarily in high-volume generation, where the output length justifies the system prompt.
+
+### Analysis & recommendations
+
+1. **Aligned data streams:** Use TOON generation for **SQL dumps, logs, and transactional documents**. The token savings on high-volume, uniform data outweigh the prompt overhead.
+2. **Avoid deep nesting:** For deeply nested or recursive state trees (like DOMs), stick to **JSON** or **JSO**. TOON's indentation tracking is less robust for these structures in one-shot generation.
+3. **Repair loops:** TOON generation benefits disproportionately from repair loops (feeding errors back into context), often correcting format issues that a single generation pass cannot.
+
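The aligned-data recommendation above is easy to see mechanically: a uniform TOON array decodes with one header parse plus one split per row, and the declared `[N]` count gives the decoder a precise error to report when rows are missing. The sketch below is a toy decoder for the tabular subset only (hypothetical code, not the official `@toon-format/cli`, which also infers value types; here values stay strings):

```python
import re

def decode_tabular(toon: str) -> dict:
    """Decode a single tabular TOON block: `name[N]{f1,f2}:` plus N comma-separated rows."""
    lines = [ln for ln in toon.strip().splitlines() if ln.strip()]
    m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:$", lines[0].strip())
    if not m:
        raise ValueError("not a tabular TOON header")
    name, count, fields = m.group(1), int(m.group(2)), m.group(3).split(",")
    rows = [dict(zip(fields, ln.strip().split(","))) for ln in lines[1:]]
    if len(rows) != count:
        # This is the kind of precise error the repair loops feed back to the model.
        raise ValueError(f"[N]={count} but found {len(rows)} rows")
    return {name: rows}

sample = """users[3]{id,name,role}:
  1,Alice,admin
  2,Bob,staff
  3,Eve,guest"""
print(decode_tabular(sample)["users"][0])  # first decoded row as a dict
```

Deeply nested or recursive data has no such cheap row-wise decode path, which is consistent with the company case results above.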
+Installation & Usage (click to expand) + +
+ +This repository contains two main scripts: + +- **`generate.py`** – builds the gold-standard reference outputs used for evaluation. +- **`eval.py`** – runs the full benchmark across all models and decoding strategies. + +Before running the benchmark, install dependencies and create the gold files. + +### **1. Install Python dependencies** + +```bash +pip install -r requirements.txt +``` + +### **2. Install the TOON CLI (required for encoding/decoding)** + +```bash +npm install -g @toon-format/cli +``` + +Alternatively, you can rely on `npx` without a global install. + +### **3. Generate the gold reference outputs** + +This step must be run **once**, or whenever you modify the schemas in `generate.py`: + +```bash +python generate.py +``` + +This will create: + +``` +gold/users.gold.json gold/users.gold.toon +gold/order.gold.json gold/order.gold.toon +gold/company.gold.json gold/company.gold.toon +gold/invoice.gold.json gold/invoice.gold.toon +``` + +### **4. Set your model API key** + +The benchmark uses the Nebius Token Factory API. Set: + +```bash +export LLM_API_KEY="your_nebius_api_key" +``` + +### **5. Run the full benchmark** + +```bash +python eval.py +``` + +This will: + +- Run all test cases for 21 models +- Perform JSON, JSON-SO, and TOON generation +- Decode TOON outputs via CLI +- Validate against the gold standard +- Apply repair loops +- Write per-run statistics to: + +``` +eval_runs.csv +``` + +### **Repository structure** + +``` +├── generate.py # Defines schemas, builds gold objects, writes gold/*.json + *.toon +├── eval.py # Full benchmark runner +├── gold/ # Auto-generated canonical reference data +│ ├── *.gold.json +│ ├── *.gold.toon +├── requirements.txt +└── README.md +``` + +
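Internally, every track in `eval.py` shares the same try/validate/repair shape. The following is a condensed sketch of that cycle (hypothetical helper names; the real tracks differ only in how they generate and parse output):

```python
MAX_ATTEMPTS = 3  # one shot plus up to two repairs, matching the benchmark

def run_with_repairs(generate, parse, matches_gold, make_repair_prompt, first_prompt):
    """Generic repair cycle: generate, parse/validate, compare, then feed errors back."""
    prompt = first_prompt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        output = generate(prompt)              # model call (J, JSO, or TOON track)
        try:
            parsed = parse(output)             # json.loads / TOON-CLI decode + Pydantic
            if matches_gold(parsed):
                return {"final_ok": True, "attempts_used": attempt}
            error = "Structure valid but values differ from expected gold."
        except Exception as exc:               # syntax or validation failure
            error = str(exc)
        prompt = make_repair_prompt(output, error)  # error text goes back into the prompt
    return {"final_ok": False, "attempts_used": MAX_ATTEMPTS}
```

A run counts as "1-shot" when the first attempt already matches the gold object; "final" accuracy also counts successes on repair attempts.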
diff --git a/benchmarks/generation/eval.py b/benchmarks/generation/eval.py
new file mode 100644
index 00000000..53d65a24
--- /dev/null
+++ b/benchmarks/generation/eval.py
@@ -0,0 +1,876 @@
+# eval.py
+import json
+import os
+import re
+import csv
+import subprocess
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Tuple, Type
+
+from pydantic import BaseModel, TypeAdapter
+from openai import OpenAI
+from openai import APIError, InternalServerError, RateLimitError
+
+# --- Import Pydantic models from generate.py ---
+from generate import (
+    UserRow, Order,
+    Company, Invoice,
+)
+
+# =========================================
+# Config: models + runs + output CSV
+# =========================================
+MODELS = [
+    'deepseek-ai/DeepSeek-V3-0324-fast',
+    'openai/gpt-oss-120b',
+    'moonshotai/Kimi-K2-Instruct',
+    'Qwen/Qwen3-Coder-480B-A35B-Instruct',
+    'NousResearch/Hermes-4-405B',
+    'NousResearch/Hermes-4-70B',
+    'openai/gpt-oss-20b',
+    'zai-org/GLM-4.5',
+    'deepseek-ai/DeepSeek-R1-0528',
+    'PrimeIntellect/INTELLECT-3',
+    'Qwen/Qwen3-235B-A22B-Thinking-2507',
+    'Qwen/Qwen3-235B-A22B-Instruct-2507',
+    'Qwen/Qwen3-30B-A3B-Instruct-2507',
+    'Qwen/Qwen3-Coder-30B-A3B-Instruct',
+    'Qwen/Qwen3-32B',
+    'nvidia/Llama-3_1-Nemotron-Ultra-253B-v1',
+    'meta-llama/Llama-3.3-70B-Instruct',
+    'meta-llama/Meta-Llama-3.1-8B-Instruct',
+    'Qwen/Qwen2.5-Coder-7B-fast',
+    'google/gemma-2-2b-it',
+    'google/gemma-2-9b-it-fast',
+]
+RUNS_PER_MODEL = 10
+CSV_PATH = Path("eval_runs.csv")
+
+# =========================================
+# LLM client
+# =========================================
+LLM_API_KEY = os.environ.get("LLM_API_KEY")
+if not LLM_API_KEY:
+    raise RuntimeError("Missing LLM_API_KEY environment variable")
+
+client = OpenAI(
+    base_url="https://api.studio.nebius.com/v1/",
+    api_key=LLM_API_KEY,
+)
+
+SYSTEM_PROMPT = (
+    "You are a data-formatting model. "
+    "Follow instructions exactly. 
When asked for JSON, you must return JSON that conforms to the provided JSON Schema. " + "No extra text. When asked for TOON, return only a ```toon fenced block." +) + +# ========================================= +# Retry wrapper for API calls +# ========================================= +def retry_on_error(func, max_retries=5, initial_delay=2.0): + """Retry a function with exponential backoff on API errors.""" + for attempt in range(max_retries): + try: + return func() + except (InternalServerError, APIError, RateLimitError) as e: + if attempt == max_retries - 1: + print(f"Failed after {max_retries} attempts: {e}") + raise + + delay = initial_delay * (2 ** attempt) + print(f"API error (attempt {attempt + 1}/{max_retries}): {e}") + print(f"Retrying in {delay:.1f} seconds...") + time.sleep(delay) + except Exception as e: + # Don't retry on other exceptions (validation errors, etc.) + raise + +# ========================================= +# Structured JSON call (json_schema) +# ========================================= +def llm_call_json_structured(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]: + """Return (json_text, prompt_tokens, completion_tokens) with JSON object output.""" + print(f"Calling {model} json_structured") + # Add schema to prompt for guidance + schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}" + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + response_format={ + "type": "json_object", + }, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": schema_prompt}, + ], + ) + + msg = resp.choices[0].message + + # Handle refusal + if msg.refusal: + raise ValueError(f"Model refused: {msg.refusal}") + + text = (msg.content or "").strip() + usage = getattr(resp, "usage", None) + p = getattr(usage, 
"prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Plain JSON call (no response_format) +# ========================================= +def llm_call_json_plain(model: str, prompt: str, schema_model: Type[BaseModel]) -> Tuple[str, int, int]: + """Return (json_text, prompt_tokens, completion_tokens) with plain text completion.""" + print(f"Calling {model} json_plain") + # Add schema to prompt for guidance + schema_prompt = f"{prompt}\n\nReturn valid JSON matching this schema:\n{json.dumps(schema_model.model_json_schema(), indent=2)}" + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": schema_prompt}, + ], + ) + + text = resp.choices[0].message.content or "" + # Remove think tags + text = re.sub(r".*?", "", text, flags=re.DOTALL).strip() + # Remove markdown code fences if present + text = re.sub(r"```(?:json)?\s*(.*?)```", r"\1", text, flags=re.DOTALL).strip() + + usage = getattr(resp, "usage", None) + p = getattr(usage, "prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Plain call (for TOON generation) +# ========================================= +def llm_call_plain(model: str, prompt: str) -> Tuple[str, int, int]: + print(f"Calling {model} plain") + + def _call(): + resp = client.chat.completions.create( + model=model, + max_tokens=10000, + temperature=0.0, + top_p=1.0, + extra_body={"top_k": 50}, + messages=[ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": prompt}, + ], + ) + text = resp.choices[0].message.content or "" + text = re.sub(r".*?", "", text, 
flags=re.DOTALL).strip() + usage = getattr(resp, "usage", None) + p = getattr(usage, "prompt_tokens", 0) if usage else 0 + c = getattr(usage, "completion_tokens", 0) if usage else 0 + return text, p, c + + return retry_on_error(_call) + +# ========================================= +# Paths +# ========================================= +GOLD = Path("gold") +USERS_JSON = GOLD / "users.gold.json" +ORDER_JSON = GOLD / "order.gold.json" +COMPANY_JSON = GOLD / "company.gold.json" +INVOICE_JSON = GOLD / "invoice.gold.json" + +# ========================================= +# Canonicalization (stable compare) +# ========================================= +def sort_users_by_id(obj: Dict[str, Any]) -> Dict[str, Any]: + if "users" in obj and isinstance(obj["users"], list): + obj["users"] = sorted(obj["users"], key=lambda r: r.get("id")) + return obj + +def sort_order_items(obj: Dict[str, Any]) -> Dict[str, Any]: + if "items" in obj and isinstance(obj["items"], list): + obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku")) + return obj + +def sort_company(obj: Dict[str, Any]) -> Dict[str, Any]: + if "departments" in obj and isinstance(obj["departments"], list): + obj["departments"] = sorted(obj["departments"], key=lambda d: d.get("code")) + for d in obj["departments"]: + if isinstance(d, dict) and "employees" in d and isinstance(d["employees"], list): + d["employees"] = sorted(d["employees"], key=lambda e: e.get("id")) + return obj + +def sort_invoice(obj: Dict[str, Any]) -> Dict[str, Any]: + if "items" in obj and isinstance(obj["items"], list): + obj["items"] = sorted(obj["items"], key=lambda r: r.get("sku")) + return obj + +def canonical_json(obj: Any, case: str) -> Any: + if case == "users": return sort_users_by_id(obj) + if case == "order": return sort_order_items(obj) + if case == "company": return sort_company(obj) + if case == "invoice": return sort_invoice(obj) + return obj + +# ========================================= +# Pydantic validation (+ shape 
normalization) +# ========================================= +class UsersPayload(BaseModel): + users: List[UserRow] + +def validate_users_json(data: Any) -> List[UserRow]: + if not isinstance(data, dict) or "users" not in data: + raise ValueError("Expected object with key 'users'") + adapter = TypeAdapter(List[UserRow]) + return adapter.validate_python(data["users"]) + +def normalize_by_key(data: Any, key: str) -> Any: + if isinstance(data, dict) and key in data and isinstance(data[key], dict): + return data[key] + return data + +def validate_order_json(data: Any) -> Order: + data = normalize_by_key(data, "order") # TOON may wrap + adapter = TypeAdapter(Order) + return adapter.validate_python(data) + +def validate_company_json(data: Any) -> Company: + data = normalize_by_key(data, "company") + adapter = TypeAdapter(Company) + return adapter.validate_python(data) + +def validate_invoice_json(data: Any) -> Invoice: + data = normalize_by_key(data, "invoice") + adapter = TypeAdapter(Invoice) + return adapter.validate_python(data) + +# ========================================= +# TOON decode via official CLI +# ========================================= +def extract_toon_payload(toon_text: str) -> str: + m = re.search(r"```toon\s*(.*?)```", toon_text, flags=re.DOTALL | re.IGNORECASE) + return m.group(1).strip() if m else toon_text.strip() + +def decode_toon_to_json(toon_text: str) -> Any: + payload = extract_toon_payload(toon_text) + proc = subprocess.run( + ["npx", "@toon-format/cli", "--decode"], + input=payload.encode("utf-8"), + capture_output=True, + check=True, + ) + return json.loads(proc.stdout.decode("utf-8")) + +# ========================================= +# Prompts — JSON (structured) / TOON +# ========================================= +def make_json_prompt_users() -> str: + return ( + "Create a user directory with three users:\n" + "- User 1: Alice, who is an admin\n" + "- User 2: Bob, who is a staff member\n" + "- User 3: Eve, who is a guest\n\n" + "Return 
the data as JSON with a 'users' array containing objects with id, name, and role fields." + ) + +def make_json_prompt_order() -> str: + return ( + "Create an order record:\n" + "- Order ID: 101\n" + "- Customer: Ada (ID: 9)\n" + "- Items:\n" + " * Product A1: quantity 2, price $9.99 each\n" + " * Product B2: quantity 1, price $14.50 each\n\n" + "Return as JSON with fields for id, customer (with id and name), and items array (with sku, qty, price)." + ) + +def make_json_prompt_company() -> str: + return ( + "Create a company organization structure:\n" + "- Company: Acme (ID: 1)\n" + "- Engineering Department (code: ENG):\n" + " * Alice (ID: 1) - engineer\n" + " * Bob (ID: 2) - manager\n" + "- Operations Department (code: OPS):\n" + " * Eve (ID: 3) - analyst\n\n" + "Return as JSON with company info and nested departments array, each containing employees." + ) + +def make_json_prompt_invoice() -> str: + return ( + "Create an invoice:\n" + "- Invoice number: INV-2025-001\n" + "- Currency: USD\n" + "- Customer: Ada (ID: 9)\n" + "- Line items:\n" + " * A1: quantity 2 @ $9.99 each = $19.98\n" + " * B2: quantity 1 @ $14.50 each = $14.50\n" + "- Subtotal: $34.48\n" + "- Tax: $6.90\n" + "- Grand total: $41.38\n" + "- Notes: Thank you for your business.\n\n" + "Return as JSON with all invoice details including items array and totals breakdown." 
+ ) + +# ========================================= +# Improved TOON Prompts (short nested examples + same tasks) +# ========================================= + + + + + + +def make_toon_prompt_users() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an array named users with fields id, name, and role.\n" + "User data:\n" + "- id=1, name=Alice, role=admin\n" + "- id=2, name=Bob, role=staff\n" + "- id=3, name=Eve, role=guest\n\n" + "Output only the TOON code block.\n" + ) + + + + +def make_toon_prompt_order() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " 
version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an order record with fields: id, customer (with id and name), " + "and items array (with sku, qty, price).\n" + "- Order ID: 101\n" + "- Customer: Ada (ID: 9)\n" + "- Items:\n" + " * Product A1: quantity 2, price $9.99 each\n" + " * Product B2: quantity 1, price $14.50 each\n" + ) + + +def make_toon_prompt_company() -> str: + return ( + "You are to produce output STRICTLY in TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create a company organization structure with company info and nested departments array, each containing employees:\n" + "- Company: Acme (ID: 1)\n" + "- Engineering Department (code: ENG):\n" + " * Alice (ID: 1) - engineer\n" + " * Bob (ID: 2) - manager\n" + "- Operations Department (code: OPS):\n" + " * Eve (ID: 3) - analyst\n\n" + ) + + +def make_toon_prompt_invoice() -> str: + return ( + "You are to produce output STRICTLY in 
TOON format.\n\n" + "TOON RULES:\n" + "- Use 2-space indentation\n" + "- Scalars: fieldName: value\n" + "- Objects: fieldName: then nested fields indented\n" + "- Arrays of objects:\n" + " arrayName[N]:\n" + " - field1: value1\n" + " field2: value2\n" + "- Tabular arrays (for simple data):\n" + " arrayName[N]{field1,field2}:\n" + " val1,val2\n" + " val3,val4\n" + "- [N] MUST equal actual row/item count\n" + "- Output ONLY a ```toon code block\n\n" + "Reference example:\n" + "```toon\n" + "id: 100\n" + "type: Sample\n" + "metadata:\n" + " version: 1\n" + " author: Alex\n" + "sections[2]:\n" + " - code: A\n" + " title: Introduction\n" + " items[2]{id,value}:\n" + " 1,First\n" + " 2,Second\n" + " - code: B\n" + " title: Details\n" + " items[1]{id,value}:\n" + " 3,Third\n" + "summary:\n" + " total: 3\n" + " status: complete\n" + "```\n\n" + "TASK:\n" + "Create an invoice with all invoice details including items array and totals breakdown:\n" + "- Invoice number: INV-2025-001\n" + "- Currency: USD\n" + "- Customer: Ada (ID: 9)\n" + "- Line items:\n" + " * A1: quantity 2 @ $9.99 each = $19.98\n" + " * B2: quantity 1 @ $14.50 each = $14.50\n" + "- Subtotal: $34.48\n" + "- Tax: $6.90\n" + "- Grand total: $41.38\n" + "- Notes: Thank you for your business.\n" + ) +# ========================================= +# Repair prompts +# ========================================= +def make_json_repair_prompt(prev_output: str, error_msg: str) -> str: + return ( + "Your previous JSON did not validate against the schema. " + "Return ONLY valid JSON (no prose, no fences) that matches the schema and the target values.\n" + f"Validation error:\n{error_msg}\n\n" + "Previous output:\n" + f"{prev_output}\n" + ) + +def make_toon_repair_prompt(prev_output: str, error_msg: str) -> str: + return ( + "Your previous TOON was invalid. 
Return ONLY a ```toon fenced block.\n"
+        "- Use 2-space indentation; no trailing spaces.\n"
+        "- Ensure headers/fieldsets and [N] match row counts.\n"
+        f"Validation/decoding error:\n{error_msg}\n\n"
+        "Previous output:\n"
+        f"{prev_output}\n"
+    )
+
+# =========================================
+# Core evaluation (one shot + up to 2 repairs)
+# =========================================
+MAX_ATTEMPTS = 3
+
+def eval_json_track(
+    model: str,
+    make_prompt_fn,
+    schema_model: Type[BaseModel],
+    validate_fn,
+    gold_obj,
+    canon_case: str,
+):
+    tokens_p = tokens_c = 0
+    prompt = make_prompt_fn()
+    out, p, c = llm_call_json_structured(model, prompt, schema_model); tokens_p += p; tokens_c += c
+    try:
+        parsed = json.loads(out)
+        # print(f"JSON SO parsed: {parsed}")
+        validate_fn(parsed)  # Pydantic
+        parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+        one_shot_ok = final_ok = (parsed == gold_obj)
+        if final_ok:
+            return dict(one_shot_ok=True, final_ok=True, attempts_used=1,
+                        tokens_prompt=tokens_p, tokens_completion=tokens_c)
+    except Exception as e:
+        err = str(e); one_shot_ok = False; prev = out
+    else:
+        err = "Structure valid but values differ from expected gold."; prev = out
+
+    for i in range(1, MAX_ATTEMPTS):
+        repair_prompt = make_json_repair_prompt(prev, err)
+        out, p, c = llm_call_json_structured(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c
+        try:
+            parsed = json.loads(out)
+            validate_fn(parsed)
+            parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case)
+            final_ok = (parsed == gold_obj)
+            if final_ok:
+                return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1,
+                            tokens_prompt=tokens_p, tokens_completion=tokens_c)
+            else:
+                err = "Structure valid but values differ from expected gold."
+ prev = out + except Exception as e: + err = str(e); prev = out + continue + + return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + +def eval_json_plain_track( + model: str, + make_prompt_fn, + schema_model: Type[BaseModel], + validate_fn, + gold_obj, + canon_case: str, +): + """Evaluate JSON generation without response_format (plain completion).""" + tokens_p = tokens_c = 0 + prompt = make_prompt_fn() + out, p, c = llm_call_json_plain(model, prompt, schema_model); tokens_p += p; tokens_c += c + try: + parsed = json.loads(out) + # print(f"JSON plain parsed: {parsed}") + validate_fn(parsed) # Pydantic + parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case) + one_shot_ok = final_ok = (parsed == gold_obj) + if final_ok: + return dict(one_shot_ok=True, final_ok=True, attempts_used=1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + except Exception as e: + err = str(e); one_shot_ok = False; prev = out + else: + err = "Structure valid but values differ from expected gold."; prev = out + + for i in range(1, MAX_ATTEMPTS): + repair_prompt = make_json_repair_prompt(prev, err) + out, p, c = llm_call_json_plain(model, repair_prompt, schema_model); tokens_p += p; tokens_c += c + try: + parsed = json.loads(out) + validate_fn(parsed) + parsed = canonical_json(normalize_by_key(parsed, canon_case), canon_case) + final_ok = (parsed == gold_obj) + if final_ok: + return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + else: + err = "Structure valid but values differ from expected gold." 
+ prev = out + except Exception as e: + err = str(e); prev = out + continue + + return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + +def eval_toon_track(model: str, make_prompt_fn, validate_fn, gold_obj, canon_case: str): + tokens_p = tokens_c = 0 + prompt = make_prompt_fn() + out, p, c = llm_call_plain(model, prompt); tokens_p += p; tokens_c += c + try: + decoded = decode_toon_to_json(out) + # print(f"TOON decoded: {decoded}") + validate_fn(decoded) + decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case) + one_shot_ok = final_ok = (decoded == gold_obj) + if final_ok: + return dict(one_shot_ok=True, final_ok=True, attempts_used=1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + except Exception as e: + err = str(e); one_shot_ok = False; prev = out + else: + err = "Structure valid but values differ from expected gold."; prev = out + + for i in range(1, MAX_ATTEMPTS): + repair_prompt = make_toon_repair_prompt(prev, err) + out, p, c = llm_call_plain(model, repair_prompt); tokens_p += p; tokens_c += c + try: + decoded = decode_toon_to_json(out) + validate_fn(decoded) + decoded = canonical_json(normalize_by_key(decoded, canon_case), canon_case) + final_ok = (decoded == gold_obj) + if final_ok: + return dict(one_shot_ok=one_shot_ok, final_ok=True, attempts_used=i+1, + tokens_prompt=tokens_p, tokens_completion=tokens_c) + else: + err = "Structure valid but values differ from expected gold." 
+                prev = out
+        except Exception as e:
+            err = str(e); prev = out
+            continue
+
+    return dict(one_shot_ok=one_shot_ok, final_ok=False, attempts_used=MAX_ATTEMPTS,
+                tokens_prompt=tokens_p, tokens_completion=tokens_c)
+
+# =========================================
+# Case runners aggregating metrics
+# =========================================
+def run_case_users(model: str):
+    gold = json.loads(USERS_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "users")
+    jm = eval_json_track(model, make_json_prompt_users, UsersPayload, validate_users_json, gold, "users")
+    jpm = eval_json_plain_track(model, make_json_prompt_users, UsersPayload, validate_users_json, gold, "users")
+    tm = eval_toon_track(model, make_toon_prompt_users, validate_users_json, gold, "users")
+    return {
+        "users_json_one_shot": jm["one_shot_ok"], "users_json_final": jm["final_ok"],
+        "users_json_attempts": jm["attempts_used"],
+        "users_json_tokens_prompt": jm["tokens_prompt"], "users_json_tokens_completion": jm["tokens_completion"],
+        "users_json_plain_one_shot": jpm["one_shot_ok"], "users_json_plain_final": jpm["final_ok"],
+        "users_json_plain_attempts": jpm["attempts_used"],
+        "users_json_plain_tokens_prompt": jpm["tokens_prompt"], "users_json_plain_tokens_completion": jpm["tokens_completion"],
+        "users_toon_one_shot": tm["one_shot_ok"], "users_toon_final": tm["final_ok"],
+        "users_toon_attempts": tm["attempts_used"],
+        "users_toon_tokens_prompt": tm["tokens_prompt"], "users_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_order(model: str):
+    gold = json.loads(ORDER_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "order")
+    jm = eval_json_track(model, make_json_prompt_order, Order, validate_order_json, gold, "order")
+    jpm = eval_json_plain_track(model, make_json_prompt_order, Order, validate_order_json, gold, "order")
+    tm = eval_toon_track(model, make_toon_prompt_order, validate_order_json, gold, "order")
+    return {
+        "order_json_one_shot": jm["one_shot_ok"], "order_json_final": jm["final_ok"],
+        "order_json_attempts": jm["attempts_used"],
+        "order_json_tokens_prompt": jm["tokens_prompt"], "order_json_tokens_completion": jm["tokens_completion"],
+        "order_json_plain_one_shot": jpm["one_shot_ok"], "order_json_plain_final": jpm["final_ok"],
+        "order_json_plain_attempts": jpm["attempts_used"],
+        "order_json_plain_tokens_prompt": jpm["tokens_prompt"], "order_json_plain_tokens_completion": jpm["tokens_completion"],
+        "order_toon_one_shot": tm["one_shot_ok"], "order_toon_final": tm["final_ok"],
+        "order_toon_attempts": tm["attempts_used"],
+        "order_toon_tokens_prompt": tm["tokens_prompt"], "order_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_company(model: str):
+    gold = json.loads(COMPANY_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "company")
+    jm = eval_json_track(model, make_json_prompt_company, Company, validate_company_json, gold, "company")
+    jpm = eval_json_plain_track(model, make_json_prompt_company, Company, validate_company_json, gold, "company")
+    tm = eval_toon_track(model, make_toon_prompt_company, validate_company_json, gold, "company")
+    return {
+        "company_json_one_shot": jm["one_shot_ok"], "company_json_final": jm["final_ok"],
+        "company_json_attempts": jm["attempts_used"],
+        "company_json_tokens_prompt": jm["tokens_prompt"], "company_json_tokens_completion": jm["tokens_completion"],
+        "company_json_plain_one_shot": jpm["one_shot_ok"], "company_json_plain_final": jpm["final_ok"],
+        "company_json_plain_attempts": jpm["attempts_used"],
+        "company_json_plain_tokens_prompt": jpm["tokens_prompt"], "company_json_plain_tokens_completion": jpm["tokens_completion"],
+        "company_toon_one_shot": tm["one_shot_ok"], "company_toon_final": tm["final_ok"],
+        "company_toon_attempts": tm["attempts_used"],
+        "company_toon_tokens_prompt": tm["tokens_prompt"], "company_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+def run_case_invoice(model: str):
+    gold = json.loads(INVOICE_JSON.read_text(encoding="utf-8"))
+    gold = canonical_json(gold, "invoice")
+    jm = eval_json_track(model, make_json_prompt_invoice, Invoice, validate_invoice_json, gold, "invoice")
+    jpm = eval_json_plain_track(model, make_json_prompt_invoice, Invoice, validate_invoice_json, gold, "invoice")
+    tm = eval_toon_track(model, make_toon_prompt_invoice, validate_invoice_json, gold, "invoice")
+    return {
+        "invoice_json_one_shot": jm["one_shot_ok"], "invoice_json_final": jm["final_ok"],
+        "invoice_json_attempts": jm["attempts_used"],
+        "invoice_json_tokens_prompt": jm["tokens_prompt"], "invoice_json_tokens_completion": jm["tokens_completion"],
+        "invoice_json_plain_one_shot": jpm["one_shot_ok"], "invoice_json_plain_final": jpm["final_ok"],
+        "invoice_json_plain_attempts": jpm["attempts_used"],
+        "invoice_json_plain_tokens_prompt": jpm["tokens_prompt"], "invoice_json_plain_tokens_completion": jpm["tokens_completion"],
+        "invoice_toon_one_shot": tm["one_shot_ok"], "invoice_toon_final": tm["final_ok"],
+        "invoice_toon_attempts": tm["attempts_used"],
+        "invoice_toon_tokens_prompt": tm["tokens_prompt"], "invoice_toon_tokens_completion": tm["tokens_completion"],
+    }
+
+# =========================================
+# Summary helpers
+# =========================================
+def summarize_formats(results: Dict[str, Any]) -> Dict[str, Any]:
+    cases = ["users", "order", "company", "invoice"]
+    summary = {}
+    for fmt in ["json", "json_plain", "toon"]:
+        one_shot_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_one_shot"))
+        final_hits = sum(1 for case in cases if results.get(f"{case}_{fmt}_final"))
+        n = len(cases)
+        prompt_tokens = sum(results.get(f"{case}_{fmt}_tokens_prompt", 0) for case in cases)
+        comp_tokens = sum(results.get(f"{case}_{fmt}_tokens_completion", 0) for case in cases)
+        summary[f"{fmt}_one_shot_accuracy"] = one_shot_hits / n if n else 0.0
+        summary[f"{fmt}_final_accuracy"] = final_hits / n if n else 0.0
+        summary[f"{fmt}_prompt_tokens"] = prompt_tokens
+        summary[f"{fmt}_completion_tokens"] = comp_tokens
+        summary[f"{fmt}_total_tokens"] = prompt_tokens + comp_tokens
+    summary["overall_prompt_tokens"] = summary["json_prompt_tokens"] + summary["json_plain_prompt_tokens"] + summary["toon_prompt_tokens"]
+    summary["overall_completion_tokens"] = summary["json_completion_tokens"] + summary["json_plain_completion_tokens"] + summary["toon_completion_tokens"]
+    summary["overall_total_tokens"] = summary["json_total_tokens"] + summary["json_plain_total_tokens"] + summary["toon_total_tokens"]
+    return summary
+
+def flatten_for_csv(model: str, run_idx: int, results: Dict[str, Any]) -> Dict[str, Any]:
+    row = {"model": model, "run": run_idx}
+    for case in ["users", "order", "company", "invoice"]:
+        for fmt in ["json", "json_plain", "toon"]:
+            row[f"{case}_{fmt}_one_shot"] = results.get(f"{case}_{fmt}_one_shot", False)
+            row[f"{case}_{fmt}_final"] = results.get(f"{case}_{fmt}_final", False)
+            row[f"{case}_{fmt}_attempts"] = results.get(f"{case}_{fmt}_attempts", 0)
+            row[f"{case}_{fmt}_prompt_tokens"] = results.get(f"{case}_{fmt}_tokens_prompt", 0)
+            row[f"{case}_{fmt}_completion_tokens"] = results.get(f"{case}_{fmt}_tokens_completion", 0)
+    summary = summarize_formats(results)
+    row.update({
+        "json_one_shot_accuracy": summary["json_one_shot_accuracy"],
+        "json_final_accuracy": summary["json_final_accuracy"],
+        "json_prompt_tokens": summary["json_prompt_tokens"],
+        "json_completion_tokens": summary["json_completion_tokens"],
+        "json_total_tokens": summary["json_total_tokens"],
+        "json_plain_one_shot_accuracy": summary["json_plain_one_shot_accuracy"],
+        "json_plain_final_accuracy": summary["json_plain_final_accuracy"],
+        "json_plain_prompt_tokens": summary["json_plain_prompt_tokens"],
+        "json_plain_completion_tokens": summary["json_plain_completion_tokens"],
+        "json_plain_total_tokens": summary["json_plain_total_tokens"],
+        "toon_one_shot_accuracy": summary["toon_one_shot_accuracy"],
+        "toon_final_accuracy": summary["toon_final_accuracy"],
+        "toon_prompt_tokens": summary["toon_prompt_tokens"],
+        "toon_completion_tokens": summary["toon_completion_tokens"],
+        "toon_total_tokens": summary["toon_total_tokens"],
+        "overall_prompt_tokens": summary["overall_prompt_tokens"],
+        "overall_completion_tokens": summary["overall_completion_tokens"],
+        "overall_total_tokens": summary["overall_total_tokens"],
+    })
+    return row
+
+# =========================================
+# Main (iterate models × runs, write CSV)
+# =========================================
+if __name__ == "__main__":
+    header_fields = ["model", "run"]
+    for case in ["users", "order", "company", "invoice"]:
+        for fmt in ["json", "json_plain", "toon"]:
+            header_fields += [
+                f"{case}_{fmt}_one_shot",
+                f"{case}_{fmt}_final",
+                f"{case}_{fmt}_attempts",
+                f"{case}_{fmt}_prompt_tokens",
+                f"{case}_{fmt}_completion_tokens",
+            ]
+    header_fields += [
+        "json_one_shot_accuracy","json_final_accuracy",
+        "json_prompt_tokens","json_completion_tokens","json_total_tokens",
+        "json_plain_one_shot_accuracy","json_plain_final_accuracy",
+        "json_plain_prompt_tokens","json_plain_completion_tokens","json_plain_total_tokens",
+        "toon_one_shot_accuracy","toon_final_accuracy",
+        "toon_prompt_tokens","toon_completion_tokens","toon_total_tokens",
+        "overall_prompt_tokens","overall_completion_tokens","overall_total_tokens",
+    ]
+
+    write_header = not CSV_PATH.exists()
+    with CSV_PATH.open("a", newline="", encoding="utf-8") as f:
+        writer = csv.DictWriter(f, fieldnames=header_fields)
+        if write_header:
+            writer.writeheader()
+
+        for model in MODELS:
+            print(f"Processing {model}...")
+            for run_idx in range(1, RUNS_PER_MODEL + 1):
+                print(f"Run {run_idx}...")
+                results: Dict[str, Any] = {}
+                results.update(run_case_users(model))
+                print("Users done")
+                results.update(run_case_order(model))
+                print("Order done")
+                results.update(run_case_company(model))
+                print("Company done")
+                results.update(run_case_invoice(model))
+                print("Invoice done")
+                row = flatten_for_csv(model, run_idx, results)
+                writer.writerow(row)
+
+    print(f"Wrote per-run stats to {CSV_PATH.resolve()}")
\ No newline at end of file
diff --git a/benchmarks/generation/generate.py b/benchmarks/generation/generate.py
new file mode 100644
index 00000000..8646c552
--- /dev/null
+++ b/benchmarks/generation/generate.py
@@ -0,0 +1,168 @@
+# generate.py
+from typing import List, Literal, Optional
+import json
+import subprocess
+from pathlib import Path
+from pydantic import BaseModel, ConfigDict, Field
+
+# ---------- Pydantic models (simple) ----------
+class UserRow(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str = Field(min_length=1)
+    role: Literal['admin', 'staff', 'guest']
+
+class Customer(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+
+class OrderItem(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    sku: str
+    qty: int
+    price: float
+
+class Order(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    customer: Customer
+    items: List[OrderItem]
+
+
+# ---------- Pydantic models (more complex #1: company) ----------
+class Employee(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+    title: Literal['engineer', 'manager', 'analyst']
+
+class Department(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    code: str
+    name: str
+    employees: List[Employee]
+
+class Company(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    id: int
+    name: str
+    departments: List[Department]
+
+
+# ---------- Pydantic models (more complex #2: invoice) ----------
+class InvoiceLine(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    sku: str
+    qty: int
+    unit_price: float
+    line_total: float  # keep explicit to avoid computed logic here
+
+class Totals(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    subtotal: float
+    tax: float
+    grand_total: float
+
+class Invoice(BaseModel):
+    model_config = ConfigDict(extra='forbid', strict=True)
+    number: str
+    currency: Literal['USD', 'EUR', 'SAR']
+    customer: Customer
+    items: List[InvoiceLine]
+    totals: Totals
+    notes: Optional[str] = None
+
+
+# ---------- Create gold Python objects ----------
+# 1) Tabular users
+users = [
+    UserRow(id=1, name="Alice", role="admin"),
+    UserRow(id=2, name="Bob", role="staff"),
+    UserRow(id=3, name="Eve", role="guest"),
+]
+users_gold = {"users": [u.model_dump() for u in users]}
+
+# 2) Nested order
+order_gold = Order(
+    id=101,
+    customer=Customer(id=9, name="Ada"),
+    items=[
+        OrderItem(sku="A1", qty=2, price=9.99),
+        OrderItem(sku="B2", qty=1, price=14.50),
+    ],
+).model_dump()
+
+# 3) More complex: company with nested tabular arrays
+company_gold = Company(
+    id=1,
+    name="Acme",
+    departments=[
+        Department(
+            code="ENG",
+            name="Engineering",
+            employees=[
+                Employee(id=1, name="Alice", title="engineer"),
+                Employee(id=2, name="Bob", title="manager"),
+            ],
+        ),
+        Department(
+            code="OPS",
+            name="Operations",
+            employees=[
+                Employee(id=3, name="Eve", title="analyst"),
+            ],
+        ),
+    ],
+).model_dump()
+
+# 4) More complex: invoice with nested objects + tabular line items
+invoice_gold = Invoice(
+    number="INV-2025-001",
+    currency="USD",
+    customer=Customer(id=9, name="Ada"),
+    items=[
+        InvoiceLine(sku="A1", qty=2, unit_price=9.99, line_total=19.98),
+        InvoiceLine(sku="B2", qty=1, unit_price=14.50, line_total=14.50),
+    ],
+    totals=Totals(subtotal=34.48, tax=6.90, grand_total=41.38),
+    notes="Thank you for your business.",
+).model_dump()
+
+
+# ---------- Write gold JSON to disk ----------
+outdir = Path("gold")
+outdir.mkdir(exist_ok=True)
+
+def write_json(path: Path, obj) -> None:
+    path.write_text(json.dumps(obj, ensure_ascii=False, separators=(",", ":")), encoding="utf-8")
+
+users_json_path = outdir / "users.gold.json"
+order_json_path = outdir / "order.gold.json"
+company_json_path = outdir / "company.gold.json"
+invoice_json_path = outdir / "invoice.gold.json"
+
+write_json(users_json_path, users_gold)
+write_json(order_json_path, order_gold)
+write_json(company_json_path, company_gold)
+write_json(invoice_json_path, invoice_gold)
+
+
+# ---------- Use TOON CLI via npx to encode JSON -> TOON ----------
+def encode_to_toon(json_path: Path, toon_path: Path) -> None:
+    subprocess.run(
+        ["npx", "@toon-format/cli", str(json_path), "-o", str(toon_path)],
+        check=True,
+    )
+
+encode_to_toon(users_json_path, outdir / "users.gold.toon")
+encode_to_toon(order_json_path, outdir / "order.gold.toon")
+encode_to_toon(company_json_path, outdir / "company.gold.toon")
+encode_to_toon(invoice_json_path, outdir / "invoice.gold.toon")
+
+print("Wrote:")
+for p in [users_json_path, outdir / "users.gold.toon",
+          order_json_path, outdir / "order.gold.toon",
+          company_json_path, outdir / "company.gold.toon",
+          invoice_json_path, outdir / "invoice.gold.toon"]:
+    print(f"  {p}")
diff --git a/benchmarks/generation/gold/company.gold.json b/benchmarks/generation/gold/company.gold.json
new file mode 100644
index 00000000..77238a84
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.json
@@ -0,0 +1 @@
+{"id":1,"name":"Acme","departments":[{"code":"ENG","name":"Engineering","employees":[{"id":1,"name":"Alice","title":"engineer"},{"id":2,"name":"Bob","title":"manager"}]},{"code":"OPS","name":"Operations","employees":[{"id":3,"name":"Eve","title":"analyst"}]}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/company.gold.toon b/benchmarks/generation/gold/company.gold.toon
new file mode 100644
index 00000000..43b015bb
--- /dev/null
+++ b/benchmarks/generation/gold/company.gold.toon
@@ -0,0 +1,12 @@
+id: 1
+name: Acme
+departments[2]:
+  - code: ENG
+    name: Engineering
+    employees[2]{id,name,title}:
+      1,Alice,engineer
+      2,Bob,manager
+  - code: OPS
+    name: Operations
+    employees[1]{id,name,title}:
+      3,Eve,analyst
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.json b/benchmarks/generation/gold/invoice.gold.json
new file mode 100644
index 00000000..6b3ce1d0
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.json
@@ -0,0 +1 @@
+{"number":"INV-2025-001","currency":"USD","customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"unit_price":9.99,"line_total":19.98},{"sku":"B2","qty":1,"unit_price":14.5,"line_total":14.5}],"totals":{"subtotal":34.48,"tax":6.9,"grand_total":41.38},"notes":"Thank you for your business."}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/invoice.gold.toon b/benchmarks/generation/gold/invoice.gold.toon
new file mode 100644
index 00000000..d1e93c91
--- /dev/null
+++ b/benchmarks/generation/gold/invoice.gold.toon
@@ -0,0 +1,13 @@
+number: INV-2025-001
+currency: USD
+customer:
+  id: 9
+  name: Ada
+items[2]{sku,qty,unit_price,line_total}:
+  A1,2,9.99,19.98
+  B2,1,14.5,14.5
+totals:
+  subtotal: 34.48
+  tax: 6.9
+  grand_total: 41.38
+notes: Thank you for your business.
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.json b/benchmarks/generation/gold/order.gold.json
new file mode 100644
index 00000000..c3c2a604
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.json
@@ -0,0 +1 @@
+{"id":101,"customer":{"id":9,"name":"Ada"},"items":[{"sku":"A1","qty":2,"price":9.99},{"sku":"B2","qty":1,"price":14.5}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/order.gold.toon b/benchmarks/generation/gold/order.gold.toon
new file mode 100644
index 00000000..ebc448fc
--- /dev/null
+++ b/benchmarks/generation/gold/order.gold.toon
@@ -0,0 +1,7 @@
+id: 101
+customer:
+  id: 9
+  name: Ada
+items[2]{sku,qty,price}:
+  A1,2,9.99
+  B2,1,14.5
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.json b/benchmarks/generation/gold/users.gold.json
new file mode 100644
index 00000000..ac535be2
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.json
@@ -0,0 +1 @@
+{"users":[{"id":1,"name":"Alice","role":"admin"},{"id":2,"name":"Bob","role":"staff"},{"id":3,"name":"Eve","role":"guest"}]}
\ No newline at end of file
diff --git a/benchmarks/generation/gold/users.gold.toon b/benchmarks/generation/gold/users.gold.toon
new file mode 100644
index 00000000..f19feaad
--- /dev/null
+++ b/benchmarks/generation/gold/users.gold.toon
@@ -0,0 +1,4 @@
+users[3]{id,name,role}:
+  1,Alice,admin
+  2,Bob,staff
+  3,Eve,guest
\ No newline at end of file
diff --git a/benchmarks/generation/requirements.txt b/benchmarks/generation/requirements.txt
new file mode 100644
index 00000000..4505df10
--- /dev/null
+++ b/benchmarks/generation/requirements.txt
@@ -0,0 +1,3 @@
+pydantic>=2.5.0
+openai>=1.0.0
+pandas>=2.0.0