Description
Type of Change
- Breaking change (incompatible with current spec)
- Backward-compatible addition
- Clarification or editorial improvement
- New optional feature
- Changes to conformance requirements
Summary
Description:
While TOON is already excellent at reducing token overhead by eliminating key repetition, there is still significant waste when large string values or long numbers (like IDs, hashes, or company names) repeat across multiple rows in a dataset.
I propose a native way to define Value Aliases at the start of the payload to achieve maximum compression.
The Concept (Value Aliasing):
• Define a reference at the top using the $ prefix (e.g., $1: Very Long Repetitive String)
• Use the alias ($1) throughout the document instead of the full value.
Comparative Example
Current TOON (Standard):
provider: Global Logistics and International Foods Services S.A.
items:
[product, carrier]
- Rice, Global Logistics and International Foods Services S.A.
- Beans, Global Logistics and International Foods Services S.A.
- Corn, Global Logistics and International Foods Services S.A.
Proposed TOON (With Aliasing):
$1: Global Logistics and International Foods Services S.A.
provider: $1
items:
[product, carrier]
- Rice, $1
- Beans, $1
- Corn, $1
Why this is a game-changer for LLMs:
1. Significant Cost Reduction
In a list of 50+ items, repeating a 50-character string adds thousands of redundant characters (roughly several hundred tokens, depending on the tokenizer).
Aliasing reduces that to one definition plus a short reference per row.
2. KV Cache Efficiency
Modern LLMs (GPT-4, Claude) generally handle symbolic references well.
Defining a value once and referencing it later shortens the prompt, so fewer tokens have to be processed and cached.
3. Large Number Compression
Perfect for blockchain hashes, UUIDs, or long transaction IDs that appear multiple times in a single context.
4. Implementation Simplicity
Requires only a very simple regex-based parser on the client side to “hydrate” the data back to its original form.
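As a rough check on the cost claim in point 1, the sketch below estimates how many characters one alias saves. The helper name `alias_savings` and the 2-character alias length are assumptions for illustration. The result is in characters, not tokens, since the actual token count depends on the model's tokenizer.

```python
def alias_savings(value_len: int, occurrences: int, alias_len: int = 2) -> int:
    """Characters saved by aliasing one value (hypothetical helper).

    Without aliasing, the value is written `occurrences` times.
    With aliasing, there is one definition line ("$1: <value>" plus a newline)
    and one short alias per use.
    """
    plain = value_len * occurrences
    definition = alias_len + len(": ") + value_len + 1  # "$1: <value>\n"
    aliased = definition + alias_len * occurrences
    return plain - aliased


# A 50-character value repeated in 50 rows.
print(alias_savings(50, 50))  # 2345

# A value that appears only once costs more with an alias than without.
print(alias_savings(50, 1))   # -7
```

The second call shows why a generator should only alias values that actually repeat.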
Suggested Syntax
Use $n: for definitions (where n is an ID or a short mnemonic)
Use $n as a value placeholder
Motivation
LLM (Large Language Model) costs scale with the number of tokens in the input and output. While TOON already optimizes data structures by removing redundant JSON keys, it does not yet address the redundancy of repetitive long string values (such as company names, addresses, or UUIDs) within the data itself.
The main problems this proposal solves are:
Token Bloat in Large Datasets: When a specific value (e.g., a "Service Provider" name) repeats across 50+ rows, we pay for those tokens 50 times. With a single reference like $1, each repetition costs only a short reference.
Context Window Efficiency: Long repetitive strings take up valuable space in the model's context window. Aliasing lets us pack more actual information into a single request.
LLM Pattern Recognition: Models like GPT-4 and Claude are good at maintaining symbolic associations. Defining a variable once at the top ($1: Value) and referencing it later is expected to be reliable, and it reduces the risk of the model truncating long strings in the middle of a list.
This "Token-Efficient Aliasing" turns TOON into a more compact transport format for high-volume AI agents and enterprise-level automation.
Detailed Design
The proposal introduces a Reference Header section at the very top of the TOON payload, followed by the data body.
1. Syntax for Definitions:
Variable definitions must start with a $ followed by an identifier (numeric or mnemonic) and a colon separator.
Format: $ID: Long String Value
Example: $1: Global Logistics and International Foods Services S.A.
2. Syntax for Referencing:
Throughout the TOON structure (both in simple fields and inside table rows), the $ID acts as a pointer to the defined value.
Example in fields: provider: $1
Example in rows: - ItemName, $1, $2
3. Parsing Logic (The "De-aliasing" Process):
The parser should follow a two-step approach:
Pre-processing: Identify and store all lines starting with $ in a temporary dictionary/map.
Hydration: Before converting the TOON structure to a final JSON/Object, globally replace all occurrences of $ID with their corresponding mapped values.
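The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `hydrate` is hypothetical, definitions are assumed to appear only as a contiguous block at the top of the payload, and no escaping rule for literal `$` values is assumed. Note that a plain string replace of `$1` would also corrupt `$10`, so the sketch matches the whole identifier with a regex.

```python
import re

DEF_RE = re.compile(r"^\$(\w+):\s*(.*)$")  # definition line, e.g. "$1: Some value"
REF_RE = re.compile(r"\$(\w+)")            # reference, e.g. "$1" inside a field or row


def hydrate(payload: str) -> str:
    """Resolve $ID aliases in a TOON-like payload (hypothetical helper)."""
    aliases: dict[str, str] = {}
    body: list[str] = []
    in_header = True

    # Step 1 (pre-processing): collect definitions from the top of the payload.
    for line in payload.splitlines():
        match = DEF_RE.match(line) if in_header else None
        if match:
            aliases[match.group(1)] = match.group(2)
        else:
            in_header = False  # first non-definition line ends the header
            body.append(line)

    # Step 2 (hydration): replace each reference; unknown IDs are left as-is.
    def resolve(match: re.Match) -> str:
        return aliases.get(match.group(1), match.group(0))

    return "\n".join(REF_RE.sub(resolve, line) for line in body)


print(hydrate("$1: Acme Corp\n$10: Other\nprovider: $1\n- Rice, $1, $10"))
```

Leaving unknown IDs untouched means a literal such as `$5` survives when no alias `5` is defined. Whether undefined references should instead be an error is a design decision the spec would need to make.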
4. Scope and Rules:
Global Scope: Variables defined at the top apply to the entire payload.
Type Neutrality: While primarily intended for long strings, this can also be used for large numbers or repetitive complex IDs (like blockchain hashes) to ensure consistency and save tokens.
Examples
Below is a comparison between the current TOON specification and the proposed Value Aliasing model.
1. Current TOON Specification (Redundant Values):
In this example, the long company name and the status are repeated multiple times, wasting tokens.
order_id: 5520
default_warehouse: Logística Global de Alimentos do Brasil S.A.
status_message: Product successfully dispatched to destination
items:
[prod, warehouse, status]
- Monitor Gamer, Logística Global de Alimentos do Brasil S.A., Product successfully dispatched to destination
- Mechanical Keyboard, Logística Global de Alimentos do Brasil S.A., Product successfully dispatched to destination
- Gaming Mouse, Logística Global de Alimentos do Brasil S.A., Product successfully dispatched to destination
2. Proposed TOON with Value Aliasing (Optimized):
The same data, but significantly more compact and token-efficient.
$1: Logística Global de Alimentos do Brasil S.A.
$2: Product successfully dispatched to destination
order_id: 5520
default_warehouse: $1
status_message: $2
items:
[prod, warehouse, status]
- Monitor Gamer, $1, $2
- Mechanical Keyboard, $1, $2
- Gaming Mouse, $1, $2
Drawbacks
While the benefits in token savings are significant, there are a few trade-offs to consider:
Client-Side Processing: This introduces a small overhead for the application layer. The client (or server) must implement a "hydration" step to replace aliases with their actual values before saving data to a database.
Readability for Humans: While LLMs generally handle symbolic references well, a raw TOON file with many $1, $2, $3 variables becomes slightly harder for a human to read at a glance compared to the full-text version.
Parsing Complexity: The parser needs to be slightly more robust to handle cases where a user might accidentally define a variable but not use it, or vice versa. However, this can be easily mitigated with simple Regex or a basic dictionary map.
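The defined-but-unused and used-but-undefined cases mentioned under Parsing Complexity can be detected with two sets. The sketch below is one possible check, assuming any line that starts with `$ID:` is a definition and every other `$ID` is a reference. The name `check_aliases` is hypothetical.

```python
import re


def check_aliases(payload: str) -> tuple[set[str], set[str]]:
    """Return (unused, undefined) alias IDs for a payload (hypothetical helper)."""
    defined: set[str] = set()
    used: set[str] = set()
    for line in payload.splitlines():
        match = re.match(r"^\$(\w+):", line)
        if match:
            # Definition line: record the ID and do not scan its value.
            defined.add(match.group(1))
            continue
        # Any other line: every $ID token counts as a reference.
        used.update(re.findall(r"\$(\w+)", line))
    return defined - used, used - defined


unused, undefined = check_aliases("$1: A\n$2: B\nx: $1\ny: $3")
print(unused, undefined)  # {'2'} {'3'}
```

A parser could treat unused aliases as a warning and undefined references as an error, though the proposal does not specify either behavior.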
Alternatives Considered
Standard JSON/YAML: These formats are natively supported but are extremely token-heavy due to repeated keys and structural syntax (brackets, quotes, indentation). They do not offer a built-in way to alias values without increasing schema complexity.
Schema-only TOON (Current): The current TOON spec handles key repetition by defining a header [key1, key2]. However, it lacks a mechanism for Value Aliasing. Without this proposal, long strings must be repeated in every row, leading to high costs in large datasets.
Compression Algorithms (Gzip/Zlib): While highly effective for storage, LLMs cannot "read" binary compressed data. We need a "semantic compression" that is human-readable and LLM-understandable, which is exactly what the $ID aliasing provides.
Positional References: Using only numbers (e.g., - Item, 1, 2) without the $ prefix. We considered this, but the $ prefix is safer as it clearly distinguishes a reference from a literal number, avoiding parsing errors.
Impact on Implementations
The introduction of Value Aliasing has a low-to-moderate impact on existing TOON parsers and generators, as it follows a non-breaking incremental approach.
Parser Updates: Current parsers will need a pre-processing step. This involves a single regex pass or a line-by-line scan to identify $ID: definitions at the beginning of the payload.
State Management: The parser must maintain a simple key-value map (dictionary) during the lifecycle of the "hydration" process.
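Backward Compatibility: Existing TOON files that do not use the $ symbol remain fully compatible. The aliasing logic only triggers when a $ prefix is detected. Files that already contain literal values beginning with $ (for example, currency amounts) may be misread as references unless an escaping rule is defined.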
Generator Logic: Libraries that generate TOON from JSON/Objects can be optimized to automatically detect repetitive strings and convert them into aliases to save user tokens.
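The generator behavior described above could look roughly like this. It is a sketch under stated assumptions: the names `build_aliases` and `encode_rows` and the thresholds `min_len` and `min_count` are hypothetical, and only the alias header and table rows are emitted, without the field header line.

```python
from collections import Counter


def build_aliases(values: list[str], min_len: int = 10, min_count: int = 2) -> dict[str, str]:
    """Map repeated long values to alias names like "$1" (hypothetical helper)."""
    aliases: dict[str, str] = {}
    # most_common() yields values from most to least frequent.
    for value, count in Counter(values).most_common():
        if count >= min_count and len(value) >= min_len:
            aliases[value] = f"${len(aliases) + 1}"
    return aliases


def encode_rows(rows: list[list[str]]) -> str:
    """Emit alias definitions followed by rows that use them."""
    aliases = build_aliases([cell for row in rows for cell in row])
    header = [f"{alias}: {value}" for value, alias in aliases.items()]
    body = ["- " + ", ".join(aliases.get(cell, cell) for cell in row) for row in rows]
    return "\n".join(header + body)


rows = [
    ["Rice", "Global Logistics S.A."],
    ["Beans", "Global Logistics S.A."],
    ["Corn", "Global Logistics S.A."],
]
print(encode_rows(rows))
```

Short or non-repeating values are left inline, since an alias only pays off when the value is long enough and appears more than once.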
Migration Strategy
No response
Test Cases
Affected Specification Sections
Grammar and Syntax: A new rule must be added to support the definition of aliases using the $ prefix followed by a colon (e.g., $ID: value).
Data Types and References: Introduction of a "Reference Type" or "Pointer" to differentiate between literal strings and aliased values during the parsing process.
Header Structure: Modification to the top-level structure to allow a "Metadata/Reference Header" section before the main object or table body.
Parsing Algorithm: Addition of a mandatory "Hydration" step in the reference implementation to ensure aliases are resolved before the data is consumed by applications.
Unresolved Questions
No response
Additional Context
No response
Checklist
- I have read the RFC process in CONTRIBUTING.md
- I have searched for similar proposals
- I have considered backward compatibility
- I understand this may require community discussion before acceptance