Description
Type of Change
- Breaking change (incompatible with current spec)
- Backward-compatible addition
- Clarification or editorial improvement
- New optional feature
- Changes to conformance requirements
Summary
Add \uXXXX (4 hex digit unicode escape) as a 6th escape form in quoted strings and keys, enabling TOON to represent control characters that are currently unrepresentable.
Motivation
Problem
The TOON spec defines 5 escape sequences: \\, \", \n, \r, \t. Control characters in the range U+0000–U+001F (excluding U+000A, U+000D, U+0009) cannot appear literally in strings and have no escape syntax — they are unrepresentable in TOON.
This means any Go/TypeScript/Python value containing these characters (e.g. U+0004 EOT, U+0000 NUL, U+001B ESC) cannot be serialized to TOON at all. TOON encoders must reject the input entirely.
Use Case
Real-world strings from databases, search indices, APIs, and binary-adjacent text processing commonly contain control characters as sentinel values or delimiters. A format that claims lossless JSON data model serialization should be able to represent any valid Unicode string that JSON can.
Benefits
- Closes the representability gap between TOON and JSON
- Enables lossless round-tripping of arbitrary string data
- Follows an escape syntax already familiar to developers from JSON, JavaScript, Java, C#, Python, etc.
Detailed Design
Proposed Syntax
In quoted strings and quoted keys, add support for \uXXXX where XXXX is exactly 4 hexadecimal digits (0–9, a–f, A–F), representing a Unicode code point.
escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
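As a rough illustration of the grammar above, the escaped-char production can be mirrored as a regular expression. This is only a sketch in Python; the names ESCAPED_CHAR and is_valid_escape are hypothetical and do not come from any TOON implementation. It accepts both upper- and lowercase hex digits, as the proposed syntax describes.

```python
import re

# Mirrors the proposed productions:
#   escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
#   unicode-escape = "u" 4HEXDIG
ESCAPED_CHAR = re.compile(r'\\(?:[\\"nrt]|u[0-9a-fA-F]{4})')


def is_valid_escape(text: str) -> bool:
    """Return True if text is exactly one escape sequence under the proposed grammar."""
    return ESCAPED_CHAR.fullmatch(text) is not None
```

With this sketch, is_valid_escape("\\u0004") and is_valid_escape("\\n") hold, while "\\u00b" (too few digits) and "\\x41" (not a TOON escape) are rejected.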
Encoding Rules
- Encoders MUST use \uXXXX for control characters < U+0020 that are not \n, \r, or \t (since these have no other representation in TOON).
- Encoders SHOULD prefer the shorthand escapes \n, \r, \t for U+000A, U+000D, U+0009 respectively.
- Encoders MAY use \uXXXX for any Unicode code point, but SHOULD prefer literal UTF-8 for printable characters to preserve human readability (a core TOON design goal).
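The encoding rules above could be sketched roughly as follows. This is an illustrative Python helper, not the reference implementation: the name escape_toon_string and the SHORTHAND table are assumptions, and the uppercase hex output follows the recommendation under Unresolved Questions rather than a settled rule. The helper escapes only the string contents and does not add the surrounding quotes.

```python
# Characters that already have a dedicated escape in the current spec.
SHORTHAND = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_toon_string(value: str) -> str:
    """Escape the contents of a quoted TOON string (hypothetical helper)."""
    out = []
    for ch in value:
        if ch in SHORTHAND:
            # Prefer the shorthand escapes for \\, ", \n, \r, \t.
            out.append(SHORTHAND[ch])
        elif ord(ch) < 0x20:
            # Remaining control characters have no other representation.
            out.append(f"\\u{ord(ch):04X}")
        else:
            # Everything else stays as literal UTF-8 for readability.
            out.append(ch)
    return "".join(out)
```

Under these assumptions, the Python string "hello\x04world" becomes hello\u0004world, while printable non-ASCII characters such as é pass through unchanged.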
Decoding Rules
- Decoders MUST parse \uXXXX in quoted strings and quoted keys, converting the 4 hex digits to the corresponding Unicode code point.
- Decoders MUST reject \u followed by fewer than 4 hex digits as an invalid escape sequence.
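A decoder following these rules might look roughly like the sketch below. The function name unescape_toon_string and its error messages are hypothetical. It takes the string contents without the surrounding quotes, and it leaves surrogate handling open, since that is listed under Unresolved Questions. The hex digits are checked explicitly because Python's int() would also accept forms such as "+1ab" or "1_0a".

```python
SIMPLE = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_toon_string(body: str) -> str:
    """Decode the contents of a quoted TOON string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        code = body[i + 1]
        if code in SIMPLE:
            out.append(SIMPLE[code])
            i += 2
        elif code == "u":
            digits = body[i + 2 : i + 6]
            # Exactly 4 hex digits are required; anything else is invalid.
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"invalid unicode escape at offset {i}")
            # Note: values in D800-DFFF are passed through unexamined here.
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            raise ValueError(f"unknown escape \\{code} at offset {i}")
    return "".join(out)
```

In this sketch a truncated escape such as a\u00b raises ValueError, and both \u001f and \u001F decode to U+001F.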
Grammar Changes (ABNF)
Current:
escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" )
Proposed:
escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
Examples
Encoding a string with U+0004 (EOT)
Input (JSON):
{"marker": "hello\u0004world"}
Output (TOON):
marker: "hello\u0004world"
Round-trip of various control characters
Input (JSON):
{"controls": "\u0000\u0001\u001f"}
Output (TOON):
controls: "\u0000\u0001\u001F"
Mixed with existing escapes
Input (JSON):
{"mixed": "line1\nline2\u0004end"}
Output (TOON):
mixed: "line1\nline2\u0004end"
Drawbacks
- Slightly increased parser complexity: Decoders must handle a new escape form, including reading exactly 4 hex digits and converting them to a rune/code point. The logic is simple, but it adds a new error path (truncated or non-hex digits) that must be handled and tested.
- Token cost: \uXXXX is 6 characters to represent 1 character. However, this only applies to control characters that are already rare in typical data, so the impact on TOON's token efficiency goals is negligible.
- Not needed for most documents: The vast majority of TOON documents contain no control characters beyond \n, \r, \t. This feature primarily unblocks edge cases rather than improving the common path.
Alternatives Considered
Alternative 1: Do nothing — leave control characters unrepresentable
Callers must sanitize strings before TOON encoding (strip or replace control chars). Rejected because this makes TOON lossy — decoded output won't match the original input, breaking the "lossless serialization of the JSON data model" guarantee.
Alternative 2: Allow raw control characters in quoted strings
Let encoders emit literal bytes for control characters inside quotes. Rejected because raw control characters cause problems with text editors, terminals, copy-paste, and other tooling. JSON explicitly forbids this for good reason.
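The JSON behavior referred to here can be checked with a strict JSON parser. The sketch below uses Python's standard json module, which rejects raw control characters inside strings by default; the helper name json_accepts is hypothetical.

```python
import json


def json_accepts(text: str) -> bool:
    """Return True if Python's strict JSON parser accepts the document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw = '"a\x04b"'         # U+0004 appears literally between the quotes
escaped = '"a\\u0004b"'  # the same string written with a \uXXXX escape
```

Here json_accepts(raw) is False and json_accepts(escaped) is True, which is the same split this proposal would give TOON.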
Alternative 3: Use a different escape syntax (e.g. \xHH)
\xHH is 2 hex digits (byte-level). Rejected because it's ambiguous with multi-byte UTF-8 sequences and isn't widely standardized across languages. \uXXXX is the most universally recognized unicode escape syntax.
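To make the ambiguity concrete, the sketch below reads the same two hex digits in both ways: as a code point and as a single UTF-8 byte. The function names are hypothetical and exist only for this illustration.

```python
def as_code_point(hex_pair: str) -> str:
    """Interpret HH as the Unicode code point U+00HH."""
    return chr(int(hex_pair, 16))


def as_utf8_byte(hex_pair: str) -> str:
    """Interpret HH as one raw byte and decode it as UTF-8."""
    return bytes([int(hex_pair, 16)]).decode("utf-8")
```

For "41" both readings give "A". For "E9" the code point reading gives é, while the byte reading raises UnicodeDecodeError, because 0xE9 on its own is an incomplete UTF-8 sequence. A \uXXXX escape always names a code point, so it avoids this split.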
Impact on Implementations
- Reference implementation (TypeScript): Requires adding \uXXXX emit in the encoder for control chars < 0x20, and \uXXXX parsing in the decoder's escape handling.
- Community implementations (Go, Python, Java, Swift, Julia, Ruby): Same scope: encoder and decoder string handling. Typically 10-30 lines of code per implementation.
- Backward compatibility: Existing valid TOON documents remain valid. Documents using \uXXXX will only be parseable by updated decoders. Old decoders will correctly reject \u as an unknown escape (per current spec: "decoders MUST reject any other escape sequence").
- Versioning: This is a backward-compatible addition (new documents may use it; old documents are unaffected). Appropriate for a MINOR version bump per VERSIONING.md.
Migration Strategy
For Implementers
- Update decoder to handle case 'u': in escape sequence parsing: read 4 hex digits, convert to code point
- Update encoder to emit \uXXXX for runes < U+0020 that aren't \n/\r/\t, instead of rejecting them
- Add round-trip tests for control characters
For Users
No migration needed. Existing TOON documents are unaffected. New documents containing control characters will simply work instead of failing.
Test Cases
[
{
"name": "encode string with U+0004 EOT",
"input": {"val": "a\u0004b"},
"expected": "val: \"a\\u0004b\"",
"specSection": "7.1"
},
{
"name": "encode string with U+0000 NUL",
"input": {"val": "a\u0000b"},
"expected": "val: \"a\\u0000b\"",
"specSection": "7.1"
},
{
"name": "encode string with U+001F (last control char)",
"input": {"val": "a\u001fb"},
"expected": "val: \"a\\u001Fb\"",
"specSection": "7.1"
},
{
"name": "prefer \\n over \\u000A",
"input": {"val": "a\nb"},
"expected": "val: \"a\\nb\"",
"specSection": "7.1",
"note": "Encoders SHOULD prefer shorthand escapes"
},
{
"name": "decode \\u0004 in quoted string",
"category": "decode",
"input": "val: \"a\\u0004b\"",
"expected": {"val": "a\u0004b"},
"specSection": "7.1"
},
{
"name": "reject truncated unicode escape",
"category": "decode",
"input": "val: \"a\\u00b\"",
"shouldError": true,
"specSection": "7.1",
"note": "\\u must be followed by exactly 4 hex digits"
}
]
Affected Specification Sections
- Section 7.1 (Escape Sequences): Add \uXXXX to the list of valid escapes, update ABNF grammar
- Section 1 (Conventions): No change needed, but the new escape uses the same RFC 2119 keywords
- Appendix / ABNF Grammar: Update escaped-char production
Unresolved Questions
- Surrogate pairs: Should TOON support \uD800–\uDFFF surrogate pair encoding for characters above U+FFFF (as JSON does via two consecutive \uXXXX escapes)? Or should TOON keep it simple and only support BMP code points, requiring literal UTF-8 for supplementary characters? The simpler option (no surrogate pairs) seems preferable since TOON is UTF-8 native and supplementary characters can appear literally.
- Case sensitivity: Should \u001f and \u001F both be accepted? Recommendation: yes, decoders should accept both (case-insensitive hex digits), while encoders SHOULD emit uppercase for consistency.
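For context on the surrogate pair question, the sketch below shows how a character above U+FFFF appears in JSON today, using Python's standard json module. The helper to_surrogate_pair is hypothetical and only spells out the UTF-16 arithmetic; nothing here is proposed TOON behavior.

```python
import json


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


emoji = "\U0001F600"  # U+1F600, outside the BMP

# ASCII-only JSON output needs two consecutive \uXXXX escapes.
ascii_json = json.dumps(emoji)
# UTF-8 native output can carry the character literally.
literal_json = json.dumps(emoji, ensure_ascii=False)

# Hex digits are case-insensitive when parsed as numbers.
same_value = int("001f", 16) == int("001F", 16)
```

Here to_surrogate_pair(0x1F600) yields (0xD83D, 0xDE00), matching the two escapes in ascii_json, while literal_json holds the character itself between quotes. The literal form is the one the simpler, BMP-only option would rely on.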
Additional Context
- JSON (RFC 8259) defines \uXXXX with the same semantics proposed here. This is the most widely understood unicode escape syntax across programming languages.
- The TOON CONTRIBUTING.md already uses "Adding \u0000 escape sequences" as a canonical example of an RFC-worthy change, suggesting this gap is already on the maintainers' radar.
- This proposal intentionally keeps scope narrow: only \uXXXX (4 hex digits, BMP). Extended forms like \U00XXXXXX (8 hex digits) or \u{XXXXX} (variable-length) are out of scope and can be considered separately if needed.
Checklist
- I have read the RFC process in CONTRIBUTING.md
- I have searched for similar proposals
- I have considered backward compatibility
- I understand this may require community discussion before acceptance