[RFC]: Add \uXXXX unicode escape sequences for control characters in strings #39

@cafreeman

Type of Change

  • Breaking change (incompatible with current spec)
  • Backward-compatible addition
  • Clarification or editorial improvement
  • New optional feature
  • Changes to conformance requirements

Summary

Add \uXXXX (a Unicode escape with exactly 4 hex digits) as a sixth escape form in quoted strings and keys, enabling TOON to represent control characters that are currently unrepresentable.

Motivation

Problem

The TOON spec defines 5 escape sequences: \\, \", \n, \r, \t. Control characters in the range U+0000–U+001F (excluding U+000A, U+000D, U+0009) cannot appear literally in strings and have no escape syntax — they are unrepresentable in TOON.

This means any Go/TypeScript/Python value containing these characters (e.g. U+0004 EOT, U+0000 NUL, U+001B ESC) cannot be serialized to TOON at all. TOON encoders must reject the input entirely.

Use Case

Real-world strings from databases, search indices, APIs, and binary-adjacent text processing commonly contain control characters as sentinel values or delimiters. A format that claims lossless JSON data model serialization should be able to represent any valid Unicode string that JSON can.

Benefits

  • Closes the representability gap between TOON and JSON
  • Enables lossless round-tripping of arbitrary string data
  • Follows an escape syntax already familiar to developers from JSON, JavaScript, Java, C#, Python, etc.

Detailed Design

Proposed Syntax

In quoted strings and quoted keys, add support for \uXXXX, where XXXX is exactly 4 hexadecimal digits (0–9, a–f, A–F) and the escape represents the Unicode code point U+XXXX.

escaped-char  = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG

Encoding Rules

  • Encoders MUST use \uXXXX for control characters below U+0020 other than U+000A, U+000D, and U+0009, since those characters have no other representation in TOON.
  • Encoders SHOULD prefer the shorthand escapes \n, \r, \t for U+000A, U+000D, U+0009 respectively.
  • Encoders MAY use \uXXXX for any Unicode code point, but SHOULD prefer literal UTF-8 for printable characters to preserve human readability (a core TOON design goal).
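
The encoding rules above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name escape_toon_string and the SHORTHAND table are hypothetical, and the sketch covers only the body of a string that has already been chosen for quoting. It emits uppercase hex digits, matching the recommendation under Unresolved Questions.

```python
# Shorthand escapes from the current spec; these take priority over \uXXXX.
SHORTHAND = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_toon_string(value: str) -> str:
    """Escape the body of a quoted TOON string (hypothetical helper)."""
    out = []
    for ch in value:
        if ch in SHORTHAND:
            # \\, \", \n, \r, \t keep their existing short forms.
            out.append(SHORTHAND[ch])
        elif ord(ch) < 0x20:
            # Remaining C0 control characters get the proposed \uXXXX form.
            out.append(f"\\u{ord(ch):04X}")
        else:
            # Everything else stays as literal UTF-8 for readability.
            out.append(ch)
    return "".join(out)


print(escape_toon_string("line1\nline2\x04end"))  # line1\nline2\u0004end
```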

Decoding Rules

  • Decoders MUST parse \uXXXX in quoted strings and quoted keys, converting the 4 hex digits to the corresponding Unicode code point.
  • Decoders MUST reject \u followed by fewer than 4 hex digits as an invalid escape sequence.
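
A decoder following these rules might look like the sketch below. The name unescape_toon_string is hypothetical, errors are reported as ValueError by assumption, and the input is assumed to be the string body with the surrounding quotes already removed. Surrogate code units are passed through unchanged here, because their handling is still listed under Unresolved Questions.

```python
import string

# Single-character escapes from the current spec.
SIMPLE = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}
HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def unescape_toon_string(body: str) -> str:
    """Resolve escapes in the body of a quoted TOON string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        esc = body[i + 1]
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
            i += 2
        elif esc == "u":
            digits = body[i + 2 : i + 6]
            # Exactly 4 hex digits are required; anything shorter is rejected.
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"invalid unicode escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Per the current spec, any other escape sequence is an error.
            raise ValueError(f"unknown escape \\{esc} at offset {i}")
    return "".join(out)


print(repr(unescape_toon_string("a\\u0004b")))  # 'a\x04b'
```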

Grammar Changes (ABNF)

Current:

escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" )

Proposed:

escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG

Examples

Encoding a string with U+0004 (EOT)

Input (JSON):

{"marker": "hello\u0004world"}

Output (TOON):

marker: "hello\u0004world"

Round-trip of various control characters

Input (JSON):

{"controls": "\u0000\u0001\u001f"}

Output (TOON):

controls: "\u0000\u0001\u001F"

Mixed with existing escapes

Input (JSON):

{"mixed": "line1\nline2\u0004end"}

Output (TOON):

mixed: "line1\nline2\u0004end"

Drawbacks

  • Slightly increased parser complexity: Decoders must handle a new escape form, including reading exactly 4 hex digits and converting them to a rune/code point. This is straightforward, but it is one more code path to implement and test.
  • Token cost: \uXXXX is 6 characters to represent 1 character. However, this only applies to control characters that are already rare in typical data, so the impact on TOON's token efficiency goals is negligible.
  • Not needed for most documents: The vast majority of TOON documents contain no control characters beyond \n, \r, \t. This feature primarily unblocks edge cases rather than improving the common path.

Alternatives Considered

Alternative 1: Do nothing — leave control characters unrepresentable

Callers must sanitize strings before TOON encoding (strip or replace control chars). Rejected because this makes TOON lossy — decoded output won't match the original input, breaking the "lossless serialization of the JSON data model" guarantee.

Alternative 2: Allow raw control characters in quoted strings

Let encoders emit literal bytes for control characters inside quotes. Rejected because raw control characters cause problems with text editors, terminals, copy-paste, and other tooling. JSON explicitly forbids this for good reason.

Alternative 3: Use a different escape syntax (e.g. \xHH)

\xHH takes 2 hex digits and is byte-level. Rejected because it is ambiguous whether \xHH denotes a code point or one raw byte of a multi-byte UTF-8 sequence, and its meaning is not standardized across languages. \uXXXX is the most universally recognized Unicode escape syntax.
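
To make the ambiguity concrete, the sketch below reads the same two hex digits (E9, chosen only for illustration) in both ways. Read as a code point, the result is U+00E9, which UTF-8 stores as two bytes. Read as a raw byte, the result is not valid UTF-8 on its own.

```python
# Interpretation 1: \xE9 names the code point U+00E9.
as_code_point = chr(0xE9)
print(as_code_point.encode("utf-8"))  # b'\xc3\xa9'

# Interpretation 2: \xE9 names a single raw byte.
as_byte = bytes([0xE9])
try:
    as_byte.decode("utf-8")
    byte_is_valid_utf8 = True
except UnicodeDecodeError:
    # A lone 0xE9 is the start of a multi-byte sequence with no continuation.
    byte_is_valid_utf8 = False
print(byte_is_valid_utf8)  # False
```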

Impact on Implementations

  • Reference implementation (TypeScript): Requires adding \uXXXX emit in the encoder for control chars < 0x20, and \uXXXX parsing in the decoder's escape handling.
  • Community implementations (Go, Python, Java, Swift, Julia, Ruby): Same scope — encoder and decoder string handling. Typically 10-30 lines of code per implementation.
  • Backward compatibility: Existing valid TOON documents remain valid. Documents using \uXXXX will only be parseable by updated decoders. Old decoders will correctly reject \u as an unknown escape (per current spec: "decoders MUST reject any other escape sequence").
  • Versioning: This is a backward-compatible addition (new documents may use it; old documents are unaffected). Appropriate for a MINOR version bump per VERSIONING.md.

Migration Strategy

For Implementers

  1. Update the decoder's escape sequence parsing to handle 'u' (for example, a case 'u': branch): read exactly 4 hex digits and convert them to a code point
  2. Update encoder to emit \uXXXX for runes < U+0020 that aren't \n/\r/\t, instead of rejecting them
  3. Add round-trip tests for control characters

For Users

No migration needed. Existing TOON documents are unaffected. New documents containing control characters will simply work instead of failing.

Test Cases

[
  {
    "name": "encode string with U+0004 EOT",
    "input": {"val": "a\u0004b"},
    "expected": "val: \"a\\u0004b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+0000 NUL",
    "input": {"val": "a\u0000b"},
    "expected": "val: \"a\\u0000b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+001F (last control char)",
    "input": {"val": "a\u001fb"},
    "expected": "val: \"a\\u001Fb\"",
    "specSection": "7.1"
  },
  {
    "name": "prefer \\n over \\u000A",
    "input": {"val": "a\nb"},
    "expected": "val: \"a\\nb\"",
    "specSection": "7.1",
    "note": "Encoders SHOULD prefer shorthand escapes"
  },
  {
    "name": "decode \\u0004 in quoted string",
    "category": "decode",
    "input": "val: \"a\\u0004b\"",
    "expected": {"val": "a\u0004b"},
    "specSection": "7.1"
  },
  {
    "name": "reject truncated unicode escape",
    "category": "decode",
    "input": "val: \"a\\u00b\"",
    "shouldError": true,
    "specSection": "7.1",
    "note": "\\u must be followed by exactly 4 hex digits"
  }
]
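
One way an implementation could drive fixtures of this shape is sketched below. Everything here is an assumption for illustration: the run_cases name, the convention that a case without "category" is an encode case, and the use of ValueError to signal rejection. The fake_encode and fake_decode stand-ins exist only so the sketch runs on its own; a real suite would pass the implementation's encoder and decoder.

```python
import json


def run_cases(cases, encode, decode):
    """Run fixture cases and return the names of those that fail."""
    failures = []
    for case in cases:
        # Assumption: a missing "category" means the case exercises the encoder.
        fn = decode if case.get("category") == "decode" else encode
        try:
            result = fn(case["input"])
        except ValueError:
            if not case.get("shouldError"):
                failures.append(case["name"])
            continue
        if case.get("shouldError") or result != case["expected"]:
            failures.append(case["name"])
    return failures


# Two abbreviated fixtures in the same shape as the list above.
cases = json.loads(r'''[
  {"name": "encode EOT", "input": {"val": "a\u0004b"},
   "expected": "val: \"a\\u0004b\""},
  {"name": "reject truncated", "category": "decode",
   "input": "val: \"a\\u00b\"", "shouldError": true}
]''')


def fake_encode(value):
    # Hypothetical stand-in for a real TOON encoder.
    return 'val: "a\\u0004b"'


def fake_decode(text):
    # Hypothetical stand-in for a real TOON decoder.
    raise ValueError("truncated unicode escape")


print(run_cases(cases, fake_encode, fake_decode))  # []
```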

Affected Specification Sections

  • Section 7.1 (Escape Sequences): Add \uXXXX to the list of valid escapes, update ABNF grammar
  • Section 1 (Conventions): No change needed; the new rules use the same RFC 2119 keywords
  • Appendix / ABNF Grammar: Update escaped-char production

Unresolved Questions

  • Surrogate pairs: Should TOON support surrogate pair encoding for characters above U+FFFF, as JSON does via two consecutive \uXXXX escapes (a high surrogate in \uD800–\uDBFF followed by a low surrogate in \uDC00–\uDFFF)? Or should TOON keep it simple and only support BMP code points, requiring literal UTF-8 for supplementary characters? The simpler option (no surrogate pairs) seems preferable since TOON is UTF-8 native and supplementary characters can appear literally.
  • Case sensitivity: Should \u001f and \u001F both be accepted? Recommendation: yes, decoders should accept both (case-insensitive hex digits), while encoders SHOULD emit uppercase for consistency.
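
Both questions can be illustrated with a short sketch. The helper names classify and combine are hypothetical; the surrogate ranges and the combining formula are the standard UTF-16 ones that JSON relies on. The last line shows how Python's json module resolves a surrogate pair, which is the behavior TOON would either mirror or decline to support.

```python
import json


def classify(code_unit: int) -> str:
    """Label a 4-hex-digit value as a surrogate half or an ordinary BMP value."""
    if 0xD800 <= code_unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= code_unit <= 0xDFFF:
        return "low surrogate"
    return "BMP scalar"


def combine(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into one supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Hex digits parse to the same value regardless of case.
print(int("001f", 16) == int("001F", 16))  # True

# \uD83D\uDE00 is the JSON spelling of U+1F600.
print(classify(0xD83D), "+", classify(0xDE00))  # high surrogate + low surrogate
print(hex(combine(0xD83D, 0xDE00)))  # 0x1f600
print(json.loads('"\\ud83d\\ude00"') == chr(0x1F600))  # True
```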

Additional Context

  • JSON (RFC 8259) defines \uXXXX with the same semantics proposed here for BMP code points; JSON additionally encodes characters above U+FFFF as surrogate pairs. This is the most widely understood Unicode escape syntax across programming languages.
  • The TOON CONTRIBUTING.md already uses "Adding \u0000 escape sequences" as a canonical example of an RFC-worthy change, suggesting this gap is already on the maintainers' radar.
  • This proposal intentionally keeps scope narrow: only \uXXXX (4 hex digits, BMP). Extended forms like \U00XXXXXX (8 hex digits) or \u{XXXXX} (variable-length) are out of scope and can be considered separately if needed.

Checklist

  • I have read the RFC process in CONTRIBUTING.md
  • I have searched for similar proposals
  • I have considered backward compatibility
  • I understand this may require community discussion before acceptance
