[RFC]: Add \uXXXX Unicode escape sequences for control characters in strings #39

### Type of Change

- [ ] Breaking change (incompatible with current spec)
- [x] Backward-compatible addition
- [ ] Clarification or editorial improvement
- [x] New optional feature
- [ ] Changes to conformance requirements

### Summary

Add `\uXXXX` (a four-hex-digit Unicode escape) as a sixth escape form in quoted strings and keys, so that TOON can represent control characters that are currently unrepresentable.

### Motivation

## Problem

The TOON spec defines 5 escape sequences: `\\`, `\"`, `\n`, `\r`, `\t`. Control characters in the range U+0000–U+001F (excluding U+000A, U+000D, U+0009) cannot appear literally in strings and have no escape syntax — they are **unrepresentable in TOON**.

This means any Go/TypeScript/Python value containing these characters (e.g. U+0004 EOT, U+0000 NUL, U+001B ESC) cannot be serialized to TOON at all. TOON encoders must reject the input entirely.

## Use Case

Real-world strings from databases, search indices, APIs, and binary-adjacent text processing commonly contain control characters as sentinel values or delimiters. A format that claims lossless JSON data model serialization should be able to represent any valid Unicode string that JSON can.

## Benefits

- Closes the representability gap between TOON and JSON
- Enables lossless round-tripping of arbitrary string data
- Follows an escape syntax already familiar to developers from JSON, JavaScript, Java, C#, Python, etc.

### Detailed Design

## Proposed Syntax

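In quoted strings and quoted keys, add support for `\uXXXX`, where `XXXX` is exactly 4 hexadecimal digits (0–9, a–f, A–F) naming a Unicode code point in the range U+0000–U+FFFF (the Basic Multilingual Plane). How the surrogate range is treated is left open under Unresolved Questions.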

```abnf
escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
```

## Encoding Rules

- Encoders MUST use `\uXXXX` for control characters < U+0020 that are not `\n`, `\r`, or `\t` (since these have no other representation in TOON).
- Encoders SHOULD prefer the shorthand escapes `\n`, `\r`, `\t` for U+000A, U+000D, U+0009 respectively.
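- Encoders MAY use `\uXXXX` for any code point that fits in 4 hex digits (U+0000–U+FFFF), but SHOULD prefer literal UTF-8 for printable characters to preserve human readability (a core TOON design goal). Code points above U+FFFF cannot be written as a single `\uXXXX` and are emitted as literal UTF-8.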
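
The encoding rules above can be sketched as a small string-escaping routine. This is a minimal illustration rather than the reference encoder: the function name `escape_toon_string` and the `SHORT_ESCAPES` table are hypothetical, and the uppercase hex output follows the recommendation under Unresolved Questions, which has not been settled.

```python
# Escapes that already exist in TOON; these take priority over \uXXXX.
SHORT_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_toon_string(value: str) -> str:
    """Return the body of a quoted TOON string (without the surrounding quotes)."""
    out = []
    for ch in value:
        if ch in SHORT_ESCAPES:
            # Existing shorthand escapes are preferred where they apply.
            out.append(SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining C0 control characters have no other representation.
            # Uppercase hex is an assumption, not a settled requirement.
            out.append(f"\\u{ord(ch):04X}")
        else:
            # Everything else stays literal for readability.
            out.append(ch)
    return "".join(out)


print(escape_toon_string("line1\nline2\x04end"))  # line1\nline2\u0004end
```

Checking the shorthand table first is what makes `\n`, `\r`, and `\t` win over their `\u000A`, `\u000D`, and `\u0009` equivalents.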

## Decoding Rules

- Decoders MUST parse `\uXXXX` in quoted strings and quoted keys, converting the 4 hex digits to the corresponding Unicode code point.
- Decoders MUST reject `\u` followed by fewer than 4 hex digits as an invalid escape sequence.
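
A decoder following these rules might look like the sketch below. The function name `unescape_toon_string` is hypothetical, and the input is assumed to be the text between the quotes, already isolated by the surrounding parser. The sketch does not take a position on surrogate code points (U+D800–U+DFFF), which remain an open question in this proposal.

```python
SIMPLE_ESCAPES = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_toon_string(body: str) -> str:
    """Decode the body of a quoted TOON string (the text between the quotes)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        kind = body[i + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind == "u":
            digits = body[i + 2 : i + 6]
            # Check digits explicitly: int(..., 16) alone would also accept
            # inputs such as "+01f" or "0_1f", which are not 4 hex digits.
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"invalid unicode escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Any other escape stays an error, as in the current spec.
            raise ValueError(f"invalid escape sequence \\{kind} at offset {i}")
    return "".join(out)


print(repr(unescape_toon_string("a\\u0004b")))  # 'a\x04b'
```

Accepting both cases of hex digit here matches the recommendation under Unresolved Questions; a stricter decoder would only need a smaller `HEX_DIGITS` set.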

## Grammar Changes (ABNF)

Current:
```abnf
escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" )
```

Proposed:
```abnf
escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
```

### Examples

### Encoding a string with U+0004 (EOT)

**Input (JSON):**
```json
{"marker": "hello\u0004world"}
```

**Output (TOON):**
```toon
marker: "hello\u0004world"
```

### Round-trip of various control characters

**Input (JSON):**
```json
{"controls": "\u0000\u0001\u001f"}
```

**Output (TOON):**
```toon
controls: "\u0000\u0001\u001F"
```

### Mixed with existing escapes

**Input (JSON):**
```json
{"mixed": "line1\nline2\u0004end"}
```

**Output (TOON):**
```toon
mixed: "line1\nline2\u0004end"
```

### Drawbacks

- **Slightly increased parser complexity**: Decoders must handle a new escape form, including reading exactly 4 hex digits and converting them to a rune/code point. The logic is simple, but it adds a new error path for malformed or truncated escapes.
- **Token cost**: `\uXXXX` is 6 characters to represent 1 character. However, this only applies to control characters that are already rare in typical data, so the impact on TOON's token efficiency goals is negligible.
- **Not needed for most documents**: The vast majority of TOON documents contain no control characters beyond `\n`, `\r`, `\t`. This feature primarily unblocks edge cases rather than improving the common path.

### Alternatives Considered

### Alternative 1: Do nothing — leave control characters unrepresentable

Callers must sanitize strings before TOON encoding (strip or replace control chars). Rejected because this makes TOON lossy — decoded output won't match the original input, breaking the "lossless serialization of the JSON data model" guarantee.

### Alternative 2: Allow raw control characters in quoted strings

Let encoders emit literal bytes for control characters inside quotes. Rejected because raw control characters cause problems with text editors, terminals, copy-paste, and other tooling. JSON explicitly forbids this for good reason.

### Alternative 3: Use a different escape syntax (e.g. `\xHH`)

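`\xHH` is 2 hex digits (byte-level). Rejected because its meaning is ambiguous in a UTF-8 format: `\xE9` could denote the single byte 0xE9 (which is not valid UTF-8 on its own) or the code point U+00E9, and languages disagree on which. `\uXXXX` always names a code point and is the most widely recognized Unicode escape syntax.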
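
The ambiguity can be seen by comparing the two plausible readings of a hypothetical `\xE9` escape. This short illustration uses only the Python standard library and is not part of the proposal.

```python
# Byte-level reading: the single byte 0xE9.
as_byte = bytes([0xE9])

# Code-point reading: U+00E9 (é), which UTF-8 encodes as two bytes.
as_code_point = "\u00e9"
print(as_code_point.encode("utf-8"))  # b'\xc3\xa9'

# The lone byte is not a complete UTF-8 sequence, so it cannot be decoded.
try:
    as_byte.decode("utf-8")
    byte_is_valid_utf8 = True
except UnicodeDecodeError:
    byte_is_valid_utf8 = False
print(byte_is_valid_utf8)  # False
```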

### Impact on Implementations

- **Reference implementation (TypeScript):** Requires adding `\uXXXX` emit in the encoder for control chars < 0x20, and `\uXXXX` parsing in the decoder's escape handling.
- **Community implementations (Go, Python, Java, Swift, Julia, Ruby):** Same scope — encoder and decoder string handling. Typically 10-30 lines of code per implementation.
- **Backward compatibility:** Existing valid TOON documents remain valid. Documents using `\uXXXX` will only be parseable by updated decoders. Old decoders will correctly reject `\u` as an unknown escape (per current spec: "decoders MUST reject any other escape sequence").
- **Versioning:** This is a backward-compatible addition (new documents may use it; old documents are unaffected). Appropriate for a MINOR version bump per VERSIONING.md.

### Migration Strategy

## For Implementers
1. Update decoder to handle `case 'u':` in escape sequence parsing — read 4 hex digits, convert to code point
2. Update encoder to emit `\uXXXX` for runes < U+0020 that aren't `\n`/`\r`/`\t`, instead of rejecting them
3. Add round-trip tests for control characters

## For Users
No migration needed. Existing TOON documents are unaffected. New documents containing control characters will simply work instead of failing.

### Test Cases

```json
[
  {
    "name": "encode string with U+0004 EOT",
    "input": {"val": "a\u0004b"},
    "expected": "val: \"a\\u0004b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+0000 NUL",
    "input": {"val": "a\u0000b"},
    "expected": "val: \"a\\u0000b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+001F (last control char)",
    "input": {"val": "a\u001fb"},
    "expected": "val: \"a\\u001Fb\"",
    "specSection": "7.1"
  },
  {
    "name": "prefer \\n over \\u000A",
    "input": {"val": "a\nb"},
    "expected": "val: \"a\\nb\"",
    "specSection": "7.1",
    "note": "Encoders SHOULD prefer shorthand escapes"
  },
  {
    "name": "decode \\u0004 in quoted string",
    "category": "decode",
    "input": "val: \"a\\u0004b\"",
    "expected": {"val": "a\u0004b"},
    "specSection": "7.1"
  },
  {
    "name": "reject truncated unicode escape",
    "category": "decode",
    "input": "val: \"a\\u00b\"",
    "shouldError": true,
    "specSection": "7.1",
    "note": "\\u must be followed by exactly 4 hex digits"
  }
]
```

### Affected Specification Sections

- **Section 7.1** (Escape Sequences): Add `\uXXXX` to the list of valid escapes, update ABNF grammar
- **Section 1** (Conventions): No change needed; the new encoding and decoding rules use the RFC 2119 keywords already defined there
- **Appendix / ABNF Grammar**: Update `escaped-char` production

### Unresolved Questions

- **Surrogate pairs:** Should TOON support `\uD800`–`\uDFFF` surrogate pair encoding for characters above U+FFFF (as JSON does via two consecutive `\uXXXX` escapes)? Or should TOON keep it simple and only support BMP code points, requiring literal UTF-8 for supplementary characters? The simpler option (no surrogate pairs) seems preferable since TOON is UTF-8 native and supplementary characters can appear literally.
- **Case sensitivity:** Should `\u001f` and `\u001F` both be accepted? Recommendation: yes, decoders should accept both (case-insensitive hex digits), while encoders SHOULD emit uppercase for consistency.
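
For context on both questions, the snippet below shows how one existing JSON producer (Python's standard `json` module) behaves. It is included only as a point of comparison; nothing here prescribes TOON behavior.

```python
import json

emoji = "\U0001F600"  # U+1F600, a code point above U+FFFF

# JSON's ASCII-only form needs two \uXXXX escapes (a surrogate pair).
print(json.dumps(emoji))  # "\ud83d\ude00"

# When literal UTF-8 is allowed, the character needs no escape at all.
literal = json.dumps(emoji, ensure_ascii=False)
print(len(literal.encode("utf-8")))  # 6 (4 bytes for the character, 2 for quotes)

# This producer emits lowercase hex digits, so a decoder that accepted
# only uppercase would fail on its output.
print(json.dumps("\x1f"))  # "\u001f"
```

The last line is one practical argument for case-insensitive decoding: upstream JSON tooling does not agree on hex digit case.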

### Additional Context

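- JSON (RFC 8259) defines `\uXXXX` with the same semantics proposed here for BMP code points. JSON additionally represents characters above U+FFFF as two consecutive `\uXXXX` escapes (a surrogate pair), which this proposal leaves open under Unresolved Questions. `\uXXXX` is the most widely understood Unicode escape syntax across programming languages.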
- The TOON CONTRIBUTING.md already uses "Adding `\u0000` escape sequences" as a canonical example of an RFC-worthy change, suggesting this gap is already on the maintainers' radar.
- This proposal intentionally keeps scope narrow: only `\uXXXX` (4 hex digits, BMP). Extended forms like `\U00XXXXXX` (8 hex digits) or `\u{XXXXX}` (variable-length) are out of scope and can be considered separately if needed.

### Checklist

- [x] I have read the RFC process in CONTRIBUTING.md
- [x] I have searched for similar proposals
- [x] I have considered backward compatibility
- [x] I understand this may require community discussion before acceptance
