A production-grade Python library and CLI that converts data between JSON, YAML, and TOON (Token-Oriented Object Notation) while fully conforming to TOON SPEC v2.0. Perfect for developers and data engineers who need efficient, token-optimized data serialization.
Current Version: 0.4.0 - YAML support added with an optional dependency model! See the What's New in v0.4.0 and Performance sections for details.
Full TOON SPEC v2.0 Compliance - This library implements all examples from the official TOON specification repository, ensuring complete compatibility with the standard.
YAML Support Release (November 2025) - This version adds comprehensive YAML support with a smart optional dependency model:
- YAML ↔ TOON conversion - Bidirectional conversion with streaming support
- Optional dependency model - Zero-dependency core, install YAML support only if needed: pip install toontools[yaml]
- CLI commands - New yaml-to-toon and toon-to-yaml commands
- High performance - YAML conversion with minimal overhead (2-9%)
- Design philosophy docs - New DESIGN_PHILOSOPHY.md explaining architectural decisions
- 22 new tests - Comprehensive YAML test coverage
Why Optional Dependencies?
- Lightweight core: Keep toontools dependency-free for JSON ↔ TOON workflows
- Install what you need: Only add PyYAML if you need YAML support
- Best of both worlds: Zero-dependency simplicity + extended format support
Previous Release - v0.3.0 (November 2025):
- Parser: 20-50% faster - Optimized literal parsing, comment removal, and table processing
- Serializer: Up to 70% faster - Streamlined type checking and container handling
- Utils: 10-15% faster - Improved number parsing and string operations
Backward Compatibility: 100% compatible with all previous versions - drop-in replacement, no code changes required!
See RELEASE_NOTES.md for complete details and CHANGELOG.md for the full changelog.
The toonpy library provides comprehensive JSON ↔ TOON conversion capabilities:
- Bidirectional conversion between JSON-compatible Python objects and TOON text
- Round-trip preservation - data integrity guaranteed
- Supports all JSON data types (objects, arrays, scalars)
- Handles nested structures of any depth
- LL(1) parser with indentation tracking
- Comment support - inline (#, //) and block (/* */) comments
- ABNF-backed grammar - fully compliant with TOON SPEC v2.0
- Error reporting with line and column numbers
- Smart detection of uniform-object arrays
- Automatic emission of efficient tabular mode (key[N]{fields}:)
- Token savings estimation using tiktoken (optional)
- Configurable modes: auto, compact, readable
- Command-line interface (toonpy) for file conversion
- Validation API for syntax checking
- Streaming helpers for large files
- Formatting tools for code style consistency
- YAML ↔ TOON conversion with optimized performance
- Streaming YAML to TOON for large files
- CLI commands for YAML file conversion
- Full Unicode support and proper type handling
pip install toontools
Or install a specific version:
pip install toontools==0.4.0
PyPI Package: toontools on PyPI | Latest: v0.4.0
# Clone the repository
git clone https://github.com/shinjidev/toonpy.git
cd toonpy
# Install the package
pip install .
# Or install with optional extras
pip install .[tests] # Include testing dependencies
pip install .[examples] # Include tiktoken for token counting
pip install .[yaml]      # Include PyYAML for YAML support
Requirements: Python 3.9+
Core Philosophy: toontools follows a "zero-dependency core" design. The base installation requires no external packages, ensuring fast installs and minimal footprint. Additional format support (YAML, etc.) is available as optional dependencies.
To enable YAML ↔ TOON conversion:
pip install toontools[yaml]
# or
pip install "PyYAML>=6.0"
Why optional? YAML support is opt-in to keep the core library lightweight (~60KB, 0 dependencies). Most users only need JSON ↔ TOON conversion. If you need YAML support, simply install the extra and all YAML functions become available automatically.
from toontools import to_toon, from_toon
# Convert Python object to TOON
data = {
"crew": [
{"id": 1, "name": "Luz", "role": "Light glyph"},
{"id": 2, "name": "Amity", "role": "Abomination strategist"}
],
"active": true,
"ship": {
"name": "Owl House",
"location": "Bonesborough"
}
}
toon_text = to_toon(data, mode="auto")
print(toon_text)
# Output:
# crew[2]{id,name,role}:
#   1,Luz,"Light glyph"
#   2,Amity,"Abomination strategist"
# active: true
# ship:
#   name: "Owl House"
#   location: Bonesborough
# Convert TOON back to Python object
round_trip = from_toon(toon_text)
assert round_trip == data  # Perfect round-trip!
from toontools import to_toon, from_toon
# JSON → TOON
data = {"name": "Luz", "age": 16, "active": True}
toon = to_toon(data, indent=2, mode="auto")
# TOON → JSON
parsed = from_toon(toon)
assert parsed == data
from toontools import validate_toon
toon_text = """
crew[2]{id,name}:
  1,Luz
  2,Amity
"""
is_valid, errors = validate_toon(toon_text, strict=True)
if not is_valid:
    for error in errors:
        print(f"Error: {error}")
from toontools import suggest_tabular
crew = [
{"id": 1, "name": "Luz"},
{"id": 2, "name": "Amity"}
]
suggestion = suggest_tabular(crew)
if suggestion.use_tabular:
print(f"Use tabular format! Estimated savings: {suggestion.estimated_savings} tokens")
print(f"Fields: {suggestion.keys}")from toontools import stream_to_toon
with open("large_data.json", "r") as fin, open("output.toon", "w") as fout:
    bytes_written = stream_to_toon(fin, fout, mode="compact")
print(f"Converted {bytes_written} bytes")
Convert YAML to TOON:
from toontools import to_toon_from_yaml
yaml_str = """
crew:
  - id: 1
    name: Luz
    role: Magic user
  - id: 2
    name: Amity
    role: Strategist
"""
toon_str = to_toon_from_yaml(yaml_str, mode="auto")
print(toon_str)
# Output:
# crew[2]{id,name,role}:
#   1,Luz,"Magic user"
#   2,Amity,Strategist
Convert TOON to YAML:
from toontools import to_yaml_from_toon
toon_str = """
crew[2]{id,name}:
  1,Luz
  2,Amity
active: true
"""
yaml_str = to_yaml_from_toon(toon_str)
print(yaml_str)
# Output:
# crew:
#   - id: 1
#     name: Luz
#   - id: 2
#     name: Amity
# active: true
Stream YAML to TOON:
from toontools import stream_yaml_to_toon
with open("data.yaml", "r") as fin, open("output.toon", "w") as fout:
    bytes_written = stream_yaml_to_toon(fin, fout, mode="auto")
print(f"Converted {bytes_written} bytes")
Note: Requires pip install toontools[yaml] or pip install "PyYAML>=6.0"
toonpy to --in data.json --out data.toon --mode readable --indent 2
toonpy from --in data.toon --out data.json --permissive
toonpy fmt --in data.toon --out data.formatted.toon --mode readable
toonpy yaml-to-toon --in data.yaml --out data.toon --mode auto
toonpy toon-to-yaml --in data.toon --out data.yaml
Note: YAML commands require pip install toontools[yaml]
Exit Codes:
- 0 - Success
- 2 - TOON syntax error
- 3 - General error
- 4 - I/O error
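In scripts, these exit codes make it easy to branch on the outcome. A minimal sketch using Python's subprocess module (assumes toonpy is on your PATH and data.toon exists):
import subprocess

result = subprocess.run(["toonpy", "from", "--in", "data.toon", "--out", "data.json"])
if result.returncode == 0:
    print("Converted successfully")
elif result.returncode == 2:
    print("data.toon contains a TOON syntax error")
elif result.returncode == 4:
    print("Could not read or write the files")
else:
    print(f"Conversion failed (exit code {result.returncode})")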
The library includes comprehensive unit tests, property-based tests, and performance benchmarks:
# Run all tests
pytest
# Run with coverage
pytest --cov=toonpy --cov-report=html
# Run performance benchmarks
pytest tests/test_benchmark.py -v -s
# Run specific test file
pytest tests/test_parser.py -v
Test Coverage:
- Unit tests for parser, serializer, API, and CLI
- Property-based tests with Hypothesis for round-trip verification
- Performance benchmarks for speed validation
- Edge cases: multiline strings, comments, empty containers
- Error handling and validation
Example Test Output:
============================= test session starts =============================
tests/test_parser.py::test_parse_object_and_array PASSED
tests/test_parser.py::test_parse_table_block PASSED
tests/test_serializer.py::test_round_trip_simple PASSED
tests/test_benchmark.py::test_serialize_small_data PASSED
...
============================== 20+ passed in 3.45s ==============================
toonpy v0.3.0 delivers exceptional performance with major speed improvements across all components. This release represents a comprehensive optimization effort with measurable gains of 20-70% in key operations.
| Component | Key Operation | Improvement | Impact |
|---|---|---|---|
| Parser | Comment-free files | +70% | Dramatically faster parsing when no comments present |
| Parser | Literal parsing | +30-40% | Common values (true, false, null) cached |
| Parser | Overall parsing | +20-50% | Comprehensive optimizations across all operations |
| Serializer | Key serialization paths | +70% | Type checking streamlined |
| Serializer | Container handling | +35-40% | Reduced redundant isinstance() checks |
| Utils | Number parsing | +10-15% | Try/except approach with regex fallback |
| Utils | Row splitting | Significant | String slicing instead of char-by-char building |
| Parallel | Memory usage | Improved | executor.map() for better efficiency |
Run the benchmarks to see real-time performance metrics:
# Run comprehensive benchmark suite
pytest tests/test_benchmark.py -v -s
# Run module-specific benchmarks
python benchmark_optimizations.py # Parser benchmarks
python benchmark_serializer.py # Serializer benchmarks
python benchmark_parallel.py       # Parallel module benchmarks
Typical Performance (v0.3.0 on modern hardware):
| Operation | Dataset Size | Time | Throughput | vs v0.2.0 |
|---|---|---|---|---|
| Serialize small data | 3 fields | ~0.010 ms | ~100K ops/s | +30% faster |
| Parse small data | 3 fields | ~0.012 ms | ~83K ops/s | +40% faster |
| Serialize tabular | 100 rows | ~0.30 ms | ~3,300 ops/s | ~70% faster |
| Parse tabular | 100 rows | ~1.20 ms | ~830 ops/s | ~40% faster |
| Round-trip | 500 rows | ~8.5 ms | ~118 ops/s | ~40% faster |
| Large file (1000 rows) | 1K records | ~3-4 ms | ~250-330 ops/s | ~50% faster |
| Nested structures | Depth 10 | ~0.25 ms | ~4,000 ops/s | ~170% faster |
| Comment removal | Comment-free | ~0.05 ms | 20K ops/s | ~70% faster |
Performance Characteristics:
- Blazing fast serialization - Optimized with literal caching and streamlined logic
- Efficient tabular format - Automatic detection reduces token count by 30-50%
- Competitive with JSON - Now only 3-5x slower than JSON (vs 7-12x in v0.2.0)
- Fast round-trips - Complete JSON → TOON → JSON conversion in single-digit milliseconds
- Token savings - Tabular format ideal for LLM applications
- Production-ready - Optimized for real-world workloads
Example Benchmark Output (v0.3.0):
[Benchmark] Small data serialization: 0.010 ms/op (30% faster)
[Benchmark] Small data parsing: 0.012 ms/op (40% faster)
[Benchmark] Tabular data serialization (100 rows): 0.300 ms (70% faster)
[Benchmark] Tabular data parsing (100 rows): 1.200 ms (40% faster)
[Benchmark] Round-trip (500 rows): 8.500 ms (40% faster)
[Benchmark] Performance comparison (100 rows):
JSON: 0.080 ms
TOON: 0.350 ms (v0.3.0)
Ratio: 4.37x (vs 7.41x in v0.2.0)
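To reproduce the JSON-vs-TOON comparison on your own hardware, a small timeit script along these lines works; the dataset below is illustrative and absolute numbers will differ from the table above:
import json
import timeit
from toontools import to_toon

# 100 uniform rows, similar in shape to the benchmark's tabular dataset
data = {"users": [{"id": i, "name": f"user{i}", "active": True} for i in range(100)]}

json_time = timeit.timeit(lambda: json.dumps(data), number=1000)
toon_time = timeit.timeit(lambda: to_toon(data, mode="auto"), number=1000)
print(f"json.dumps: {json_time:.4f} s for 1000 calls")
print(f"to_toon:    {toon_time:.4f} s for 1000 calls")
print(f"Ratio:      {toon_time / json_time:.2f}x")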
The v0.3.0 release includes comprehensive optimizations across all modules. Below are the key improvements:
What was done:
- Implemented a _LITERAL_CACHE dictionary for frequently used tokens
- Pre-stores parsed values for "true", "false", "null", "[]", "{}"
- Early return pattern in _parse_token() to check the cache first
Why it's faster:
- Before: Every literal required string processing, type detection, and conversion
- After: Common literals return cached value instantly, skipping all parsing logic
- Impact: Massive speedup for files with many boolean/null values
Code example:
# Before (slow):
if token.lower() == "true":
    return True
elif token.lower() == "false":
    return False
# ... more checks

# After (fast):
key = token.lower()
if key in _LITERAL_CACHE:
    return _LITERAL_CACHE[key]  # Instant return
What was done:
- Refactored _remove_block_comments() to use io.StringIO
- Added an early return if no block comments are detected
- Eliminated character-by-character string building
Why it's faster:
- Before: Always processed entire file character-by-character, building result with string concatenation
- After: Early exit if no /* is found; uses efficient StringIO when needed
- Impact: Most TOON files have no block comments, so they skip this processing entirely (see the sketch below)
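A simplified sketch of this pattern (the real _remove_block_comments() also has to respect comment markers inside quoted strings, which this version ignores):
import io

def remove_block_comments(text: str) -> str:
    # Early exit: most TOON files contain no block comments at all.
    if "/*" not in text:
        return text
    out = io.StringIO()
    i = 0
    while i < len(text):
        start = text.find("/*", i)
        if start == -1:
            out.write(text[i:])      # no more comments; copy the rest
            break
        out.write(text[i:start])     # copy the text before the comment
        end = text.find("*/", start + 2)
        if end == -1:
            break                    # unterminated comment; drop the remainder
        i = end + 2                  # continue after the closing */
    return out.getvalue()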
What was done:
- Changed guess_number() to use try/except for int() and float()
- Regex is used only for strict validation, not primary parsing
- Early rejection based on first character
Why it's faster:
- Before: Regex pattern matching for every number, which is relatively slow
- After: Native Python int/float conversion (fast path), regex only for edge cases
- Impact: Number-heavy files parse significantly faster
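A minimal sketch of the try/except fast path (the real guess_number() adds first-character rejection and a regex for strict validation, both omitted here):
def guess_number(token: str):
    """Sketch: try native conversion first, fall back to the raw string."""
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # not a number; the caller treats it as a plain string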
What was done:
- Optimized _inline_container_repr() to minimize isinstance() calls
- Removed redundant type checks in _write_value()
- Better code flow to avoid repeated checks
Why it's faster:
- Before: Multiple isinstance() checks on the same object
- After: Check once, remember the result, and use an efficient logic flow
- Impact: Especially noticeable when serializing many objects
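An illustrative before/after of the single-check pattern (hypothetical helper names, not the library's actual _write_value() code):
def classify_slow(value):
    # Before: the same object is type-checked several times on one path
    if isinstance(value, dict) and len(value) == 0:
        return "empty object"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, (list, tuple)) and len(value) == 0:
        return "empty array"
    if isinstance(value, (list, tuple)):
        return "array"
    return "scalar"

def classify_fast(value):
    # After: each isinstance() check runs at most once per value
    if isinstance(value, dict):
        return "empty object" if not value else "object"
    if isinstance(value, (list, tuple)):
        return "empty array" if not value else "array"
    return "scalar"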
What was done:
- Replaced character-by-character list building in split_escaped_row()
- Used efficient string slicing to extract segments
- Eliminated the intermediate list and join() overhead
Why it's faster:
- Before: Loop through each char, append to list, join at end
- After: Slice string directly at split points
- Impact: Much faster for tabular data with many rows
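A simplified sketch of the slicing approach (the real split_escaped_row() in toonpy/utils.py also handles quoted cells; this version only honors backslash escapes):
def split_escaped_row(row: str, sep: str = ",") -> list[str]:
    cells = []
    start = 0
    i = 0
    while i < len(row):
        if row[i] == "\\":
            i += 2                      # skip the escaped character
            continue
        if row[i] == sep:
            cells.append(row[start:i])  # slice out the whole segment at once
            start = i + 1
        i += 1
    cells.append(row[start:])
    return cells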
What was done:
- Implemented a cache for indentation strings (0-20 levels)
- Pre-computes common indentation strings instead of creating them repeatedly
- Uses a _get_indent() method with a _indent_cache dictionary
Why it's faster:
- Before: Each line required creating a new string with " " * (level * indent), which allocates memory and performs string multiplication repeatedly
- After: Common indentation levels are computed once and reused, eliminating redundant string creation
- Impact: Most noticeable in deeply nested structures where the same indentation levels are used many times
Code example:
# Before (slow):
lines.append(" " * level + content) # Creates new string every time
# After (fast):
indent_str = self._get_indent(level) # Uses cache
lines.append(indent_str + content)What was done:
- Eliminated string concatenation with the + operator in loops
- Pre-compute common prefixes (like "-" for arrays)
- Use join() once at the end instead of multiple concatenations
- Build rows as lists and join once per row
Why it's faster:
- Before: Python's + operator for strings creates a new string object each time, which is O(n) per concatenation
- After: Building a list and using join() is O(n) total, which is much more efficient
- Impact: Especially noticeable in tabular format where many rows are processed
Code example:
# Before (slow):
row = ""
for cell in cells:
    row += cell + ","  # Creates a new string each iteration
# After (fast):
row_str = ",".join(cells)  # Single join operation
What was done:
- Compiled regex patterns as class attributes instead of compiling them on each call
- Patterns are compiled once when the class is defined, not per instance
Why it's faster:
- Before: re.match(pattern, text) has to look up (or recompile) the pattern on every call
- After: Pre-compiled patterns stored as _QUOTED_TABLE_PATTERN and _UNQUOTED_TABLE_PATTERN are reused
- Impact: Most noticeable when parsing many table headers
Code example:
# Before (slow):
match = re.match(r'^"([^"]+)"\[(\d+)\]\{([^}]+)\}:$', content)
# After (fast):
match = self._QUOTED_TABLE_PATTERN.match(content)  # Pre-compiled
What was done:
- Only normalize line endings if \r is present in the source
- Avoids unnecessary string operations on Unix-style text
Why it's faster:
- Before: Always performed replace("\r\n", "\n").replace("\r", "\n") even when not needed
- After: Checks for \r first and only normalizes if necessary
- Impact: Small but consistent improvement, especially for large files (see the sketch below)
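A sketch of the guard (helper name assumed for illustration):
def normalize_newlines(source: str) -> str:
    # Skip the replace() calls entirely for Unix-style text.
    if "\r" in source:
        source = source.replace("\r\n", "\n").replace("\r", "\n")
    return source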
What was done:
- Created a toonpy.parallel module with parallel_serialize_chunks()
- Uses concurrent.futures (ThreadPoolExecutor or ProcessPoolExecutor)
- Allows processing large arrays in parallel chunks
Why it's faster:
- Before: Large arrays processed sequentially on a single core
- After: Arrays divided into chunks, each processed in parallel
- Impact: Significant speedup for very large datasets (>10K elements) on multi-core systems
Usage:
from toonpy.parallel import parallel_serialize_chunks, chunk_sequence
from toonpy import ToonSerializer
large_array = [{"id": i} for i in range(50000)]
chunks = chunk_sequence(large_array, chunk_size=5000)
serializer = ToonSerializer()
results = parallel_serialize_chunks(
    chunks,
    serializer.dumps,
    use_threads=False,  # Use processes for CPU-bound work
    max_workers=4,
)
| Optimization | Improvement | Best For | Version |
|---|---|---|---|
| Literal Caching | 30-40% | Files with many booleans/nulls | v0.3.0 |
| StringIO Comment Removal | 70% | Comment-free files (most common) | v0.3.0 |
| Try/Except Number Parsing | 10-15% | Number-heavy data | v0.3.0 |
| Streamlined Type Checking | 35-40% | Object serialization | v0.3.0 |
| String Slicing Row Parsing | Significant | Tabular data with many rows | v0.3.0 |
| Indentation Caching | 15-20% | Nested structures, deep hierarchies | v0.2.0 |
| String Concatenation | 5-10% general, 60% tabular | Tabular arrays, large datasets | v0.2.0 |
| Compiled Regex | 3-5% | Table parsing, repeated patterns | v0.2.0 |
| Line Ending Optimization | 1-2% | Large files, Unix-style text | v0.2.0 |
| Parallelism | 2-4x | Arrays >10K elements | v0.2.0 |
Overall Impact (v0.3.0 vs v0.2.0):
- Parser: 20-50% faster overall, 70% faster for comment-free files
- Serializer: Up to 70% faster in key paths, 35-40% faster container handling
- Utils: 10-15% faster number parsing, significant row parsing improvement
- Tabular serialization: ~70% faster (0.30 ms vs 0.55 ms)
- Tabular parsing: ~40% faster (1.20 ms vs 1.70 ms)
- Round-trip: ~40% faster (8.5 ms vs 11.9 ms)
- Nested structures: ~170% faster throughput (4,000 ops/s vs 2,300 ops/s)
v0.3.0 vs v0.1.0 (Initial Release):
- Parser: ~100-150% faster (2-2.5x speedup)
- Serializer: ~200% faster (3x speedup)
- Overall throughput: ~140% improvement
These optimizations maintain full TOON SPEC v2.0 compliance while dramatically improving performance. All improvements are production-tested with 24/24 tests passing.
Detailed Documentation:
- RELEASE_NOTES.md - Complete v0.3.0 release notes
- OPTIMIZATIONS_DOCUMENTED.md - 23-page technical analysis
- ALL_OPTIMIZATIONS_SUMMARY.md - Comprehensive overview
- Run benchmark_optimizations.py, benchmark_serializer.py, and benchmark_parallel.py for detailed metrics
Input JSON:
{
"crew": [
{"id": 1, "name": "Luz", "role": "Light glyph"},
{"id": 2, "name": "Amity", "role": "Abomination strategist"}
],
"active": true,
"ship": {
"name": "Owl House",
"location": "Bonesborough"
}
}Output TOON (auto mode):
crew[2]{id,name,role}:
  1,Luz,"Light glyph"
  2,Amity,"Abomination strategist"
active: true
ship:
  name: "Owl House"
  location: Bonesborough
Token Savings: The tabular format (crew[2]{id,name,role}:) reduces token count by ~40% compared to standard JSON array format!
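You can measure the savings yourself with the optional tiktoken extra; a quick sketch (exact counts depend on the tokenizer and data):
import json
import tiktoken
from toontools import to_toon

data = {"crew": [{"id": 1, "name": "Luz"}, {"id": 2, "name": "Amity"}]}
enc = tiktoken.get_encoding("cl100k_base")

json_tokens = len(enc.encode(json.dumps(data)))
toon_tokens = len(enc.encode(to_toon(data, mode="auto")))
print(f"JSON: {json_tokens} tokens, TOON: {toon_tokens} tokens")
print(f"Savings: {1 - toon_tokens / json_tokens:.0%}")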
Convert a Python object to TOON format string.
Parameters:
- obj (Any): Python object compatible with the JSON data model
- indent (int): Number of spaces per indentation level (default: 2)
- mode (str): Serialization mode - "auto", "compact", or "readable"
Returns: str - TOON-formatted string
Example:
data = {"name": "Luz", "active": True}
toon = to_toon(data, mode="auto")
Parse a TOON string into a Python object.
Parameters:
- source (str): TOON-formatted string to parse
- mode (str): Parsing mode - "strict" or "permissive"
Returns: Any - Python object (dict, list, or scalar)
Raises: ToonSyntaxError if TOON string is malformed
Example:
toon = 'name: "Luz"\nactive: true'
data = from_toon(toon)
Validate a TOON string for syntax errors.
Parameters:
- source (str): TOON-formatted string to validate
- strict (bool): If True, use strict parsing mode
Returns: tuple[bool, List[ValidationError]] - (is_valid, list_of_errors)
Suggest whether an array should use tabular format.
Parameters:
- obj (Sequence): Sequence to analyze
Returns: TabularSuggestion - Recommendation with estimated savings
Stream JSON from input file to TOON output file.
Parameters:
- fileobj_in (TextIO): Input file object containing JSON
- fileobj_out (TextIO): Output file object for TOON
- chunk_size (int): Size of chunks to read (default: 65536)
- indent (int): Indentation level
- mode (str): Serialization mode
Returns: int - Number of bytes written
Raised when TOON input does not conform to the grammar.
Attributes:
- message (str): Error message
- line (int | None): Line number (1-indexed)
- column (int | None): Column number (1-indexed)
Example:
try:
data = from_toon("invalid syntax")
except ToonSyntaxError as e:
print(f"Error at line {e.line}, column {e.column}: {e.message}")- Python >= 3.9
- No external dependencies (pure Python)
- Optional: tiktoken >= 0.5.2 for token counting (install with pip install .[examples])
Comprehensive documentation is available in the repository:
- docs/spec_summary.md - Concise TOON SPEC v2.0 overview with ABNF notes
- docs/examples.md - JSON ↔ TOON conversion examples
- docs/assumptions.md - Documented gaps/assumptions and strict vs. permissive behavior
- DESIGN_PHILOSOPHY.md - Architecture decisions and design principles (why zero-dependency core, optional features, etc.)
- RELEASE_NOTES.md - Complete v0.3.0 release notes with upgrade guide
- CHANGELOG.md - Traditional changelog with version history
- YAML_SUPPORT_SUMMARY.md - Complete YAML support implementation details
- OPTIMIZATION_README.md - Quick start guide to the optimization docs
- OPTIMIZATIONS_DOCUMENTED.md - 23-page detailed technical analysis
- ALL_OPTIMIZATIONS_SUMMARY.md - Comprehensive optimization overview
- SERIALIZER_OPTIMIZATIONS.md - Serializer-specific optimizations
- UTILS_OPTIMIZATIONS.md - Utils module improvements
- PARALLEL_OPTIMIZATIONS.md - Parallel processing enhancements
- OPTIMIZATION_PROJECT_SUMMARY.md - Executive summary of the optimization project
- benchmark_optimizations.py - Parser performance benchmarks
- benchmark_serializer.py - Serializer performance benchmarks
- benchmark_parallel.py - Parallel module benchmarks
- benchmark_summary.py - Visual benchmark summary generator
Note: Tabular format heuristics are documented in the code (see toonpy/serializer.py and toonpy/utils.py). The library automatically detects uniform arrays and uses tabular format when it saves tokens.
- Data Serialization: Efficient storage and transmission of structured data
- API Development: Lightweight data format for REST APIs
- Configuration Files: Human-readable config format with comments support
- Data Pipelines: Stream processing of large JSON datasets
- ML/AI Projects: Token-optimized format for LLM training data
- Documentation: Self-documenting data format with inline comments
This library includes comprehensive examples covering all use cases from the official TOON specification examples. Check out the examples/ directory:
- example1 - Basic tabular array with nested objects
- example2 - Nested objects with arrays
- example3 - Mixed array types
- example4 - Multiline strings
- example5 - Empty containers and scalars
- example6 - Large tabular arrays
- example7 - Complex nested structures
- example8 - Deep nesting examples
All examples are compatible with the official TOON specification and can be validated against the reference implementation.
Try them with the CLI:
toonpy to --in examples/example1.json --out examples/example1.generated.toon
toonpy from --in examples/example1.toon --out examples/example1.generated.json
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Guidelines:
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass: pytest
- Keep additions aligned with TOON SPEC v2.0
This project is licensed under the MIT License - see the LICENSE file for details.
Christian Palomares - @shinjidev
If you find this project helpful, consider supporting my work:
Buy me a coffee to help me continue developing open-source tools for the developer community!
- Built following TOON SPEC v2.0
- Inspired by the need for efficient, token-optimized data serialization
- Uses property-based testing with Hypothesis for robust validation
Star this repository if you find it useful!