minemizer is a format focused on representing data with the fewest tokens possible (up to 4x savings!) while keeping LLM accuracy high. It is CSV-like, but supports sparse and nested data, and stays minimal and human readable.
More interactive benchmarks can be found here: https://ashirviskas.github.io/
CSV-like for flat data:
from minemizer import minemize
data = [
    {"name": "Marta", "role": "Engineer", "team": "Backend"},
    {"name": "James", "role": "Designer", "team": "Frontend"},
    {"name": "Sophie", "role": "Manager", "team": "Product"},
]
print(minemize(data))

Output:

name; role; team
Marta; Engineer; Backend
James; Designer; Frontend
Sophie; Manager; Product
Nested data:

data = [
    {"id": 1, "name": "Yuki", "address": {"street": "12 Sakura Lane", "city": "Kyoto"}},
    {"id": 2, "name": "Lin", "address": {"street": "88 Garden Road", "city": "Taipei"}},
]
print(minemize(data))

Output:

id; name; address{ street; city}
1; Yuki;{ 12 Sakura Lane; Kyoto}
2; Lin;{ 88 Garden Road; Taipei}
Control how sparse fields are handled using sparsity_threshold (default=0.5).
data = [
    {"id": 1, "name": "Lukas", "location": {"city": "Vilnius", "floor": 3}},
    {"id": 2, "name": "Emma", "location": {"city": "Boston", "floor": 7, "desk": "A12"}},
    {"id": 3, "name": "Yuki", "location": {"city": "Tokyo", "floor": 5}},
    {"id": 4, "name": "Oliver", "location": {"city": "London", "floor": 2, "desk": "B04"}},
]
# Default (0.5): "desk" appears in 50% of records, so it is included in the schema
print(minemize(data))
# Strict (1.0): only fields present in all records stay in the schema; sparse fields move into the data rows
print(minemize(data, sparsity_threshold=1.0))

Output with the default (0.5) sparsity_threshold:

id; name; location{ city; floor; desk}
1; Lukas;{ Vilnius; 3;}
2; Emma;{ Boston; 7; A12}
3; Yuki;{ Tokyo; 5;}
4; Oliver;{ London; 2; B04}
----------
Output with a strict (1.0) sparsity_threshold: only fields present in ALL records go in the schema, so "desk" becomes sparse:

id; name; location{ city; floor; ...}
1; Lukas;{ Vilnius; 3}
2; Emma;{ Boston; 7; desk: A12}
3; Yuki;{ Tokyo; 5}
4; Oliver;{ London; 2; desk: B04}
tl;dr:
- Original data size (JSON pretty): 763 chars | 312.8 tokens | 2.4 chars/token
  - minemizer: 251 chars | 75.8 tokens | 10.1 original chars/token
  - toon: 246 chars | 97.2 tokens | 7.8 original chars/token
- Original data size (JSON pretty): 1039 chars | 430.2 tokens | 2.4 chars/token
  - minemizer: 325 chars | 124.5 tokens | 8.3 original chars/token
  - toon: 675 chars | 249.8 tokens | 4.2 original chars/token
- up to 4x token savings (~1.5x on average)
- LLMs handle more data with the same token budget
- The most token-efficient of the formats tested
- Human readable
- Simple format - basically CSV when data is flat
- Simple implementation with no dependencies (core is <500 LoCs)
- Can increase data comprehension and retrieval accuracy (YAML won in some cases, but at a much higher token usage and within the margin of error)
- Flexible
- No regex in the core, so the code is super readable too!
| Format | Chars | gpt2 | llama | qwen2.5 | DeepSeek-V3.2 | Avg tokens | Orig chars/token |
|---|---|---|---|---|---|---|---|
| JSON (pretty) | 763 | 384 | 334 | 264 | 269 | 312.8 | 2.4 |
| JSON (min) | 522 | 152 | 165 | 137 | 149 | 150.8 | 5.1 |
| CSV | 234 | 95 | 101 | 77 | 90 | 90.8 | 8.4 |
| TSV | 234 | 95 | 101 | 77 | 91 | 91.0 | 8.4 |
| YAML | 489 | 163 | 180 | 169 | 171 | 170.8 | 4.5 |
| TOON | 246 | 98 | 103 | 96 | 92 | 97.2 | 7.8 |
| TSON | 229 | 90 | 95 | 80 | 85 | 87.5 | 8.7 |
| minemizer | 251 | 74 | 83 | 72 | 74 | 75.8 | 10.1 |
| minemizer (compact) | 224 | 85 | 91 | 77 | 82 | 83.8 | 9.1 |
See interactive benchmarks for detailed tokenization and accuracy comparison across different tokenizers and LLMs.
Simply uv add minemizer, pip install minemizer, or poetry add minemizer.

To install from source:

pip install git+https://github.com/ashirviskas/minemizer.git
uv add git+https://github.com/ashirviskas/minemizer.git
poetry add git+https://github.com/ashirviskas/minemizer.git

Set global defaults or use per-call overrides:
from minemizer import config, minemize
# Configure globally
config.delimiter = "|"
config.use_spaces = False
data = [{"a": 1, "b": 2}]
print(minemize(data)) # a|b \n 1|2
# Override per-call
print(minemize(data, delimiter=",")) # a,b \n 1,2

| Option | Default | Description |
|---|---|---|
| `delimiter` | `";"` | Field separator |
| `use_spaces` | `True` | Add space after delimiter |
| `sparsity_threshold` | `0.5` | Key frequency threshold for header (0.0-1.0) |
| `sparse_indicator` | `"..."` | Indicator for sparse fields in schema |
| `header_separator` | `None` | Separator row after header (e.g., `"---"`) |
| `wrap_lines` | `None` | Wrap each line with this string (e.g., "\|") |
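For instance, `header_separator` and `wrap_lines` can be combined to get markdown-looking rows. A minimal sketch, assuming every option in the table can also be passed per call, the same way `delimiter` and `sparsity_threshold` are in the examples above:

```python
from minemizer import minemize

data = [
    {"name": "Marta", "role": "Engineer"},
    {"name": "James", "role": "Designer"},
]

# Assumed per-call overrides for the options listed above:
# header_separator adds a separator row after the header and
# wrap_lines wraps every line, so together they approximate a markdown table
# (the markdown preset below does the same thing in one step).
print(minemize(data, delimiter="|", header_separator="---", wrap_lines="|"))
```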
I added some presets for fun if you want your data to look more like something else that might help your LLM understand it better while still keeping some minemizer optimizations. It does not guarantee the format will be compliant, but hey, at least it looks like it.
from minemizer import minemize, presets

If you cannot tell the difference, does it really matter?

print(minemize(data, preset=presets.csv))

Output:

name,role,team
Marta,Engineer,Backend
James,Designer,Frontend
Sophie,Manager,Product
Works all the time, 75% of the time (don't try nested pls)
print(minemize(data, preset=presets.markdown))

Output:

|name| role| team|
|---| ---| ---|
|Marta| Engineer| Backend|
|James| Designer| Frontend|
|Sophie| Manager| Product|
Rendered:
| name | role | team |
|---|---|---|
| Marta | Engineer | Backend |
| James | Designer | Frontend |
| Sophie | Manager | Product |
| Preset | Description |
|---|---|
| `presets.default` / `presets.llm` | Optimized for LLM token efficiency (semicolon, spaces) |
| `presets.markdown` | Proper markdown table with header separator |
| `presets.csv` | Comma-separated values |
| `presets.tsv` | Tab-separated values |
| `presets.compact` | Minimal characters (like default, just no spaces) |
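Presets are passed exactly as in the CSV and markdown examples above. A small sketch comparing a few of them on the same data (output elided since it depends on the preset):

```python
from minemizer import minemize, presets

data = [
    {"name": "Marta", "role": "Engineer", "team": "Backend"},
    {"name": "James", "role": "Designer", "team": "Frontend"},
]

# Same data, different presets; only the preset argument changes.
for preset in (presets.default, presets.compact, presets.tsv):
    print(minemize(data, preset=preset))
    print("----------")
```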
See examples/ for more detailed examples.
Last updated: 2025-12-01
Normalized comparison (JSON pretty = 1.0x):
| Format | flat | nested | lists | sparse | complex | books | countries | large_mixed | large_numerical | large_text | mcp_tools | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JSON (pretty) | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x | 1.0x |
| JSON (min) | 2.1x | 2.3x | 2.4x | 2.0x | 2.2x | 1.5x | 1.5x | 2.1x | 1.7x | 1.7x | 2.3x | 2.0x |
| CSV | 3.4x | ✗ | ✗ | ✗ | ✗ | 2.0x | ✗ | ✗ | ✗ | ✗ | ✗ | 2.7x** |
| TSV | 3.4x | ✗ | ✗ | ✗ | ✗ | 2.0x | ✗ | ✗ | ✗ | ✗ | ✗ | 2.7x** |
| YAML | 1.8x | 1.8x | 1.8x | 1.8x | 1.7x | 1.3x | 2.1x | 1.7x | 1.4x | 1.5x | 1.5x | 1.7x |
| TOON | 3.2x | 1.7x | 1.9x | 1.6x | 1.6x | 2.0x | 2.0x | 1.5x | 1.3x | 1.5x | 1.5x | 1.8x |
| TSON | 3.6x | 3.4x | 3.7x | 2.0x | 2.6x | 2.0x | 2.9x | 1.9x | 1.7x | 1.6x | 2.4x | 2.5x |
| minemizer | 4.1x | 3.5x | 3.7x | 3.6x | 3.1x | 2.0x | 3.7x | 2.4x | 1.8x | 2.2x | 2.9x | 3.0x |
| minemizer (compact) | 3.7x | 3.4x | 3.6x | 3.3x | 3.0x | 2.1x | 3.6x | 2.4x | 1.9x | 2.1x | 2.9x | 2.9x |
Higher is better. ✗ = format cannot represent this data type. ** = average from partial data.
See interactive benchmarks or markdown for detailed comparison across different tokenizers and LLMs.
# Install benchmark dependencies
uv sync --group benchmark
# Run compression benchmarks (token efficiency)
uv run python -m benchmarks compression
# Generate synthetic data for LLM benchmarks
uv run python -m benchmarks generate --sizes 50,100,1000,5000
# Run LLM accuracy benchmarks (requires local llama.cpp server)
uv run python -m benchmarks llm --model "your-model" --data nested_1000 --queries 50
# Generate HTML report from LLM results
uv run python -m benchmarks report --include-all

Design decisions:

- Delimiter: `";"` - chosen mostly arbitrarily, as it is not used too often in text data, but is used often enough to be recognized as a separator by LLMs.
- Use spaces: `True` - renders strings as `{ somevalue; othervalue}` instead of `{somevalue;othervalue}` for better tokenization efficiency. It introduces slightly more tokens on average (~3-5% in my testing), but the tokens more often preserve whole words. For example, `{Hana;pyramid}` tokenizes to `{|H|ana|;p|yramid}` (5 tokens, and the words are split), while `{ Hana; pyramid}` tokenizes to `{| Hana|;| pyramid|}` (still 5 tokens, but the words are preserved). This will not matter much for bigger LLMs, but for smaller models it can make a difference. If you use a model with 100B+ parameters, you can probably set this to `False` and save some tokens. Real benchmarks are more than welcome.
- Sparsity threshold: `0.5` - if a key appears in less than 50% of records, it becomes sparse.
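If you want to check the use_spaces trade-off against your own tokenizer, here is a quick sketch using tiktoken (not a minemizer dependency; the encoding name is just an example):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example encoding; swap in your model's tokenizer

for text in ("{Hana;pyramid}", "{ Hana; pyramid}"):
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    # Shows how the string splits into token pieces and how many tokens it costs
    print(f"{text!r} -> {pieces} ({len(token_ids)} tokens)")
```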
- Not battle tested
- Not a standard format
- Standard not finalized yet
- Cannot convert the data back to the original format (no parser implementation)
- Deal with auto-formatting numbers (floats, e.g. Python `{number:.5g}`, maybe as optional), dates (ISO 8601 FTW, LLMs do like it very much) etc.; until then you can pre-format values yourself, see the sketch after this list
- Create presets for different LLM tokenizers/models to maximize token efficiency (fewer tokens) and/or performance (better benchmarks)
- Support for type hints to optimize formatting (e.g., dates, numbers)
- Per-field configuration (custom date formats, number precision, unix timestamp to datetime, etc.)
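Until built-in number and date formatting lands, you can pre-format values yourself before calling minemize. A rough sketch (`preformat` is a hypothetical helper, not part of minemizer; the `.5g` rounding and ISO 8601 choices just mirror the TODO above):

```python
from datetime import datetime, timezone
from minemizer import minemize

def preformat(record: dict) -> dict:
    # Hypothetical helper: round floats to 5 significant digits and
    # render datetimes as ISO 8601 strings before minemizing.
    out = {}
    for key, value in record.items():
        if isinstance(value, float):
            out[key] = f"{value:.5g}"
        elif isinstance(value, datetime):
            out[key] = value.isoformat()
        elif isinstance(value, dict):
            out[key] = preformat(value)  # handle nested objects too
        else:
            out[key] = value
    return out

data = [
    {"id": 1, "price": 19.998765, "created": datetime(2025, 1, 2, 15, 30, tzinfo=timezone.utc)},
    {"id": 2, "price": 3.1415927, "created": datetime(2025, 2, 3, 9, 45, tzinfo=timezone.utc)},
]
print(minemize([preformat(r) for r in data]))
```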
PRs are very welcome!