Implementation and experiment pipeline for ScribeTokens, the digital-ink tokenization scheme proposed in the paper.
📄 Paper | 📦 Library (tokink)
ScribeTokens represents pen trajectories with a fixed base vocabulary of 10 tokens:
- 8 directional unit-step tokens (Freeman-style chain directions)
- 2 pen-state tokens (`[DOWN]`, `[UP]`)
The core idea is to decompose stroke segments into unit pixel steps (via Bresenham decomposition), then apply BPE over this base alphabet. This keeps tokenization OOV-free at the base level while still allowing strong compression and stable cross-entropy training.
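The unit-step decomposition can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation from the paper or the tokink library; in particular, the Freeman numbering used here (0 = +x, counter-clockwise, with y pointing up) is an assumption.

```python
# Illustrative sketch: decompose one pen-down segment into unit pixel
# steps via Bresenham's line algorithm, then map each step to one of
# the 8 Freeman chain-code directions (token indices 0..7).
FREEMAN = {  # (dx, dy) -> direction token index (assumed numbering)
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

def bresenham(x0, y0, x1, y1):
    """Yield the integer grid points of the line from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    x, y = x0, y0
    while True:
        yield x, y
        if (x, y) == (x1, y1):
            return
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x += sx
        if e2 <= dx:
            err += dx
            y += sy

def segment_to_tokens(x0, y0, x1, y1):
    """Map a stroke segment to a sequence of base direction tokens."""
    pts = list(bresenham(x0, y0, x1, y1))
    return [FREEMAN[(x - px, y - py)] for (px, py), (x, y) in zip(pts, pts[1:])]
```

Because consecutive Bresenham points always differ by a unit step (possibly diagonal), every segment maps onto the 8-token base alphabet with no out-of-vocabulary symbols; BPE is then learned on top of these sequences.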
This project expects Python 3.12 (see `pyproject.toml`). Install dependencies:

```sh
uv sync
```
Scripts import modules from src/.
- In the VS Code integrated terminal, `.vscode/settings.json` already sets `PYTHONPATH` to `${workspaceFolder}/src`.
- Outside VS Code, set it manually:

  ```sh
  export PYTHONPATH="$PWD/src"
  ```

Set `DATASET` in `src/constants.py`. Valid values: `"iam"` or `"deepwriting"`.
Most scripts read paths from that constant and some preprocess scripts assert a specific dataset value.
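As a rough sketch of how a single dataset constant can drive the rest of the pipeline (the actual `src/constants.py` may look different; `RAW_DIR`, `PARSED_DIR`, and `SPLIT_DIR` are hypothetical names chosen to match the directory layout documented below):

```python
# Hypothetical sketch of src/constants.py; the real file may differ.
DATASET = "iam"  # or "deepwriting"
assert DATASET in ("iam", "deepwriting"), f"unsupported dataset: {DATASET}"

# Derived paths matching the data/<dataset>/ layout described below.
RAW_DIR = f"data/{DATASET}/raw"
PARSED_DIR = f"data/{DATASET}/parsed"
SPLIT_DIR = f"data/{DATASET}/split"
```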
Download from the IAM On-Line Handwriting Database:
- `data/lineStrokes-all.tar.gz` — extract into `data/iam/raw/lineStrokes/`
- `data/original-xml-all.tar.gz` — extract into `data/iam/raw/original/`
- `trainset.txt` from "training set"
- `testset_v.txt` from "first validation set"
- `testset_t.txt` from "second validation set"
- `testset_f.txt` from "final test set"
Place the raw data so the directory matches:
```
data/iam/raw/
├── lineStrokes/     # XML files with stroke coordinates (e.g. a01-000u-01.xml)
│   └── **/*.xml
├── original/        # XML files with form metadata + text labels (e.g. a01-000u.xml)
│   └── **/*.xml
├── trainset.txt     # official train split (line ID prefixes)
├── testset_v.txt    # official validation split 1
├── testset_t.txt    # official validation split 2
└── testset_f.txt    # official test split
```
Then preprocess:
```sh
make preprocess-iam
```

This parses the raw XML into `data/iam/parsed/*.json` and generates standardized split files under `data/iam/split/`.
Download the "Dataset with timestamps" from DeepWriting.
Place the raw data so the directory matches:
```
data/deepwriting/raw/
└── Iamondb Dataset/  # JSON files with word-level stroke data
    └── **/*.json
```
Then preprocess:
```sh
make preprocess-deepwriting
```

This parses the JSON into `data/deepwriting/parsed/*.json` and generates random train/val/test splits under `data/deepwriting/split/`.
After preprocessing, each dataset directory will contain:
```
data/<dataset>/
├── raw/     # original downloaded data (see above)
├── parsed/  # one JSON per sample (generated by preprocessing)
│   └── *.json
└── split/   # train/val/test ID lists (generated by preprocessing)
    ├── train.txt
    ├── val.txt
    └── test.txt
```
```
scribe-tokens/
├── data/
│   ├── iam/              # raw, parsed, and split files for IAM
│   └── deepwriting/      # raw, parsed, and split files for DeepWriting
├── models/               # exported best weights per dataset/task/repr
├── output/
│   ├── results/          # CSV metrics
│   ├── tables/           # LaTeX tables
│   └── figures/          # PDF figures
├── scripts/
│   ├── preprocess/       # parse raw datasets into parsed JSON + splits
│   ├── train/            # model/tokenizer training entrypoints
│   ├── eval/             # tokenizer + HTR/HTG evaluation
│   ├── plot/             # tables + plotting scripts
│   └── utils/            # project utilities
├── src/                  # core library code (models, tokenizers, loaders)
├── tests/                # unit tests
├── tokenisers/           # trained tokenizers
└── Makefile
```
All script entrypoints are exposed as make targets. Run from repo root:
```sh
make <target>
```

| Target | What it does |
|---|---|
| **Linting** | |
| `format` | `ruff format` |
| `check` | `ruff check --fix` |
| `format-check` | Runs both `format` and `check` |
| **Preprocessing** | |
| `parse-iam` | Parse IAM raw XML into `data/iam/parsed/` |
| `parse-deepwriting` | Parse DeepWriting JSON into `data/deepwriting/parsed/` |
| `split-deepwriting` | Generate random train/val/test splits for DeepWriting |
| `preprocess-iam` | `parse-iam` (splits are generated during parsing) |
| `preprocess-deepwriting` | `parse-deepwriting` then `split-deepwriting` |
| **Tokenizer** | |
| `train-tokenizers` | Train tokenizer families (delta/vocab sweep) |
| `eval-compression` | Evaluate tokenizer compression metrics |
| `eval-oov` | Evaluate tokenizer OOV metrics |
| `eval-tokenizers` | Runs `eval-compression` and `eval-oov` |
| **Training** | |
| `train` | Train all default model/task combinations |
| `train-test` | Quick test run (`--all --test`) |
| `train-parallel` | Runs `scripts/train/parallel.sh` |
| **Evaluation** | |
| `eval` | Evaluate all supported tasks |
| `eval-htr` | Evaluate HTR only |
| `eval-htg` | Evaluate HTG only |
| **Plotting** | |
| `plot` | Generate all figures and tables |
| `plot-compression` | Compression figure only |
| `plot-oov` | OOV figure only |
| `plot-discretization` | Discretization figure only |
| `plot-double-descent` | Double-descent figure only |
| `plot-attention` | Attention visualization figure(s) only |
| `plot-convergence` | Convergence speedup table only |
| `plot-results` | Result CSVs to LaTeX tables only |
| `plot-scribe` | Scribe token visualization figure only |
| `plot-htg` | HTG handwriting sample grid figure only |
| **Utilities** | |
| `move-checkpoints` | Move best checkpoint weights into `models/` |
| `fetch-metrics` | Fetch SwanLab run metrics to CSV |
| `fetch-compute-time` | Print total compute time from SwanLab runs |
| `kill` | Kill processes matching `scribe-tokens` |
| `check-cuda` | Print CUDA availability/device via PyTorch |
| `tmux` | Open/attach tmux session named `train` |
`scripts/train/parallel.sh` contains a hard-coded project path:

```sh
cd /home/ubuntu/projects/scribe-tokens
```

Update it for your machine before using `make train-parallel`.
All entrypoints are available as make targets (see table above). The examples below use make where a target exists, and show the underlying command for cases that take arguments.
```sh
# IAM
make preprocess-iam

# DeepWriting
make preprocess-deepwriting
```

See Data Setup for the expected raw directory layouts.
```sh
# 1) train tokenizer families (delta/vocab sweep)
make train-tokenizers

# 2) evaluate tokenizer quality metrics
make eval-tokenizers

# 3) plot tokenizer metrics
make plot-compression
make plot-oov
```

Outputs:

- `output/results/compression.csv`
- `output/results/oov.csv`
- `output/figures/compression.pdf`
- `output/figures/oov.pdf`
```sh
# Train all default models/tasks
make train

# Quick test mode
make train-test

# Single model (use the underlying command for CLI args)
uv run -m scripts.train.main --task HTR --repr scribe

# Optional overrides
uv run -m scripts.train.main --all --epochs 50 --batch-size 16
```

CLI options (from `scripts/train/main.py`):

- `--all` or `--task <TASK>` (mutually exclusive)
- `--repr <scribe|point5|rel|text>` (required when using `--task`)
- `--test`
- `--experiment-name <name>`
- `--epochs <int>`
- `--batch-size <int>`

Supported tasks: `HTR`, `HTG`, `NTP`, `HTR_SFT`, `HTG_SFT`
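For reference, the documented flags correspond to an argparse interface along these lines. This is an illustrative reconstruction from the option list above; the actual parser in `scripts/train/main.py` may differ.

```python
import argparse

# Illustrative reconstruction of the documented CLI; the real parser may differ.
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="scripts.train.main")
    group = p.add_mutually_exclusive_group(required=True)
    group.add_argument("--all", action="store_true",
                       help="train all default model/task combinations")
    group.add_argument("--task",
                       choices=["HTR", "HTG", "NTP", "HTR_SFT", "HTG_SFT"])
    p.add_argument("--repr", choices=["scribe", "point5", "rel", "text"],
                   help="required when using --task")
    p.add_argument("--test", action="store_true", help="quick test run")
    p.add_argument("--experiment-name")
    p.add_argument("--epochs", type=int)
    p.add_argument("--batch-size", type=int)
    return p

args = build_parser().parse_args(["--task", "HTR", "--repr", "scribe"])
```

Note that argparse alone cannot express "`--repr` is required only with `--task`"; that constraint is documented in the help text here and would be enforced after parsing.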
```sh
# All supported tasks
make eval

# Individual tasks
make eval-htr
make eval-htg

# One task with args
uv run -m scripts.eval.main --task HTR
```

Writes CSVs to `output/results/` (for example `htr.csv`, `htg.csv`).
```sh
# All figures and tables
make plot

# Individual targets
make plot-compression
make plot-oov
make plot-discretization
make plot-double-descent
make plot-attention
make plot-convergence
make plot-results
make plot-scribe
make plot-htg
```
```sh
# Results table for one task
uv run -m scripts.plot.results --task HTR
```

```sh
make move-checkpoints    # move best checkpoint weights into models/
make fetch-metrics       # fetch SwanLab run metrics to output/results/metrics.csv
make fetch-compute-time  # print total compute time from SwanLab runs
```

```sh
# set DATASET="iam" in src/constants.py
make preprocess-iam
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot
```

```sh
# set DATASET="deepwriting" in src/constants.py
make preprocess-deepwriting
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot
```

- Run commands from the repository root.
- Existing trained models are skipped by default during training; remove saved model files to force retraining.
- Some utility/plot scripts assume prior artifacts exist (trained models, result CSVs, metrics CSVs).