ScribeTokens

Implementation and experiment pipeline for ScribeTokens, the digital-ink tokenization scheme proposed in the paper.

📄 Paper | 📦 Library (tokink)

ScribeTokens represents pen trajectories with a fixed base vocabulary of 10 tokens:

  • 8 directional unit-step tokens (Freeman-style chain directions)
  • 2 pen-state tokens ([DOWN], [UP])

The core idea is to decompose stroke segments into unit pixel steps (via Bresenham decomposition), then apply BPE over this base alphabet. This keeps tokenization OOV-free at the base level while still allowing strong compression and stable cross-entropy training.
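The decomposition above can be sketched as follows. This is an illustrative example, not the project's actual implementation (which lives in src/): it walks a stroke segment with Bresenham's line algorithm and maps each unit step to one of the 8 directional tokens. The Freeman numbering used here (counter-clockwise from east, y increasing upward) is a convention assumption.

```python
def bresenham(x0, y0, x1, y1):
    """Yield the integer points on the line from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        yield x0, y0
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

# Map a unit step (dx, dy) to a Freeman chain-code token 0..7.
# Consecutive Bresenham points always differ by at most 1 per axis,
# so every step falls in this 8-neighborhood: no OOV at the base level.
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def segment_to_tokens(x0, y0, x1, y1):
    """Decompose one stroke segment into direction tokens."""
    pts = list(bresenham(x0, y0, x1, y1))
    return [FREEMAN[(b[0] - a[0], b[1] - a[1])] for a, b in zip(pts, pts[1:])]
```

BPE is then trained over sequences of these base tokens (plus the two pen-state tokens), merging frequent direction runs into larger vocabulary entries.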

Quick Start

1) Install dependencies

uv sync

This project expects Python 3.12 (see pyproject.toml).

2) Python path

Scripts import modules from src/.

  • In VS Code integrated terminal, .vscode/settings.json already sets PYTHONPATH to ${workspaceFolder}/src.
  • Outside VS Code, set it manually:
export PYTHONPATH="$PWD/src"

3) Select dataset

Update DATASET in:

  • src/constants.py

Valid values:

  • "iam"
  • "deepwriting"

Most scripts read dataset paths from this constant, and some preprocessing scripts assert a specific dataset value.

Data Setup

IAM On-Line Handwriting Database

Download from the IAM On-Line Handwriting Database:

  • data/lineStrokes-all.tar.gz — extract into data/iam/raw/lineStrokes/
  • data/original-xml-all.tar.gz — extract into data/iam/raw/original/
  • trainset.txt from "training set"
  • testset_v.txt from "first validation set"
  • testset_t.txt from "second validation set"
  • testset_f.txt from "final test set"

Place the raw data so the directory matches:

data/iam/raw/
├── lineStrokes/            # XML files with stroke coordinates (e.g. a01-000u-01.xml)
│   └── **/*.xml
├── original/               # XML files with form metadata + text labels (e.g. a01-000u.xml)
│   └── **/*.xml
├── trainset.txt            # official train split (line ID prefixes)
├── testset_v.txt           # official validation split 1
├── testset_t.txt           # official validation split 2
└── testset_f.txt           # official test split
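Before running preprocessing, it can help to verify the layout is complete. This small checker is not part of the repo, just an optional convenience:

```python
from pathlib import Path

# Entries expected under data/iam/raw/, per the layout above.
EXPECTED = ["lineStrokes", "original", "trainset.txt",
            "testset_v.txt", "testset_t.txt", "testset_f.txt"]

def check_iam_layout(root: Path) -> list[str]:
    """Return the names of any expected files/directories missing under root."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```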

Then preprocess:

make preprocess-iam

This parses raw XML into data/iam/parsed/*.json and generates standardized split files under data/iam/split/.
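The parsing step reads stroke coordinates out of the IAM XML. As a hedged sketch of what that involves (the real parser is in scripts/preprocess/ and handles more metadata), IAM stroke files contain `<Stroke>` elements holding `<Point x=... y=... time=...>` children:

```python
import xml.etree.ElementTree as ET

def parse_strokes(xml_text: str) -> list[list[tuple[int, int]]]:
    """Extract per-stroke (x, y) point lists from an IAM lineStrokes XML document."""
    root = ET.fromstring(xml_text)
    strokes = []
    for stroke in root.iter("Stroke"):
        strokes.append([(int(p.get("x")), int(p.get("y")))
                        for p in stroke.iter("Point")])
    return strokes
```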

DeepWriting Dataset

Download the "Dataset with timestamps" from DeepWriting.

Place the raw data so the directory matches:

data/deepwriting/raw/
└── Iamondb Dataset/        # JSON files with word-level stroke data
    └── **/*.json

Then preprocess:

make preprocess-deepwriting

This parses JSON into data/deepwriting/parsed/*.json and generates random train/val/test splits under data/deepwriting/split/.
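A random split like this can be sketched as below. The actual logic is in scripts/preprocess/, and the ratios and seed here are illustrative assumptions:

```python
import random

def make_splits(sample_ids, train=0.8, val=0.1, seed=0):
    """Shuffle sample IDs reproducibly and partition into train/val/test."""
    ids = sorted(sample_ids)          # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(ids)  # seeded RNG keeps splits reproducible
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

Each list would then be written out as train.txt, val.txt, and test.txt.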

Processed directory structure (generated)

After preprocessing, each dataset directory will contain:

data/<dataset>/
├── raw/                    # original downloaded data (see above)
├── parsed/                 # one JSON per sample (generated by preprocessing)
│   └── *.json
└── split/                  # train/val/test ID lists (generated by preprocessing)
    ├── train.txt
    ├── val.txt
    └── test.txt

Project Structure

scribe-tokens/
├── data/
│   ├── iam/                # raw, parsed, and split files for IAM
│   └── deepwriting/        # raw, parsed, and split files for DeepWriting
├── models/                 # exported best weights per dataset/task/repr
├── output/
│   ├── results/            # CSV metrics
│   ├── tables/             # LaTeX tables
│   └── figures/            # PDF figures
├── scripts/
│   ├── preprocess/         # parse raw datasets into parsed JSON + splits
│   ├── train/              # model/tokenizer training entrypoints
│   ├── eval/               # tokenizer + HTR/HTG evaluation
│   ├── plot/               # tables + plotting scripts
│   └── utils/              # project utilities
├── src/                    # core library code (models, tokenizers, loaders)
├── tests/                  # unit tests
├── tokenisers/             # trained tokenizers
└── Makefile

Makefile Commands

All script entrypoints are exposed as make targets. Run from repo root:

make <target>
Target                    What it does

Linting
  format                  ruff format
  check                   ruff check --fix
  format-check            Runs both format and check

Preprocessing
  parse-iam               Parse IAM raw XML into data/iam/parsed/
  parse-deepwriting       Parse DeepWriting JSON into data/deepwriting/parsed/
  split-deepwriting       Generate random train/val/test splits for DeepWriting
  preprocess-iam          Runs parse-iam (splits are generated during parsing)
  preprocess-deepwriting  Runs parse-deepwriting, then split-deepwriting

Tokenizer
  train-tokenizers        Train tokenizer families (delta/vocab sweep)
  eval-compression        Evaluate tokenizer compression metrics
  eval-oov                Evaluate tokenizer OOV metrics
  eval-tokenizers         Runs eval-compression and eval-oov

Training
  train                   Train all default model/task combinations
  train-test              Quick test run (--all --test)
  train-parallel          Runs scripts/train/parallel.sh

Evaluation
  eval                    Evaluate all supported tasks
  eval-htr                Evaluate HTR only
  eval-htg                Evaluate HTG only

Plotting
  plot                    Generate all figures and tables
  plot-compression        Compression figure only
  plot-oov                OOV figure only
  plot-discretization     Discretization figure only
  plot-double-descent     Double-descent figure only
  plot-attention          Attention visualization figure(s) only
  plot-convergence        Convergence speedup table only
  plot-results            Result CSV to LaTeX tables only
  plot-scribe             Scribe token visualization figure only
  plot-htg                HTG handwriting sample grid figure only

Utilities
  move-checkpoints        Move best checkpoint weights into models/
  fetch-metrics           Fetch SwanLab run metrics to CSV
  fetch-compute-time      Print total compute time from SwanLab runs
  kill                    Kill processes matching scribe-tokens
  check-cuda              Print CUDA availability/device via PyTorch
  tmux                    Open/attach tmux session named train

Important note on train-parallel

scripts/train/parallel.sh contains a hard-coded project path:

cd /home/ubuntu/projects/scribe-tokens

Update it for your machine before using make train-parallel.

Script Usage

All entrypoints are available as make targets (see table above). The examples below use make where a target exists, and show the underlying command for cases that take arguments.

Preprocessing

# IAM
make preprocess-iam

# DeepWriting
make preprocess-deepwriting

See Data Setup for expected raw directory layouts.

Tokenizer Analysis

# 1) train tokenizer families (delta/vocab sweep)
make train-tokenizers

# 2) evaluate tokenizer quality metrics
make eval-tokenizers

# 3) plot tokenizer metrics
make plot-compression
make plot-oov

Outputs:

  • output/results/compression.csv
  • output/results/oov.csv
  • output/figures/compression.pdf
  • output/figures/oov.pdf
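One common way to define the compression metric (the exact formulation in scripts/eval/ may differ) is base-alphabet tokens per BPE token, aggregated over samples:

```python
def compression_ratio(base_lens: list[int], bpe_lens: list[int]) -> float:
    """Corpus-level compression: total base tokens divided by total BPE tokens."""
    assert len(base_lens) == len(bpe_lens), "one length pair per sample"
    return sum(base_lens) / sum(bpe_lens)
```

Higher is better: a ratio of 3.0 means each BPE token covers three unit-step/pen-state tokens on average.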

Deep Learning Training

# Train all default models/tasks
make train

# Quick test mode
make train-test

# Single model (use the underlying command for CLI args)
uv run -m scripts.train.main --task HTR --repr scribe

# Optional overrides
uv run -m scripts.train.main --all --epochs 50 --batch-size 16

CLI options (from scripts/train/main.py):

  • --all or --task <TASK> (mutually exclusive)
  • --repr <scribe|point5|rel|text> (required when using --task)
  • --test
  • --experiment-name <name>
  • --epochs <int>
  • --batch-size <int>

Supported tasks:

  • HTR, HTG, NTP, HTR_SFT, HTG_SFT
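The CLI surface above can be reconstructed roughly as follows. This is a hedged sketch, not the actual parser in scripts/train/main.py, which may differ in defaults and validation:

```python
import argparse

TASKS = ["HTR", "HTG", "NTP", "HTR_SFT", "HTG_SFT"]
REPRS = ["scribe", "point5", "rel", "text"]

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="ScribeTokens training entrypoint (sketch)")
    # --all and --task cannot be combined
    group = p.add_mutually_exclusive_group(required=True)
    group.add_argument("--all", action="store_true")
    group.add_argument("--task", choices=TASKS)
    p.add_argument("--repr", choices=REPRS)  # required when --task is used
    p.add_argument("--test", action="store_true")
    p.add_argument("--experiment-name")
    p.add_argument("--epochs", type=int)
    p.add_argument("--batch-size", type=int)
    return p
```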

Evaluation

# All supported tasks
make eval

# Individual tasks
make eval-htr
make eval-htg

# One task with args
uv run -m scripts.eval.main --task HTR

Writes CSVs to output/results/ (for example htr.csv, htg.csv).

Tables and Plots

# All figures and tables
make plot

# Individual targets
make plot-compression
make plot-oov
make plot-discretization
make plot-double-descent
make plot-attention
make plot-convergence
make plot-results
make plot-scribe
make plot-htg

# Results table for one task
uv run -m scripts.plot.results --task HTR

Utilities

make move-checkpoints       # move best checkpoint weights into models/
make fetch-metrics          # fetch SwanLab run metrics to output/results/metrics.csv
make fetch-compute-time     # print total compute time from SwanLab runs

Typical Workflow

IAM

# set DATASET="iam" in src/constants.py

make preprocess-iam
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot

DeepWriting

# set DATASET="deepwriting" in src/constants.py

make preprocess-deepwriting
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot

Notes

  • Run commands from the repository root.
  • Existing trained models are skipped by default in training; remove saved model files to force retraining.
  • Some utility/plot scripts assume prior artifacts exist (trained models, result CSVs, metrics CSVs).
