ScribeTokens

Implementation and experiment pipeline for ScribeTokens, the digital-ink tokenization scheme proposed in the paper.

📄 Paper | 📦 Library (tokink)

ScribeTokens represents pen trajectories with a fixed base vocabulary of 10 tokens:

  • 8 directional unit-step tokens (Freeman-style chain directions)
  • 2 pen-state tokens ([DOWN], [UP])

The core idea is to decompose stroke segments into unit pixel steps (via Bresenham decomposition), then apply BPE over this base alphabet. This keeps tokenization OOV-free at the base level while still allowing strong compression and stable cross-entropy training.
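The decomposition above can be sketched as follows. This is an illustrative example, not the project's actual implementation (which lives in src/): it walks a stroke segment with Bresenham's line algorithm and maps each unit step to one of the 8 directional tokens. The Freeman numbering used here (counter-clockwise from east, y increasing upward) is a convention assumption.

```python
def bresenham(x0, y0, x1, y1):
    """Yield the integer points on the line from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        yield x0, y0
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

# Map a unit step (dx, dy) to a Freeman chain-code token 0..7.
# Consecutive Bresenham points always differ by at most 1 per axis,
# so every step falls in this 8-neighborhood: no OOV at the base level.
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def segment_to_tokens(x0, y0, x1, y1):
    """Decompose one stroke segment into direction tokens."""
    pts = list(bresenham(x0, y0, x1, y1))
    return [FREEMAN[(b[0] - a[0], b[1] - a[1])] for a, b in zip(pts, pts[1:])]
```

BPE is then trained over sequences of these base tokens (plus the two pen-state tokens), merging frequent direction runs into larger vocabulary entries.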

Quick Start

1) Install dependencies

uv sync

This project expects Python 3.12 (see pyproject.toml).

2) Python path

Scripts import modules from src/.

  • In VS Code integrated terminal, .vscode/settings.json already sets PYTHONPATH to ${workspaceFolder}/src.
  • Outside VS Code, set it manually:
export PYTHONPATH="$PWD/src"

3) Select dataset

Update DATASET in:

  • src/constants.py

Valid values:

  • "iam"
  • "deepwriting"

Most scripts read dataset paths from this constant, and some preprocessing scripts assert a specific dataset value.

Data Setup

IAM On-Line Handwriting Database

Download from the IAM On-Line Handwriting Database:

  • data/lineStrokes-all.tar.gz — extract into data/iam/raw/lineStrokes/
  • data/original-xml-all.tar.gz — extract into data/iam/raw/original/
  • trainset.txt from "training set"
  • testset_v.txt from "first validation set"
  • testset_t.txt from "second validation set"
  • testset_f.txt from "final test set"

Place the raw data so the directory matches:

data/iam/raw/
├── lineStrokes/            # XML files with stroke coordinates (e.g. a01-000u-01.xml)
│   └── **/*.xml
├── original/               # XML files with form metadata + text labels (e.g. a01-000u.xml)
│   └── **/*.xml
├── trainset.txt            # official train split (line ID prefixes)
├── testset_v.txt           # official validation split 1
├── testset_t.txt           # official validation split 2
└── testset_f.txt           # official test split
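Before running preprocessing, it can help to verify the layout is complete. This small checker is not part of the repo, just an optional convenience:

```python
from pathlib import Path

# Entries expected under data/iam/raw/, per the layout above.
EXPECTED = ["lineStrokes", "original", "trainset.txt",
            "testset_v.txt", "testset_t.txt", "testset_f.txt"]

def check_iam_layout(root: Path) -> list[str]:
    """Return the names of any expected files/directories missing under root."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```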

Then preprocess:

make preprocess-iam

This parses raw XML into data/iam/parsed/*.json and generates standardized split files under data/iam/split/.
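The parsing step reads stroke coordinates out of the IAM XML. As a hedged sketch of what that involves (the real parser is in scripts/preprocess/ and handles more metadata), IAM stroke files contain `<Stroke>` elements holding `<Point x=... y=... time=...>` children:

```python
import xml.etree.ElementTree as ET

def parse_strokes(xml_text: str) -> list[list[tuple[int, int]]]:
    """Extract per-stroke (x, y) point lists from an IAM lineStrokes XML document."""
    root = ET.fromstring(xml_text)
    strokes = []
    for stroke in root.iter("Stroke"):
        strokes.append([(int(p.get("x")), int(p.get("y")))
                        for p in stroke.iter("Point")])
    return strokes
```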

DeepWriting Dataset

Download the "Dataset with timestamps" from DeepWriting.

Place the raw data so the directory matches:

data/deepwriting/raw/
└── Iamondb Dataset/        # JSON files with word-level stroke data
    └── **/*.json

Then preprocess:

make preprocess-deepwriting

This parses JSON into data/deepwriting/parsed/*.json and generates random train/val/test splits under data/deepwriting/split/.
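A random split like this can be sketched as below. The actual logic is in scripts/preprocess/, and the ratios and seed here are illustrative assumptions:

```python
import random

def make_splits(sample_ids, train=0.8, val=0.1, seed=0):
    """Shuffle sample IDs reproducibly and partition into train/val/test."""
    ids = sorted(sample_ids)          # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(ids)  # seeded RNG keeps splits reproducible
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

Each list would then be written out as train.txt, val.txt, and test.txt.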

Processed directory structure (generated)

After preprocessing, each dataset directory will contain:

data/<dataset>/
├── raw/                    # original downloaded data (see above)
├── parsed/                 # one JSON per sample (generated by preprocessing)
│   └── *.json
└── split/                  # train/val/test ID lists (generated by preprocessing)
    ├── train.txt
    ├── val.txt
    └── test.txt

Project Structure

scribe-tokens/
├── data/
│   ├── iam/                # raw, parsed, and split files for IAM
│   └── deepwriting/        # raw, parsed, and split files for DeepWriting
├── models/                 # exported best weights per dataset/task/repr
├── output/
│   ├── results/            # CSV metrics
│   ├── tables/             # LaTeX tables
│   └── figures/            # PDF figures
├── scripts/
│   ├── preprocess/         # parse raw datasets into parsed JSON + splits
│   ├── train/              # model/tokenizer training entrypoints
│   ├── eval/               # tokenizer + HTR/HTG evaluation
│   ├── plot/               # tables + plotting scripts
│   └── utils/              # project utilities
├── src/                    # core library code (models, tokenizers, loaders)
├── tests/                  # unit tests
├── tokenisers/             # trained tokenizers
└── Makefile

Makefile Commands

All script entrypoints are exposed as make targets. Run from repo root:

make <target>
Target                    What it does

Linting
  format                  ruff format
  check                   ruff check --fix
  format-check            Runs both format and check

Preprocessing
  parse-iam               Parse IAM raw XML into data/iam/parsed/
  parse-deepwriting       Parse DeepWriting JSON into data/deepwriting/parsed/
  split-deepwriting       Generate random train/val/test splits for DeepWriting
  preprocess-iam          Runs parse-iam (splits are generated during parsing)
  preprocess-deepwriting  Runs parse-deepwriting, then split-deepwriting

Tokenizer
  train-tokenizers        Train tokenizer families (delta/vocab sweep)
  eval-compression        Evaluate tokenizer compression metrics
  eval-oov                Evaluate tokenizer OOV metrics
  eval-tokenizers         Runs eval-compression and eval-oov

Training
  train                   Train all default model/task combinations
  train-test              Quick test run (--all --test)
  train-parallel          Runs scripts/train/parallel.sh

Evaluation
  eval                    Evaluate all supported tasks
  eval-htr                Evaluate HTR only
  eval-htg                Evaluate HTG only

Plotting
  plot                    Generate all figures and tables
  plot-compression        Compression figure only
  plot-oov                OOV figure only
  plot-discretization     Discretization figure only
  plot-double-descent     Double-descent figure only
  plot-attention          Attention visualization figure(s) only
  plot-convergence        Convergence speedup table only
  plot-results            Result CSV to LaTeX tables only
  plot-scribe             Scribe token visualization figure only
  plot-htg                HTG handwriting sample grid figure only

Utilities
  move-checkpoints        Move best checkpoint weights into models/
  fetch-metrics           Fetch SwanLab run metrics to CSV
  fetch-compute-time      Print total compute time from SwanLab runs
  kill                    Kill processes matching scribe-tokens
  check-cuda              Print CUDA availability/device via PyTorch
  tmux                    Open/attach tmux session named train

Important note on train-parallel

scripts/train/parallel.sh contains a hard-coded project path:

cd /home/ubuntu/projects/scribe-tokens

Update it for your machine before using make train-parallel.

Script Usage

All entrypoints are available as make targets (see table above). The examples below use make where a target exists, and show the underlying command for cases that take arguments.

Preprocessing

# IAM
make preprocess-iam

# DeepWriting
make preprocess-deepwriting

See Data Setup for expected raw directory layouts.

Tokenizer Analysis

# 1) train tokenizer families (delta/vocab sweep)
make train-tokenizers

# 2) evaluate tokenizer quality metrics
make eval-tokenizers

# 3) plot tokenizer metrics
make plot-compression
make plot-oov

Outputs:

  • output/results/compression.csv
  • output/results/oov.csv
  • output/figures/compression.pdf
  • output/figures/oov.pdf
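One common way to define the compression metric (the exact formulation in scripts/eval/ may differ) is base-alphabet tokens per BPE token, aggregated over samples:

```python
def compression_ratio(base_lens: list[int], bpe_lens: list[int]) -> float:
    """Corpus-level compression: total base tokens divided by total BPE tokens."""
    assert len(base_lens) == len(bpe_lens), "one length pair per sample"
    return sum(base_lens) / sum(bpe_lens)
```

Higher is better: a ratio of 3.0 means each BPE token covers three unit-step/pen-state tokens on average.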

Deep Learning Training

# Train all default models/tasks
make train

# Quick test mode
make train-test

# Single model (use the underlying command for CLI args)
uv run -m scripts.train.main --task HTR --repr scribe

# Optional overrides
uv run -m scripts.train.main --all --epochs 50 --batch-size 16

CLI options (from scripts/train/main.py):

  • --all or --task <TASK> (mutually exclusive)
  • --repr <scribe|point5|rel|text> (required when using --task)
  • --test
  • --experiment-name <name>
  • --epochs <int>
  • --batch-size <int>

Supported tasks:

  • HTR, HTG, NTP, HTR_SFT, HTG_SFT
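The CLI surface above can be reconstructed roughly as follows. This is a hedged sketch, not the actual parser in scripts/train/main.py, which may differ in defaults and validation:

```python
import argparse

TASKS = ["HTR", "HTG", "NTP", "HTR_SFT", "HTG_SFT"]
REPRS = ["scribe", "point5", "rel", "text"]

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="ScribeTokens training entrypoint (sketch)")
    # --all and --task cannot be combined
    group = p.add_mutually_exclusive_group(required=True)
    group.add_argument("--all", action="store_true")
    group.add_argument("--task", choices=TASKS)
    p.add_argument("--repr", choices=REPRS)  # required when --task is used
    p.add_argument("--test", action="store_true")
    p.add_argument("--experiment-name")
    p.add_argument("--epochs", type=int)
    p.add_argument("--batch-size", type=int)
    return p
```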

Evaluation

# All supported tasks
make eval

# Individual tasks
make eval-htr
make eval-htg

# One task with args
uv run -m scripts.eval.main --task HTR

Writes CSVs to output/results/ (for example htr.csv, htg.csv).

Tables and Plots

# All figures and tables
make plot

# Individual targets
make plot-compression
make plot-oov
make plot-discretization
make plot-double-descent
make plot-attention
make plot-convergence
make plot-results
make plot-scribe
make plot-htg

# Results table for one task
uv run -m scripts.plot.results --task HTR

Utilities

make move-checkpoints       # move best checkpoint weights into models/
make fetch-metrics          # fetch SwanLab run metrics to output/results/metrics.csv
make fetch-compute-time     # print total compute time from SwanLab runs

Typical Workflow

IAM

# set DATASET="iam" in src/constants.py

make preprocess-iam
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot

DeepWriting

# set DATASET="deepwriting" in src/constants.py

make preprocess-deepwriting
make train-tokenizers
make eval-tokenizers
make train
make eval
make plot

Notes

  • Run commands from the repository root.
  • Existing trained models are skipped by default in training; remove saved model files to force retraining.
  • Some utility/plot scripts assume prior artifacts exist (trained models, result CSVs, metrics CSVs).
