Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support. (Spell checking derived from hfst-ospell)

divvunspell

A fast, feature-rich spell checking library and toolset for HFST-based spell checkers. Written in Rust, divvunspell is a modern reimplementation and extension of hfst-ospell with additional features like parallel processing, comprehensive tokenization, case handling, and morphological analysis.

Features

  • High Performance: Memory-mapped transducers and parallel suggestion generation
  • ZHFST/BHFST Support: Load standard HFST spell checker archives
  • Smart Tokenization: Unicode-aware word boundary detection with customizable alphabets
  • Case Handling: Intelligent case preservation and suggestion recasing
  • Morphological Analysis: Extract and filter suggestions based on morphological tags
  • Cross-Platform: Works on macOS, Linux, Windows, iOS and Android
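To make the case handling feature more concrete, here is a minimal, standard-library-only sketch of what "suggestion recasing" can mean: detect the casing pattern of the input word and re-apply it to a lowercase suggestion. The names `Case`, `detect_case` and `recase` are hypothetical and are not part of the divvunspell API; the library's actual rules may well be more elaborate.

```rust
/// Casing patterns recognised by this sketch (hypothetical, not divvunspell's types).
#[derive(Debug, PartialEq)]
pub enum Case {
    Lower,
    Upper,
    Title,
    Mixed,
}

/// Classify a word by looking only at its alphabetic characters.
pub fn detect_case(word: &str) -> Case {
    let letters: Vec<char> = word.chars().filter(|c| c.is_alphabetic()).collect();
    if letters.is_empty() {
        return Case::Mixed;
    }
    if letters.iter().all(|c| c.is_lowercase()) {
        Case::Lower
    } else if letters.iter().all(|c| c.is_uppercase()) {
        Case::Upper
    } else if letters[0].is_uppercase() && letters[1..].iter().all(|c| c.is_lowercase()) {
        Case::Title
    } else {
        Case::Mixed
    }
}

/// Re-apply the casing pattern of `input` to `suggestion`.
/// Lowercase and mixed-case inputs leave the suggestion untouched.
pub fn recase(input: &str, suggestion: &str) -> String {
    match detect_case(input) {
        Case::Upper => suggestion.to_uppercase(),
        Case::Title => {
            let mut chars = suggestion.chars();
            match chars.next() {
                // Uppercase the first character and keep the rest as-is.
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        }
        Case::Lower | Case::Mixed => suggestion.to_string(),
    }
}
```

Under these assumptions, `recase("Wordd", "word")` yields `Word` and `recase("WORDD", "word")` yields `WORD`, so the suggestion keeps the shape of what the user typed.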

Quick Start

As a Command-Line Tool

# Install the CLI
cargo install divvunspell-cli

# Check spelling and get suggestions
divvunspell suggest --archive speller.zhfst --json "sámi"

As a Rust Library

Add to your Cargo.toml:

[dependencies]
divvunspell = "1.0.0-beta.7"

Basic usage:

use divvunspell::archive::{SpellerArchive, ZipSpellerArchive};
use divvunspell::speller::Speller;

// Load a spell checker archive
let archive = ZipSpellerArchive::open("language.zhfst")?;
let speller = archive.speller();

// Check if a word is correct
if !speller.clone().is_correct("wordd") {
    // Get spelling suggestions using the default configuration
    let suggestions = speller.clone().suggest("wordd");

    for suggestion in suggestions {
        println!("{} (weight: {})", suggestion.value, suggestion.weight);
    }
}

// Morphological analysis
let analyses = speller.analyze_input("running");
for analysis in analyses {
    println!("{}", analysis.value); // e.g., "run+V+PresPartc"
}

Command-Line Tools

divvunspell

The main spell checking tool with support for suggestions, analysis, and tokenization.

# Get suggestions for a word
divvunspell suggest --archive language.zhfst "wordd"

# Always show suggestions even for correct words
divvunspell suggest --archive language.zhfst --always-suggest "word"

# Limit number and weight of suggestions
divvunspell suggest --archive language.zhfst --nbest 5 --weight 20.0 "wordd"

# JSON output
divvunspell suggest --archive language.zhfst --json "wordd"

# Tokenize text
divvunspell tokenize --archive language.zhfst "This is some text."

# Analyze word forms morphologically
divvunspell analyze-input --archive language.zhfst "running"
divvunspell analyze-output --archive language.zhfst "runing"

Options:

  • -a, --archive <FILE> - BHFST or ZHFST archive to use
  • -S, --always-suggest - Show suggestions even if word is correct
  • -w, --weight <WEIGHT> - Maximum weight limit for suggestions
  • -n, --nbest <N> - Maximum number of suggestions to return
  • --no-reweighting - Disable suggestion reweighting (closer to hfst-ospell behavior)
  • --no-recase - Disable case-aware suggestion handling
  • --json - Output results as JSON
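The effect of combining `--weight` and `--nbest` can be pictured with a small standalone sketch. It assumes that a lower weight means a better suggestion (as in HFST's weighted transducers) and that the weight limit is applied before the count limit. The `Suggestion` struct here only mirrors the `value` and `weight` fields used in the library example, and `limit_suggestions` is a hypothetical helper, not a divvunspell function.

```rust
/// Stand-in for a suggestion: a candidate word and its weight.
/// (Illustrative only; not the divvunspell type.)
#[derive(Debug, Clone, PartialEq)]
pub struct Suggestion {
    pub value: String,
    pub weight: f32,
}

/// Hypothetical helper: drop suggestions heavier than `max_weight`,
/// order the rest from lightest to heaviest, then keep at most `n_best`.
pub fn limit_suggestions(
    mut suggestions: Vec<Suggestion>,
    n_best: Option<usize>,
    max_weight: Option<f32>,
) -> Vec<Suggestion> {
    if let Some(limit) = max_weight {
        suggestions.retain(|s| s.weight <= limit);
    }
    // Assumption: lower weight = more likely correction.
    suggestions.sort_by(|a, b| a.weight.total_cmp(&b.weight));
    if let Some(n) = n_best {
        suggestions.truncate(n);
    }
    suggestions
}
```

With `n_best = Some(5)` and `max_weight = Some(20.0)`, this corresponds to the `--nbest 5 --weight 20.0` invocation shown earlier: anything heavier than 20.0 is discarded, and at most five of the remaining candidates are returned.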

Debugging:

Set RUST_LOG=trace to enable detailed logging:

RUST_LOG=trace divvunspell suggest --archive language.zhfst "wordd"

thfst-tools

Convert HFST and ZHFST files to optimized THFST and BHFST formats.

THFST (Tromsø-Helsinki FST): A byte-aligned HFST format optimized for fast loading and memory mapping, required for ARM processors.

BHFST (Box HFST): THFST files packaged in a box container with JSON metadata for efficient processing.

# Convert HFST to THFST
thfst-tools hfst-to-thfst acceptor.hfst acceptor.thfst

# Convert ZHFST to BHFST (recommended for distribution)
thfst-tools zhfst-to-bhfst language.zhfst language.bhfst

# Convert THFST pair to BHFST
thfst-tools thfsts-to-bhfst --errmodel errmodel.thfst --lexicon lexicon.thfst output.bhfst

# View BHFST metadata
thfst-tools bhfst-info language.bhfst

accuracy

Test spell checker accuracy against known typo/correction pairs.

# Install
cd crates/accuracy
cargo install --path .

# Run accuracy test
accuracy typos.tsv language.zhfst

# Save detailed JSON report
accuracy -o report.json typos.tsv language.zhfst

# Limit test size and save TSV summary
accuracy -w 1000 -t results.tsv typos.tsv language.zhfst

# Use custom config
accuracy -c config.json typos.tsv language.zhfst

Input format (typos.tsv): Tab-separated values with typo in first column, expected correction in second:

wordd    word
recieve  receive
teh      the
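For anyone generating or validating such a file programmatically, the format can be read with a few lines of standard-library Rust. This is a sketch only: `parse_typos` is a hypothetical helper, and skipping blank lines and lines without a second column is an assumption, not necessarily how the `accuracy` tool treats malformed input.

```rust
/// Parse typo/correction pairs from tab-separated text.
/// Column 1 is the typo, column 2 the expected correction;
/// any further columns are ignored.
pub fn parse_typos(input: &str) -> Vec<(String, String)> {
    input
        .lines()
        // Assumption: blank lines carry no data and are skipped.
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let typo = cols.next()?.trim();
            // Lines with no second column yield `None` and are dropped.
            let expected = cols.next()?.trim();
            Some((typo.to_string(), expected.to_string()))
        })
        .collect()
}
```

The text passed to `parse_typos` would typically come from `std::fs::read_to_string("typos.tsv")`.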

Accuracy viewer (prototype web UI):

accuracy -o support/accuracy-viewer/public/report.json typos.tsv language.zhfst
cd support/accuracy-viewer
npm i && npm run dev
# Open http://localhost:5000

Building from Source

Install Rust

curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
rustup default stable

Build Everything

# Build all crates
cargo build --release

# Install specific tools
cargo install --path ./cli          # divvunspell CLI
cargo install --path ./crates/thfst-tools
cargo install --path ./crates/accuracy

Run Tests

cargo test

Documentation

License

The divvunspell library is dual-licensed under:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

You may choose either license for library use.

The command-line tools (divvunspell, thfst-tools, accuracy) are licensed under GPL-3.0 (LICENSE-GPL).
