High-performance, Unicode-aware safe Rust implementation of the Aho–Corasick automaton with fuzzy matching ( insertions, deletions, substitutions, transpositions).
- Exact & Fuzzy Matching: Match literal patterns or allow configurable approximate matching with edit operations ( Levenshtein-style + transposition).
- Unicode-Aware: Operates over grapheme clusters, with optional case-insensitive matching.
- Fine-Grained Limits: Per-pattern caps on insertions, deletions, substitutions, swaps, and total edits.
- Non-Overlapping Selection: Choose a maximal set of non-overlapping matches with configurable heuristics.
- Fuzzy Replacer: Find-and-replace fuzzily while preserving surrounding context.
- Segmentation API: Split input into matched / unmatched segments via
segment_iter/segment_text. - Customizable Scoring: Weighting and penalty tuning for substitution, insertion, deletion, swap.
Add to your Cargo.toml:
[dependencies]
fuzzy-aho-corasick = "0.3"Then in code:
use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, FuzzyLimits};use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, FuzzyLimits};
fn main() {
// Build engine allowing up to 1 edit per match, case-insensitive.
let engine = FuzzyAhoCorasickBuilder::new()
.fuzzy(FuzzyLimits::new().edits(1))
.case_insensitive(true)
.build(["hello", "world"]);
let text = "H3llo W0rld!";
let matches = engine.search_non_overlapping(text, 0.8);
for m in matches.iter() {
println!(
"matched pattern '{}' as '{}' (score {:.2})",
m.pattern, m.text, m.similarity
);
}
// Expected output (approximate):
// matched pattern 'hello' as 'H3llo' (score 0.90)
}Customize fuzzy behavior, penalties, and case folding:
use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, FuzzyLimits, FuzzyPenalties};
fn engine() {
FuzzyAhoCorasickBuilder::new()
.fuzzy(FuzzyLimits::new().edits(2)) // up to 2 total edits per pattern by default
.penalties(
FuzzyPenalties::default()
.substitution(0.7)
.insertion(0.9)
.deletion(0.9)
.swap(1.0),
)
.case_insensitive(true)
.build(["pattern1", "pattern2"])
}By default, all patterns have weight 1.0, but you can adjust per-pattern scoring and fuzzy limits directly via
Pattern before building the automaton. You can also pass tuples or fully constructed Pattern instances to
build(...).
Pattern::weight(f32): set the pattern’s weight (default1.0), affecting its effective similarity score.Pattern::fuzzy(FuzzyLimits): apply per-pattern edit limits.Pattern::custom_unique_id(usize): give a stable identity to the pattern for uniqueness-aware matching (non_overlapping_unique).
Weighting (e.g., give importance to certain patterns) and case-insensitive Greek:
fn main() {
use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, Pattern};
// Build from tuple list; each tuple is (pattern, weight).
let engine = FuzzyAhoCorasickBuilder::new()
.case_insensitive(true)
.build([("Γειά", 1.0), ("σου", 1.0)]);
}Customizing patterns with fuzzy limits, unique IDs, and weights:
fn main() {
use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, Pattern, FuzzyLimits};
let p1 = Pattern::new("error")
.weight(2.0) // boost this pattern
.fuzzy(FuzzyLimits::new().edits(1))
.custom_unique_id(42);
let p2 = Pattern::new("warning")
.weight(1.0)
.fuzzy(FuzzyLimits::new().edits(2));
// Build engine from explicit Pattern objects.
let engine = FuzzyAhoCorasickBuilder::new()
.build([p1, p2]);
}These allow fine-grained control over ranking, deduplication, and fuzzy tolerance on a per-pattern basis.
There are helpers to control ordering and overlap resolution:
default_sort(): Prioritizes higher similarity, longer patterns, then earlier position.greedy_sort(): Prefers longer patterns first, then similarity.non_overlapping(): Drops overlapping matches greedily in the current order.non_overlapping_unique(): Same as above but ensures each pattern (respectingcustom_unique_id) is used at most once.
Convenience entrypoints:
search(...): Applies default sort and returns non-overlapping matches.search_greedy(...): Applies greedy sort.search_non_overlapping(...)/search_non_overlapping_unique(...): Variants that combine sorting + deduplication semantics.
Break text into matched/unmatched pieces and reassemble with intelligent spacing:
fn main() {
let engine = FuzzyAhoCorasickBuilder::new()
.fuzzy(FuzzyLimits::new().edits(1))
.build(["input", "more"]);
let matches = engine.search_non_overlapping("someinptandm0re", 0.75);
let segmented_text = matches.segment_text();
assert_eq!(segmented_text, "some inpt and m0re");
}You can treat each fuzzy match as a delimiter and collect the unmatched pieces:
-
FuzzyMatches::split()
Splits the already‐segmented stream, returning aVec<String>of allUnmatchedparts (including empty ones if matches touch the ends). -
FuzzyAhoCorasick::split(haystack, threshold)
Convenience: runssearch_non_overlapping(haystack, threshold)and immediately callssplit()on the result.
Example
use fuzzy_aho_corasick::{FuzzyAhoCorasickBuilder, FuzzyLimits};
fn main() {
let engine = FuzzyAhoCorasickBuilder::new()
.fuzzy(FuzzyLimits::new().edits(1))
.case_insensitive(true)
.build(["FOO", "BAR"]);
// Treat each fuzzy match (≥0.8) as a separator:
let parts = engine.split("xxFOOyyBARzz", 0.8);
assert_eq!(parts, vec!["xx", "yy", "zz"]);
}Once you have a FuzzyMatches (for example, from search_non_overlapping), these handy methods let you trim or
transform the matched segments:
-
replace(callback)Walks through matches left-to-right (skipping overlaps) and invokes yourFn(&FuzzyMatch) -> Option<&str>for each one; if the callback returnsSome(repl), that span is replaced withrepl, otherwise the original matched text is preserved. See Fuzzy Replacer section for a basic replacer. -
strip_prefix()Drops all leading fuzzy-matched segments and any whitespace-only unmatched segments, then trims leading spaces from the first kept unmatched segment and returns the rest of the text. -
strip_postfix()Removes all trailing fuzzy-matched segments and any whitespace-only unmatched segments, then trims trailing spaces from the last kept unmatched segment and returns the leading portion of the text.
Perform fuzzy find-and-replace with a mapping. Non-overlapping matches are chosen automatically based on default heuristics.
fn main() {
let replacer = FuzzyAhoCorasickBuilder::new().build_replacer([
("foo", "bar"),
("baz", "qux"),
]);
let out = replacer.replace("F00 and BAZ!", 0.8);
assert_eq!(out, "bar and qux!");
}- Prefer filtering by similarity threshold early to prune low-quality candidates.
- Tune
FuzzyLimitsper pattern when you know expected error characteristics. - Custom
FuzzyPenaltiescan shape whether substitutions or insertions are “cheaper” in ambiguous regions. - Use
search_non_overlapping_*variants to avoid post-processing overlapping match resolution.
- Too many false positives: Raise the similarity threshold or tighten per-pattern limits.
- Missing fuzzy matches: Lower the threshold, increase allowed edits, or adjust penalties to make substitutions less punitive.
- Overlapping matches unwanted: Use
search_non_overlappingorsearch_non_overlapping_uniqueinstead of rawsearch_unsorted.
See examples/ (if present) for real-world usage patterns:
- Fuzzy deduplication pipelines
- Fuzzy find-and-replace in human-entered text
- Entity extraction with per-entity error tolerance
Run the test suite with:
cargo testDistributed under the MIT License. See LICENSE for details.
Based on a research paper — Fuzzified Aho–Corasick Search Automata by Z. Horák, V. Snášel, A. Abraham, and A. E. Hassanien.