yizhiru/scala-AC

Scala implementation of Aho-Corasick algorithm

A byte-oriented Aho-Corasick implementation in pure Scala. Keywords can carry a user-defined property tag.

Features

  • String APIs always use UTF-8 rather than the platform default charset, so behavior is consistent across platforms
  • Failure links are built automatically on the first search (or call build() eagerly)
  • Array-based goto table and vocabulary indexing for better performance and memory use on large dictionaries
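As a rough illustration of what building failure links on the first search involves, here is a minimal byte-oriented sketch. It is not the library's code: MiniAc and its internals are hypothetical names, and it uses hash maps for transitions instead of the array-based goto table described above, purely to stay short.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable

// Illustrative sketch only. MiniAc is a hypothetical class, not the library's Automaton.
final class MiniAc {
  // One entry per trie node: byte transitions, failure link, and matched (keyword, property) pairs.
  private val trans = mutable.ArrayBuffer(mutable.Map.empty[Byte, Int])
  private val fail  = mutable.ArrayBuffer(0)
  private val out   = mutable.ArrayBuffer(Set.empty[(String, String)])
  private var built = false

  private def newState(): Int = {
    trans += mutable.Map.empty[Byte, Int]
    fail += 0
    out += Set.empty[(String, String)]
    trans.length - 1
  }

  def addWord(property: String, word: String): Unit = {
    var s = 0
    for (b <- word.getBytes(UTF_8)) s = trans(s).getOrElseUpdate(b, newState())
    out(s) = out(s) + ((word, property))
    built = false // failure links must be rebuilt after the trie changes
  }

  // Breadth-first construction of failure links.
  def build(): Unit = {
    val queue = mutable.Queue.empty[Int]
    // Depth-1 states always fall back to the root.
    for (s <- trans(0).values) { fail(s) = 0; queue.enqueue(s) }
    while (queue.nonEmpty) {
      val r = queue.dequeue()
      for ((b, s) <- trans(r)) {
        // Walk the parent's failure chain until a state with a b-transition, or the root.
        var f = fail(r)
        while (f != 0 && !trans(f).contains(b)) f = fail(f)
        fail(s) = trans(f).getOrElse(b, 0)
        out(s) = out(s) ++ out(fail(s)) // inherit keywords that end at the fallback state
        queue.enqueue(s)
      }
    }
    built = true
  }

  def search(text: String): Set[(String, String)] = {
    if (!built) build() // lazy construction on first search
    val hits = mutable.Set.empty[(String, String)]
    var s = 0
    for (b <- text.getBytes(UTF_8)) {
      while (s != 0 && !trans(s).contains(b)) s = fail(s)
      s = trans(s).getOrElse(b, 0)
      hits ++= out(s)
    }
    hits.toSet
  }
}
```

With the keywords "he", "she", "his" and "hers" registered, searching "ushers" in this sketch reaches the state for "she", whose failure link points at "he", which is how both keywords are reported from a single pass over the bytes.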

Requirements

  • Java 11+
  • Maven 3.6+

Build and test

mvn test

Usage

import io.yizhiru.ac.Automaton

val ac = new Automaton
ac.addWords("pronoun", "he", "she", "his", "hers")

// build() is optional; search triggers construction automatically
val results: Set[(String, String)] = ac.search("ushers")
// Set(("he", "pronoun"), ("she", "pronoun"), ("hers", "pronoun"))

Search raw bytes when you already have UTF-8 (or other byte-level) data:

ac.addWordBytes("tag", "he".getBytes(java.nio.charset.StandardCharsets.UTF_8))
ac.searchBytes("ushers".getBytes(java.nio.charset.StandardCharsets.UTF_8))

API notes

| Method | Description |
|---|---|
| addWord(property, word) | Register one UTF-8 keyword |
| addWords(property, words*) | Register multiple keywords with the same property |
| addWordBytes(property, bytes) | Register a keyword from raw bytes |
| build() | Eagerly construct failure links and complete the goto table |
| search(text) | Match UTF-8 text, returns Set[(keyword, property)] |
| searchBytes(data) | Match raw bytes |

setFailTransitions() is deprecated and kept only for backward compatibility; it delegates to build().

Repeated matches of the same (keyword, property) pair in one text are reported only once, because results are returned as a Set.
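The effect of returning a Set can be seen without the library at all. The snippet below is a small sketch: rawHits stands in for the matches an automaton would encounter, and countOccurrences is a hypothetical helper for callers who need counts, which the Set result does not carry.

```scala
// Hypothetical raw hits for the text "he said he would": "he" is matched twice.
val rawHits: Seq[(String, String)] = Seq(("he", "pronoun"), ("he", "pronoun"))

// Collecting into a Set collapses the repeats into a single entry.
val results: Set[(String, String)] = rawHits.toSet

// Hypothetical helper: count (possibly overlapping) occurrences of one keyword yourself.
def countOccurrences(text: String, keyword: String): Int = {
  require(keyword.nonEmpty, "keyword must not be empty")
  var count = 0
  var from = text.indexOf(keyword)
  while (from >= 0) {
    count += 1
    from = text.indexOf(keyword, from + 1)
  }
  count
}
```

If per-occurrence positions or counts matter, they have to be computed separately, for example with a helper like the one above applied to each keyword the search reports.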

Encoding

All string-based methods encode and decode with StandardCharsets.UTF_8. If your input uses another charset, convert to bytes yourself and use addWordBytes / searchBytes.
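To see why the charset matters for a byte-oriented matcher, the sketch below encodes the same word two ways using only the JDK. The commented-out lines show how the byte methods named above would be used with non-UTF-8 data; the "drink" tag and the latin1Text name are assumptions for illustration.

```scala
import java.nio.charset.StandardCharsets

val word = "caf\u00e9" // "café", written with an escape so the source file encoding does not matter

val utf8Bytes   = word.getBytes(StandardCharsets.UTF_8)      // 'é' becomes two bytes: 0xC3 0xA9
val latin1Bytes = word.getBytes(StandardCharsets.ISO_8859_1) // 'é' becomes one byte: 0xE9

// The byte sequences differ, so a keyword registered in one encoding
// cannot match text encoded in the other.
val sameBytes = utf8Bytes.sameElements(latin1Bytes)

// For Latin-1 data, both the dictionary and the input would go through the byte methods:
//   ac.addWordBytes("drink", latin1Bytes)
//   ac.searchBytes(latin1Text) // latin1Text: Array[Byte] in the same charset
```

The important point is that keywords and searched data must be encoded with the same charset before they reach addWordBytes and searchBytes.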
