A byte-oriented Aho-Corasick implementation in pure Scala. Keywords can carry a user-defined property tag.
- UTF-8 is used explicitly for string APIs, so behavior is consistent across platforms
- Failure links are built automatically on the first search (or call `build()` eagerly)
- Array-based goto table and vocabulary indexing for better performance and memory use on large dictionaries

Requirements:

- Java 11+
- Maven 3.6+

```bash
mvn test
```

Basic usage:

```scala
import io.yizhiru.ac.Automaton

val ac = new Automaton
ac.addWords("pronoun", "he", "she", "his", "hers")
// build() is optional; search triggers construction automatically
val results: Set[(String, String)] = ac.search("ushers")
// Set(("he", "pronoun"), ("she", "pronoun"), ("hers", "pronoun"))
```

Search raw bytes when you already have UTF-8 (or other byte-level) data:

```scala
ac.addWordBytes("tag", "he".getBytes(java.nio.charset.StandardCharsets.UTF_8))
ac.searchBytes("ushers".getBytes(java.nio.charset.StandardCharsets.UTF_8))
```

| Method | Description |
|---|---|
| `addWord(property, word)` | Register one UTF-8 keyword |
| `addWords(property, words*)` | Register multiple keywords with the same property |
| `addWordBytes(property, bytes)` | Register a keyword from raw bytes |
| `build()` | Eagerly construct failure links and complete the goto table |
| `search(text)` | Match UTF-8 text; returns `Set[(keyword, property)]` |
| `searchBytes(data)` | Match raw bytes |
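
To make the `build()` row above more concrete, here is a minimal sketch of a byte-oriented Aho-Corasick automaton with an array-based goto table. It is an illustration only, not this library's source: the class name `ByteAcSketch` and all of its internals are hypothetical, and it assumes that words are not added after the table has been built.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable

// Hypothetical illustration; NOT the library's Automaton implementation.
final class ByteAcSketch {
  // One 256-entry row per state; -1 marks a missing edge until build() runs.
  private val gotoTable = mutable.ArrayBuffer(Array.fill(256)(-1))
  private val fail = mutable.ArrayBuffer(0)
  private val out = mutable.ArrayBuffer(Set.empty[(String, String)])
  private var built = false

  private def newState(): Int = {
    gotoTable += Array.fill(256)(-1)
    fail += 0
    out += Set.empty[(String, String)]
    fail.length - 1
  }

  def addWord(property: String, word: String): Unit = {
    // Assumption of this sketch: the trie is frozen once build() has run.
    require(!built, "this sketch does not support adding words after build()")
    var s = 0
    for (b <- word.getBytes(UTF_8)) {
      val c = b & 0xff // treat the signed byte as an index 0..255
      if (gotoTable(s)(c) < 0) {
        val n = newState()
        gotoTable(s)(c) = n
      }
      s = gotoTable(s)(c)
    }
    out(s) = out(s) + ((word, property))
  }

  // Breadth-first pass: set failure links and fill every missing edge,
  // so that searching needs exactly one table lookup per input byte.
  def build(): Unit = if (!built) {
    val queue = mutable.Queue.empty[Int]
    for (c <- 0 until 256) {
      val t = gotoTable(0)(c)
      if (t < 0) gotoTable(0)(c) = 0 else queue.enqueue(t)
    }
    while (queue.nonEmpty) {
      val s = queue.dequeue()
      for (c <- 0 until 256) {
        val t = gotoTable(s)(c)
        if (t < 0) gotoTable(s)(c) = gotoTable(fail(s))(c)
        else {
          fail(t) = gotoTable(fail(s))(c)
          out(t) = out(t) ++ out(fail(t)) // inherit matches ending at the suffix state
          queue.enqueue(t)
        }
      }
    }
    built = true
  }

  def search(text: String): Set[(String, String)] = {
    build() // lazy construction on first search
    var s = 0
    var found = Set.empty[(String, String)]
    for (b <- text.getBytes(UTF_8)) {
      s = gotoTable(s)(b & 0xff)
      found = found ++ out(s)
    }
    found
  }
}
```

In this sketch, registering "he", "she", "his" and "hers" under "pronoun" and searching "ushers" yields the same three pairs as the usage example: the state reached after "she" inherits "he" through its failure link, and the completed table then moves from that state straight into the "her" branch.
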

`setFailTransitions()` is deprecated and kept only for backward compatibility; it delegates to `build()`.

Repeated matches of the same `(keyword, property)` pair within one text are reported only once, because results are returned as a `Set`.
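
The effect of returning a `Set` can be seen in a small standalone sketch. The `occurrences` helper below is hypothetical (a naive scan, not part of this library); it records one hit per occurrence so the collapse into a set is visible.

```scala
object DedupSketch {
  // Hypothetical helper: naive scan that yields one (keyword, property)
  // pair for every position at which a keyword starts.
  def occurrences(text: String, keywords: Map[String, String]): Seq[(String, String)] =
    for {
      i <- 0 until text.length
      (word, property) <- keywords.toSeq
      if text.startsWith(word, i)
    } yield (word, property)
}
```

For the text "hehe" and the single keyword "he", `occurrences` returns two identical pairs, while calling `.toSet` on that result leaves one. Positions and counts are not recoverable from the set.
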

All string-based methods encode and decode with `StandardCharsets.UTF_8`. If your input uses another charset, convert to bytes yourself and use `addWordBytes` / `searchBytes`.
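
As a rough illustration of why the keyword and the text must be encoded with the same charset, the sketch below compares ISO-8859-1 and UTF-8 encodings of the same word. `CharsetSketch` and its `containsBytes` method are hypothetical stand-ins for a byte-level search and are not part of this library.

```scala
import java.nio.charset.Charset

object CharsetSketch {
  // Encode a string with an explicit charset instead of the platform default.
  def encode(s: String, charset: Charset): Array[Byte] = s.getBytes(charset)

  // Hypothetical stand-in for a byte-level search:
  // true if `needle` occurs as a contiguous run inside `haystack`.
  def containsBytes(haystack: Array[Byte], needle: Array[Byte]): Boolean =
    haystack.toSeq.containsSlice(needle.toSeq)
}
```

Encoding "caf\u00e9" gives 4 bytes in ISO-8859-1 and 5 bytes in UTF-8, so a UTF-8 keyword is not found in ISO-8859-1 text even though the strings match. With the library, arrays produced this way would be passed to `addWordBytes` and `searchBytes`.
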