@ptr07 ptr07 commented Nov 25, 2025

What

This commit adds support for Polish language stemming.

Why

The rust-stemmers crate used until now is unmaintained, which blocked the addition of new languages. This change addresses a user request for Polish stemming to improve BM25 recall in their use case. The tantivy-stemmers crate is a maintained alternative that also makes it possible to support many other languages in the future.

How

  • Added the tantivy-stemmers crate as a dependency to the workspace, alongside the existing rust-stemmers dependency (for backward compatibility)
  • Introduced an internal enum that can hold an algorithm from either rust-stemmers or tantivy-stemmers
  • Added Polish to the main Language enum, mapped to the new tantivy-stemmers implementation
  • Updated the token stream to handle both types of stemmers internally
  • Added the POLISH variant to the stopwords list
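
The dispatch described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the toy suffix-stripping functions stand in for the real rust-stemmers and tantivy-stemmers algorithms, the short stop-word lists are illustrative samples, and the names `StemFn`, `StemmerAlgorithm::Legacy`, `StemmerAlgorithm::Tantivy`, `algorithm`, `stopwords`, and `stem_tokens` are all hypothetical.

```rust
use std::borrow::Cow;

// Hypothetical common signature for both backends (an assumption for this
// sketch; the real crates expose their own types).
type StemFn = for<'a> fn(&'a str) -> Cow<'a, str>;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Language {
    English,
    Polish,
}

// Internal enum that can hold an algorithm from either backend.
#[derive(Clone, Copy)]
enum StemmerAlgorithm {
    Legacy(StemFn),  // stands in for a rust-stemmers algorithm
    Tantivy(StemFn), // stands in for a tantivy-stemmers algorithm
}

// Toy stemmer: strips a trailing "s". Not a real English stemmer.
fn toy_english_stem(word: &str) -> Cow<'_, str> {
    match word.strip_suffix('s') {
        Some(stem) if !stem.is_empty() => Cow::Borrowed(stem),
        _ => Cow::Borrowed(word),
    }
}

// Toy stemmer: strips a few common endings. Not a real Polish stemmer.
fn toy_polish_stem(word: &str) -> Cow<'_, str> {
    for suffix in ["ami", "ach", "ów"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            if !stem.is_empty() {
                return Cow::Borrowed(stem);
            }
        }
    }
    Cow::Borrowed(word)
}

impl Language {
    // Map each public language to the backend that implements it.
    fn algorithm(self) -> StemmerAlgorithm {
        match self {
            Language::English => StemmerAlgorithm::Legacy(toy_english_stem),
            Language::Polish => StemmerAlgorithm::Tantivy(toy_polish_stem),
        }
    }

    // Tiny sample stop-word lists, for illustration only.
    fn stopwords(self) -> &'static [&'static str] {
        match self {
            Language::English => &["the", "a", "and"],
            Language::Polish => &["i", "w", "z", "na"],
        }
    }
}

impl StemmerAlgorithm {
    // The token stream calls this without knowing which backend is used.
    fn stem<'a>(self, token: &'a str) -> Cow<'a, str> {
        match self {
            StemmerAlgorithm::Legacy(f) | StemmerAlgorithm::Tantivy(f) => f(token),
        }
    }
}

// Lowercase, drop stop words, then stem each remaining token.
fn stem_tokens(lang: Language, text: &str) -> Vec<String> {
    let algo = lang.algorithm();
    let stop = lang.stopwords();
    text.split_whitespace()
        .map(|t| t.to_lowercase())
        .filter(|t| !stop.contains(&t.as_str()))
        .map(|t| algo.stem(&t).into_owned())
        .collect()
}

fn main() {
    println!("{:?}", stem_tokens(Language::Polish, "z kotami w domach"));
    println!("{:?}", stem_tokens(Language::English, "The cats run"));
}
```

Keeping the backend choice inside one internal enum means callers only ever see `Language`, so moving a language from one crate to the other would not change the public surface.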

Tests

  • Existing tests pass
  • Added test_pl_tokenizer to verify that the Polish stemmer works correctly

feat(tokenizer): add support for Polish language stemming

- Introduced the Tantivy stemmer algorithm for Polish alongside the
  Rust stemmers in the Language enum and stemmer logic.
- Added stemmer implementation and integration for Polish.
- Expanded the stop-word filter to include a list of Polish stop words.
- Updated tests to validate Polish tokenizer behavior.
- Included the `tantivy-stemmers` dependency to support Polish language
  stemming.

Polish language support enhances language coverage for text processing.

# Conflicts:
#	Cargo.lock
- Eliminated an unused import for Polish language in stemmer.rs.
- Cleaned up redundant code to improve maintainability.
- Applied consistent formatting to `StemmerAlgorithm` and `token_stream`.
- Simplified match expressions for better readability and maintainability.
- Removed unnecessary blank lines in tokenizer tests.

These changes enhance code clarity and maintain coding style consistency.