add max_features and tokenizer to CountVectorizer #376

marco-cloudflare · 2025-02-06T13:43:59Z

add max_features and tokenizer to CountVectorizer (similar to what's available at sklearn). Note that tokenizer and regex as competing parameters, in case sklearn, it disables regex if you pass a tokenizer and gives you a warning, so here we could think of a single parameter that would encompass both.

another caveat is the serialization of the tokenizer function pointer, the workaround I made was not skip it, allow it to be reset after deserialization and keep a guard that will error if you try to use transform after deserialization without resetting a tokenizer

codecov · 2025-02-07T10:17:23Z

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 35.80%. Comparing base (6ab89bf) to head (0a2ea2f).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...gorithms/linfa-preprocessing/src/countgrams/mod.rs	78.57%	3 Missing ⚠️
...ms/linfa-preprocessing/src/tf_idf_vectorization.rs	40.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
+ Coverage   35.60%   35.80%   +0.19%     
==========================================
  Files          97       97              
  Lines        6386     6409      +23     
==========================================
+ Hits         2274     2295      +21     
- Misses       4112     4114       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

relf

Thanks for your contribution. As you mentioned, it would be great to have only one parameter with an appropriate enum type to unify regexp and tokenizer function.
Could you also add some tests and maybe an example of the tokenizer function you introduce?

marco-cloudflare · 2025-02-20T16:53:28Z

implemented the enum interface, documentation and testing

relf

Changes look good to me except for clippy linting. Would you mind rebasing your branch? Thanks!

algorithms/linfa-preprocessing/src/countgrams/hyperparams.rs

algorithms/linfa-preprocessing/src/countgrams/mod.rs

marco-cloudflare · 2025-03-27T15:57:14Z

done

relf · 2025-03-28T09:08:00Z

Thanks!

relf requested changes Feb 7, 2025

View reviewed changes

marco-cloudflare force-pushed the github-patch branch from 31a78fb to feec9fe Compare February 20, 2025 16:47

relf requested changes Mar 4, 2025

View reviewed changes

algorithms/linfa-preprocessing/src/countgrams/hyperparams.rs Outdated Show resolved Hide resolved

algorithms/linfa-preprocessing/src/countgrams/mod.rs Outdated Show resolved Hide resolved

add max_features and tokenizer to CountVectorizer

0a2ea2f

marco-cloudflare force-pushed the github-patch branch from feec9fe to 0a2ea2f Compare March 27, 2025 15:56

relf approved these changes Mar 28, 2025

View reviewed changes

relf merged commit fd4a214 into rust-ml:master Mar 28, 2025
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

add max_features and tokenizer to CountVectorizer #376

add max_features and tokenizer to CountVectorizer #376

Uh oh!

marco-cloudflare commented Feb 6, 2025 •

edited

Loading

Uh oh!

codecov bot commented Feb 7, 2025 •

edited

Loading

Uh oh!

relf left a comment •

edited

Loading

Uh oh!

marco-cloudflare commented Feb 20, 2025

Uh oh!

relf left a comment

Uh oh!

Uh oh!

Uh oh!

marco-cloudflare commented Mar 27, 2025

Uh oh!

Uh oh!

relf commented Mar 28, 2025

Uh oh!

Uh oh!

Uh oh!

add max_features and tokenizer to CountVectorizer #376

add max_features and tokenizer to CountVectorizer #376

Uh oh!

Conversation

marco-cloudflare commented Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

relf left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marco-cloudflare commented Feb 20, 2025

Uh oh!

relf left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

marco-cloudflare commented Mar 27, 2025

Uh oh!

Uh oh!

relf commented Mar 28, 2025

Uh oh!

Uh oh!

marco-cloudflare commented Feb 6, 2025 •

edited

Loading

codecov bot commented Feb 7, 2025 •

edited

Loading

relf left a comment •

edited

Loading