add max_features and tokenizer to CountVectorizer #376

Merged (1 commit) on Mar 28, 2025

marco-cloudflare (Contributor) commented Feb 6, 2025

Adds max_features and tokenizer to CountVectorizer, similar to what sklearn offers. Note that tokenizer and regex are competing parameters: sklearn disables the regex when you pass a tokenizer and emits a warning, so here we could consider a single parameter that encompasses both.
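To illustrate what the two options mean, here is a minimal sketch of a vocabulary builder that takes a tokenizer function and keeps only the most frequent tokens, as sklearn's max_features does. The names build_vocabulary and whitespace_tokenizer are hypothetical and are not linfa's API; the alphabetical tie-breaking is an assumption made only to keep the output deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not linfa's API): builds a vocabulary from `docs`,
/// keeping at most `max_features` of the most frequent tokens.
fn build_vocabulary(
    docs: &[&str],
    tokenizer: fn(&str) -> Vec<&str>,
    max_features: Option<usize>,
) -> Vec<String> {
    // Count how often each token occurs across the whole corpus.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &doc in docs {
        for token in tokenizer(doc) {
            *counts.entry(token).or_insert(0) += 1;
        }
    }

    // Sort by descending count; ties are broken alphabetically
    // (an assumption, chosen here only for determinism).
    let mut entries: Vec<(&str, usize)> = counts.into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));

    // Apply the cap only when one was requested.
    if let Some(n) = max_features {
        entries.truncate(n);
    }
    entries.into_iter().map(|(token, _)| token.to_string()).collect()
}

/// Illustrative custom tokenizer: splits on whitespace.
fn whitespace_tokenizer(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

With the documents "a b b" and "b c a" and a cap of 2, only "b" and "a" survive, because "c" is the least frequent token.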

Another caveat is the serialization of the tokenizer function pointer. My workaround was to skip it during serialization, allow it to be reset after deserialization, and keep a guard that returns an error if you call transform after deserialization without resetting the tokenizer.
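The guard described above could look roughly like the following sketch. It does not use serde; deserialization is simulated by a constructor that restores the vocabulary and leaves the tokenizer unset. The type, field, and method names (FittedVectorizer, restored, set_tokenizer, TransformError) are hypothetical and are not taken from the PR.

```rust
/// Hypothetical stand-in for a fitted vectorizer; names are illustrative.
struct FittedVectorizer {
    vocabulary: Vec<String>,
    // A function pointer cannot be serialized, so after a round trip
    // this field is empty until the caller sets it again.
    tokenizer: Option<fn(&str) -> Vec<&str>>,
}

#[derive(Debug, PartialEq)]
enum TransformError {
    /// `transform` was called before a tokenizer was (re)set.
    TokenizerNotSet,
}

impl FittedVectorizer {
    /// Simulates the result of deserialization: the vocabulary is
    /// restored, the tokenizer is missing.
    fn restored(vocabulary: Vec<String>) -> Self {
        Self { vocabulary, tokenizer: None }
    }

    /// Re-attaches a tokenizer after deserialization.
    fn set_tokenizer(&mut self, tokenizer: fn(&str) -> Vec<&str>) {
        self.tokenizer = Some(tokenizer);
    }

    /// Counts vocabulary tokens in `doc`; fails if no tokenizer is set.
    fn transform(&self, doc: &str) -> Result<Vec<usize>, TransformError> {
        let tokenizer = self.tokenizer.ok_or(TransformError::TokenizerNotSet)?;
        let mut counts = vec![0; self.vocabulary.len()];
        for token in tokenizer(doc) {
            if let Some(i) = self.vocabulary.iter().position(|v| v.as_str() == token) {
                counts[i] += 1;
            }
        }
        Ok(counts)
    }
}

/// Illustrative tokenizer: splits on whitespace.
fn whitespace_tokenizer(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Returning an error rather than silently falling back to a default tokenizer keeps a restored model from producing counts that differ from those it produced before serialization.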

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 35.80%. Comparing base (6ab89bf) to head (0a2ea2f).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...gorithms/linfa-preprocessing/src/countgrams/mod.rs | 78.57% | 3 Missing ⚠️ |
| ...ms/linfa-preprocessing/src/tf_idf_vectorization.rs | 40.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
+ Coverage   35.60%   35.80%   +0.19%     
==========================================
  Files          97       97              
  Lines        6386     6409      +23     
==========================================
+ Hits         2274     2295      +21     
- Misses       4112     4114       +2     

relf (Member) left a comment

Thanks for your contribution. As you mentioned, it would be great to have only one parameter with an appropriate enum type to unify regexp and tokenizer function.
Could you also add some tests and maybe an example of the tokenizer function you introduce?
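One possible shape for such a unified parameter is sketched below. The enum and its variant names are hypothetical. To keep the sketch runnable with the standard library alone, the regex variant is replaced by a simpler delimiter-based stand-in; a real implementation would presumably hold a regex pattern there instead.

```rust
/// Hypothetical unified tokenizer parameter; variant names are illustrative.
enum Tokenizer {
    /// A user-supplied tokenizer function.
    Function(fn(&str) -> Vec<&str>),
    /// Stand-in for the regex variant: characters to split on.
    /// (Assumption: used here only to avoid an external regex dependency.)
    Delimiters(&'static [char]),
}

impl Tokenizer {
    /// Dispatches to whichever tokenization strategy was configured,
    /// so the two options can never be set at the same time.
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        match self {
            Tokenizer::Function(f) => f(text),
            Tokenizer::Delimiters(seps) => text
                .split(|c: char| seps.contains(&c))
                .filter(|token| !token.is_empty())
                .collect(),
        }
    }
}

/// Example of a custom tokenizer function: splits on commas and trims spaces.
fn split_on_commas(text: &str) -> Vec<&str> {
    text.split(',').map(str::trim).collect()
}
```

Because the choice is a single enum value, there is no need for the sklearn-style warning about one parameter overriding the other.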

marco-cloudflare (Contributor, Author) commented

implemented the enum interface, documentation and testing

relf (Member) left a comment

Changes look good to me except for clippy linting. Would you mind rebasing your branch? Thanks!

marco-cloudflare (Contributor, Author) commented

done

@relf relf merged commit fd4a214 into rust-ml:master Mar 28, 2025
21 checks passed
relf (Member) commented Mar 28, 2025

Thanks!
