add max_features and tokenizer to CountVectorizer #376
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #376 +/- ##
==========================================
+ Coverage 35.60% 35.80% +0.19%
==========================================
Files 97 97
Lines 6386 6409 +23
==========================================
+ Hits 2274 2295 +21
- Misses 4112 4114 +2
Thanks for your contribution. As you mentioned, it would be great to have only one parameter with an appropriate enum type to unify the regexp and the tokenizer function.
Could you also add some tests and maybe an example of the tokenizer function you introduce?
Force-pushed from 31a78fb to feec9fe
Implemented the enum interface, documentation, and testing.
Changes look good to me except for clippy linting. Would you mind rebasing your branch? Thanks!
Force-pushed from feec9fe to 0a2ea2f
done
Thanks!
Add max_features and tokenizer to CountVectorizer (similar to what's available in sklearn). Note that tokenizer and regex are competing parameters: in sklearn's case, passing a tokenizer disables the regex and emits a warning, so here we could think of a single parameter that encompasses both (a rough sketch of such a parameter is shown below).
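A minimal sketch of what such a unifying parameter could look like, assuming the `regex` crate for the pattern-based variant; the enum and method names here are illustrative, not necessarily the ones introduced in this PR:

```rust
// Hypothetical sketch only: a single enum that unifies the regex pattern
// and a user-supplied tokenizer function. Names are illustrative.
use regex::Regex;

pub enum Tokenizer {
    /// Tokens are the non-overlapping matches of this pattern
    /// (sklearn-style `token_pattern`).
    Regex(String),
    /// A plain function pointer that produces the tokens for one document.
    Function(fn(&str) -> Vec<String>),
}

impl Tokenizer {
    pub fn tokenize(&self, text: &str) -> Vec<String> {
        match self {
            Tokenizer::Regex(pattern) => {
                let re = Regex::new(pattern).expect("invalid token pattern");
                re.find_iter(text).map(|m| m.as_str().to_string()).collect()
            }
            Tokenizer::Function(f) => f(text),
        }
    }
}

fn split_on_whitespace(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

fn main() {
    let by_regex = Tokenizer::Regex(r"\b\w\w+\b".to_string());
    let by_function = Tokenizer::Function(split_on_whitespace);
    assert_eq!(by_regex.tokenize("a bb ccc"), vec!["bb", "ccc"]);
    assert_eq!(by_function.tokenize("a bb ccc"), vec!["a", "bb", "ccc"]);
}
```

With a single enum like this there is no need for the sklearn-style warning, since only one of the two tokenization strategies can ever be set at a time.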
Another caveat is the serialization of the tokenizer function pointer. The workaround I made was to skip it during serialization, allow it to be reset after deserialization, and keep a guard that errors if you try to use transform after deserialization without resetting a tokenizer (illustrated below).
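A minimal sketch of that workaround, assuming serde with the derive feature; the struct, field, and method names here are made up for illustration and are not the PR's actual implementation:

```rust
// Hypothetical illustration of the serialization workaround described above:
// the function pointer is skipped by serde, comes back as `None` after
// deserialization, and `transform` refuses to run until it is reset.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct FittedVectorizer {
    vocabulary: Vec<String>,
    // A `fn` pointer cannot be serialized, so it is skipped and restored
    // to its `Default` value (`None`) on deserialization.
    #[serde(skip)]
    tokenizer_function: Option<fn(&str) -> Vec<String>>,
}

impl FittedVectorizer {
    /// Re-attach a tokenizer after deserialization.
    pub fn force_tokenizer_function(&mut self, f: fn(&str) -> Vec<String>) {
        self.tokenizer_function = Some(f);
    }

    /// Count vocabulary occurrences in `doc`, or fail if the tokenizer
    /// was lost through deserialization and never reset.
    pub fn transform(&self, doc: &str) -> Result<Vec<usize>, String> {
        let tokenize = self
            .tokenizer_function
            .ok_or_else(|| "tokenizer function must be set after deserialization".to_string())?;
        let tokens = tokenize(doc);
        Ok(self
            .vocabulary
            .iter()
            .map(|term| tokens.iter().filter(|t| t.as_str() == term.as_str()).count())
            .collect())
    }
}
```

After deserializing, calling transform before resetting the tokenizer would return the guard's error instead of silently producing wrong counts.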