feat: add a simple way to chain two tokenizers #2304
base: main
Conversation
Force-pushed from 04fe71e to 7068477
I applied nightly.

@ctron can you describe your use case?

My use case is to have all simple tokens plus all ngrams.
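To make the use case concrete, here is a minimal stand-in sketch of the desired output: the chained tokenizer should emit everything the simple tokenizer produces, followed by everything the ngram tokenizer produces. The `simple_tokens` and `ngrams` helpers are hypothetical illustrations, not tantivy's `SimpleTokenizer`/`NgramTokenizer` API.

```rust
// Stand-in for a simple whitespace tokenizer.
fn simple_tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

// Stand-in for a character-ngram tokenizer of width `n`.
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        return Vec::new();
    }
    (0..=chars.len() - n)
        .map(|i| chars[i..i + n].iter().collect())
        .collect()
}

fn main() {
    let text = "ab cd";
    // Chaining yields the simple tokens first, then the ngrams.
    let mut tokens = simple_tokens(text);
    tokens.extend(ngrams(text, 2));
    println!("{:?}", tokens); // ["ab", "cd", "ab", "b ", " c", "cd"]
}
```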
Done.

```rust
pub struct ChainTokenStream<'a, F, S>
```
The `Token::position` fields would need updating though, wouldn't they? Meaning the positions of the second stream should be offset by the number of tokens yielded by the first one?
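The offsetting idea could be sketched as follows. This uses a simplified stand-in `Token` type and an eager `chain` function rather than tantivy's streaming `TokenStream`, and it assumes the offset is the last position of the first stream plus one.

```rust
// Simplified stand-in for tantivy's Token; only the fields relevant here.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    position: usize,
    text: String,
}

// Chain two token lists, shifting the second list's positions so they
// continue after the first list's positions.
fn chain(first: Vec<Token>, second: Vec<Token>) -> Vec<Token> {
    let offset = first.last().map(|t| t.position + 1).unwrap_or(0);
    let mut out = first;
    out.extend(second.into_iter().map(|mut t| {
        t.position += offset;
        t
    }));
    out
}

fn main() {
    let first = vec![
        Token { position: 0, text: "hello".into() },
        Token { position: 1, text: "world".into() },
    ];
    let second = vec![Token { position: 0, text: "he".into() }];
    let chained = chain(first, second);
    // The second stream's token now sits at position 2.
    println!("{}", chained[2].position); // 2
}
```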
I was able to incorporate most of the feedback you mentioned. It's less explicit without the … I am not sure about the … I'll admit that the whole API around tokenization feels a bit confusing.
Force-pushed from 09f3012 to 9bc739d
Agreed, it is more implicit. But I was mainly suggesting it for efficiency, i.e. keep the state as small as possible and drop the first tokenizer as soon as we are done with it. If you feel like code-golfing, I think those two calls to `advance` could be merged:

```rust
fn advance(&mut self) -> bool {
    if let Some(first) = &mut self.first {
        if first.advance() {
            return true;
        } else {
            self.first = None;
        }
    }
    self.second.advance()
}
```
Force-pushed from 9bc739d to 92e3c5f
I like that, pushed. So the remaining thing seems to be the position. I am just not sure what to do with it.
I'd say let's wait for input from @fulmicoton on that. I am myself unsure what downstream consumers expect of the position field. I suspect that it is mainly used for phrase queries with slop, which I think would make the current implementation correct, i.e. have …
@PSeitz can you review?

Fixed the test issue.
There is currently a hidden contract on the tokenizer API which expects positions to be sorted in increasing order. This is relied upon in the serialization code in recorder.rs, where positions are delta-encoded. There are two options: …

I think handling unsorted positions is not really favorable, since it would carry some performance overhead.
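The sortedness contract can be illustrated with a small sketch of delta encoding (this is an illustration of the general technique, not the actual code in recorder.rs): each stored value is the difference from the previous position, so a position that goes backwards cannot be represented as an unsigned delta.

```rust
// Delta-encode a position list. Returns None if the input is not
// sorted in increasing order, i.e. if a delta would underflow.
fn delta_encode(positions: &[u32]) -> Option<Vec<u32>> {
    let mut prev = 0u32;
    let mut deltas = Vec::with_capacity(positions.len());
    for &p in positions {
        // checked_sub yields None when a position goes backwards.
        deltas.push(p.checked_sub(prev)?);
        prev = p;
    }
    Some(deltas)
}

fn main() {
    println!("{:?}", delta_encode(&[1, 3, 4])); // Some([1, 2, 1])
    println!("{:?}", delta_encode(&[3, 1]));    // None
}
```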
So in this case, we'd need to interleave the output of the two tokenizers dynamically to ensure that one does not outpace the other. Or we could just make positions up and offset all positions returned by the second tokenizer?
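The interleaving alternative amounts to a merge of two position-sorted streams, so the combined output stays sorted and satisfies the delta-encoding contract. A sketch with stand-in `(position, text)` pairs rather than tantivy's streaming types:

```rust
// Merge two token lists that are each sorted by position, producing a
// combined list that is also sorted by position.
fn interleave(
    a: Vec<(usize, String)>,
    b: Vec<(usize, String)>,
) -> Vec<(usize, String)> {
    let mut ai = a.into_iter().peekable();
    let mut bi = b.into_iter().peekable();
    let mut out = Vec::new();
    loop {
        // Decide which stream to pull from by peeking at both heads.
        let take_a = match (ai.peek(), bi.peek()) {
            (Some(x), Some(y)) => x.0 <= y.0,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        if take_a {
            out.push(ai.next().unwrap());
        } else {
            out.push(bi.next().unwrap());
        }
    }
    out
}

fn main() {
    let a = vec![(0, "a".to_string()), (2, "c".to_string())];
    let b = vec![(1, "b".to_string())];
    let positions: Vec<usize> = interleave(a, b).iter().map(|t| t.0).collect();
    println!("{:?}", positions); // [0, 1, 2]
}
```

The offsetting option is simpler and stateless, while interleaving preserves each tokenizer's original positions; which one is correct depends on what consumers of the position field expect.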