
feat: add a simple way to chain two tokenizers #2304

Open · wants to merge 3 commits into main

Conversation

@ctron (Author) commented Jan 18, 2024

No description provided.

@ctron (Author) commented Jan 19, 2024

I applied nightly rustfmt.

@fulmicoton (Collaborator)

@ctron can you describe your use case?

@ctron (Author) commented Jan 19, 2024

My use case is to have all simple tokens plus all ngrams.
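For illustration, a minimal sketch of that use case. `ChainTokenizer` here stands in for whatever API this PR settles on, and the tokenizer constructor signatures vary between tantivy versions, so treat this as an assumption rather than the PR's actual code:

use tantivy::tokenizer::{NgramTokenizer, SimpleTokenizer, TextAnalyzer};

// Hypothetical sketch: emit all plain word tokens first, then all
// ngrams, from one chained tokenizer.
fn build_analyzer() -> TextAnalyzer {
    let chained = ChainTokenizer::new(
        SimpleTokenizer::default(),       // all simple word tokens...
        NgramTokenizer::all_ngrams(2, 3), // ...plus all 2- and 3-grams
    );
    TextAnalyzer::from(chained)
}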

(Two resolved review comments on src/tokenizer/chain.rs; outdated.)
A Collaborator commented on src/tokenizer/chain.rs, at `pub struct ChainTokenStream<'a, F, S>`:

The Token::position fields would need updating though, wouldn't they? Meaning the positions of the second stream should be offset by the number of tokens yielded by the first one?
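For illustration, a sketch of what such offsetting could look like in the chained stream's advance(). The field names are assumptions, not the PR's code:

fn advance(&mut self) -> bool {
    if self.first.advance() {
        // Count the tokens yielded by the first stream
        // (hypothetical `first_len` field).
        self.first_len += 1;
        return true;
    }
    if self.second.advance() {
        // Shift the second stream's positions past the first stream's,
        // so positions keep increasing across the chain.
        self.second.token_mut().position += self.first_len;
        return true;
    }
    false
}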

(One more resolved review comment on src/tokenizer/chain.rs; outdated.)
@ctron (Author) commented Jan 22, 2024

I was able to incorporate most of the feedback you mentioned. It's less explicit without the enum, but it works the same way. There was just one missing call to second.advance(), as a TokenStream seems to start before its first item.

I am not sure about the position topic, but I also don't fully understand it: If the idea is to enumerate/count all tokens, then yes, that should be offset by the length of first. If it reflects the position of the token within the original input, then that might not be correct.

I'll admit that the whole API around tokenization feels a bit confusing.

@adamreichold (Collaborator)

> I was able to incorporate most of the feedback you mentioned. It's less explicit without the enum, but it works the same way. There was just one missing call to second.advance(), as a TokenStream seems to start before its first item.

Agreed, it is more implicit. But I was mainly suggesting it for efficiency, i.e. keeping the state as small as possible and dropping the first stream as soon as we are done with it. If you feel like code-golfing, I think those two calls to second.advance() could even be merged:

fn advance(&mut self) -> bool {
    // Drain the first stream; once it is exhausted, drop it so later
    // calls skip straight to the second stream.
    if let Some(first) = &mut self.first {
        if first.advance() {
            return true;
        } else {
            self.first = None;
        }
    }

    // The first stream is gone (just now or earlier), so the next
    // token, if any, comes from the second stream.
    self.second.advance()
}

@ctron (Author) commented Jan 22, 2024

> If you feel like code-golfing, I think those two calls to second.advance() could even be merged

I like that, pushed.

So, the remaining thing seems to be the position. I am just not sure what to do with it.

@adamreichold (Collaborator)

> So, the remaining thing seems to be the position. I am just not sure what to do with it.

I'd say let's wait for input from @fulmicoton on that. I am myself unsure what downstream consumers expect of the position field. I suspect it is mainly used for phrase queries with slop, which I think would make the current implementation correct, i.e. position would still refer to the original input.
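For context, a tiny illustration of how positions drive slop matching; within_slop is a hypothetical helper, not tantivy API:

/// Consecutive query terms may be at most `1 + slop` positions apart.
fn within_slop(term_positions: &[u32], slop: u32) -> bool {
    term_positions
        .windows(2)
        .all(|w| w[1] > w[0] && w[1] - w[0] <= 1 + slop)
}

fn main() {
    // With "quick brown fox" tokenized at positions 0, 1, 2, the
    // phrase "quick fox" matches with slop 1 but not with slop 0:
    assert!(within_slop(&[0, 2], 1));
    assert!(!within_slop(&[0, 2], 0));
}

This only works as intended if position reflects where the token sits in the original input, which is why the question matters here.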

@fulmicoton (Collaborator)

@PSeitz can you review?

@ctron (Author) commented Jan 23, 2024

Fixed the test issue.

@PSeitz (Contributor) commented Feb 5, 2024

There is currently a hidden contract on the tokenizer API which expects positions to be sorted in increasing order. This comes from the serialization code in recorder.rs, where positions are delta-encoded.

There are two options:

  • Handle unsorted positions in the serialization (allow unsorted positions from the tokenizer)
  • Return the tokens in a sorted way

I think handling unsorted positions is not really favorable, since it would carry some performance overhead.
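A minimal sketch of why the contract exists; this is illustrative, not the actual recorder.rs code:

/// Delta-encodes a position list, storing each position as the
/// difference from the previous one.
fn delta_encode(positions: &[u32]) -> Option<Vec<u32>> {
    let mut prev = 0u32;
    let mut deltas = Vec::with_capacity(positions.len());
    for &pos in positions {
        // Unsorted input would make this subtraction underflow; that
        // is the hidden contract on the tokenizer API described above.
        deltas.push(pos.checked_sub(prev)?);
        prev = pos;
    }
    Some(deltas)
}

fn main() {
    assert_eq!(delta_encode(&[1, 3, 7]), Some(vec![1, 2, 4]));
    assert_eq!(delta_encode(&[3, 1]), None); // unsorted positions rejected
}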

@adamreichold (Collaborator)

> Return the tokens in a sorted way

So in this case, we'd need to interleave the output of the two tokenizers dynamically to ensure that one does not outpace the other. Or we could just make positions up and offset all positions returned by the second tokenizer?
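A sketch of the interleaving option over plain iterators, since the TokenStream plumbing is beside the point; everything here is illustrative:

#[derive(Debug, Clone)]
struct Token {
    position: usize,
    text: String,
}

// Merge two position-sorted token sequences so the combined output
// stays sorted, which is what the serializer requires.
fn merge_by_position(
    first: impl IntoIterator<Item = Token>,
    second: impl IntoIterator<Item = Token>,
) -> Vec<Token> {
    let mut a = first.into_iter().peekable();
    let mut b = second.into_iter().peekable();
    let mut merged = Vec::new();
    loop {
        // Take from whichever stream currently holds the smaller
        // position, so neither outpaces the other.
        let take_a = match (a.peek(), b.peek()) {
            (Some(x), Some(y)) => x.position <= y.position,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        merged.push(if take_a {
            a.next().unwrap()
        } else {
            b.next().unwrap()
        });
    }
    merged
}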
