I had fun learning about your library and thought I might contribute. While investigating the message I kept getting that stated:

> Token indices sequence length is longer than the specified maximum sequence length for this model (676 > 512). Running this sequence through the model will result in indexing errors

I discovered that it originated from the Transformers library, but that it's normal insofar as `semchunk` tries to optimize the ultimate chunks, to put it simplistically.

Upon further research, I discovered that the `chunk_size` parameter is repeatedly consulted in order to respect that limit, but I didn't see where the tokenizer's inherent limit was consulted in the same way. In theory, a user could specify a `chunk_size` larger than the tokenizer's inherent limit. True, they'd "probably" get the aforementioned print statement provided they haven't turned off the logger/warning or whatnot, but I thought it'd be nice to add some handling within `semchunk` itself.

I've commented out three proposals and added explanations within them. One can be uncommented and used, or they can serve as a source of further discussion. A rough sketch of the general idea follows.
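To illustrate the kind of handling I have in mind (the actual proposals are in the commented-out code in the diff), one option might look roughly like the following. This is a minimal sketch, not the real implementation: `_cap_chunk_size` is a hypothetical helper name, it assumes a Hugging Face tokenizer exposing `model_max_length`, and cap-and-warn is only one of the possible behaviours (raising or silently clamping are others).

```python
import warnings

def _cap_chunk_size(chunk_size: int, tokenizer) -> int:
    """Cap a user-supplied chunk size at the tokenizer's own limit.

    Hypothetical helper for illustration only; assumes a Hugging Face
    tokenizer that exposes `model_max_length`.
    """
    max_length = getattr(tokenizer, 'model_max_length', None)

    if max_length is not None and chunk_size > max_length:
        warnings.warn(
            f"chunk_size ({chunk_size}) exceeds the tokenizer's maximum "
            f"sequence length ({max_length}); capping it at {max_length}."
        )
        chunk_size = max_length

    return chunk_size
```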
In all three proposals, there's code that accounts for the blank tokenization overhead, identical to how `semchunk` handles the overhead when `chunk_size` is `None`.
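For context, that overhead can be measured by tokenizing an empty string, which exposes any special tokens the tokenizer adds to every sequence. A rough sketch of the adjustment, again assuming a Hugging Face tokenizer rather than mirroring the exact code in the diff:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenizing an empty string yields only the special tokens (e.g. [CLS] and
# [SEP]) that the tokenizer prepends/appends to every sequence.
special_token_overhead = len(tokenizer.encode(''))

# The usable budget is the model's maximum sequence length minus that fixed
# overhead, mirroring the adjustment made when chunk_size is None.
effective_limit = tokenizer.model_max_length - special_token_overhead

print(special_token_overhead, effective_limit)  # typically 2 and 510 for BERT
```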