
chunk size precautions #16


Open · wants to merge 1 commit into main
Conversation

@BBC-Esq commented Feb 16, 2025

I had fun learning about your library and thought I might contribute. While investigating a message I kept getting, "Token indices sequence length is longer than the specified maximum sequence length for this model (676 > 512). Running this sequence through the model will result in indexing errors", I discovered that it originates from the Transformers library and that it's normal insofar as semchunk tries to optimize the resulting chunks, to put it simplistically.
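
For anyone else hitting this, here's a minimal sketch (my own, not part of this PR) that reproduces the warning straight from a Hugging Face tokenizer, assuming transformers and the bert-base-uncased checkpoint are available; semchunk isn't involved at all:

```python
# Minimal sketch: the Hugging Face tokenizer itself emits the warning whenever
# it encodes a sequence longer than its model_max_length, before any chunking.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model_max_length is 512

long_text = "word " * 700  # encodes to well over 512 tokens
token_ids = tokenizer.encode(long_text)
# Logs a warning of the form: "Token indices sequence length is longer than
# the specified maximum sequence length for this model (... > 512). Running
# this sequence through the model will result in indexing errors"
print(len(token_ids))
```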

Upon further research, I discovered that the chunk_size parameter is repeatedly consulted in order to respect that limit, but I didn't see the tokenizer's inherent limit being consulted in the same way. In theory, a user could specify a chunk_size larger than the tokenizer's inherent limit. True, they'd "probably" get the aforementioned warning, provided they haven't turned off the logger or suppressed warnings, but I thought it'd be nice to add some handling within semchunk itself.

I've commented out three proposals and added explanations within them. One can be uncommented and used, or they can be a source of further discussion.

In all three proposals, there's code that accounts for the blank tokenization overhead, the same way semchunk handles the overhead when chunk_size is None.
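
Roughly, the idea behind all three looks something like this (hypothetical helper and names, not the actual diff): estimate the blank-tokenization overhead the way semchunk does when chunk_size is None, and flag a chunk_size that exceeds the tokenizer's model_max_length:

```python
# Hypothetical sketch of the proposals' core logic; not the code in this PR.
import warnings

def adjusted_chunk_size(tokenizer, chunk_size: int) -> int:
    # Estimate overhead the way semchunk does when chunk_size is None: the
    # number of tokens produced by encoding an empty string (e.g. special
    # tokens such as [CLS] and [SEP]).
    try:
        overhead = len(tokenizer.encode(""))
    except Exception:
        overhead = 0  # not every token counter supports encode("")

    model_limit = getattr(tokenizer, "model_max_length", None)
    if model_limit is not None and chunk_size > model_limit:
        warnings.warn(
            f"chunk_size ({chunk_size}) exceeds the tokenizer's "
            f"model_max_length ({model_limit})."
        )

    return chunk_size - overhead
```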

@umarbutler (Collaborator)

Hey @BBC-Esq,
Sorry, it's taken me a while to get to this.

Proposals 2 and 3 are a no-go because there are many good reasons why someone would want to chunk text into chunk sizes larger than a tokenizer's model_max_length attribute. For example, maybe someone is training a model using such a tokenizer to have a higher context window than the context window of the model that the tokenizer was originally intended for. Additionally, the model_max_length attribute can be inaccurate for some models, speaking from experience.

With regard to Proposal 1, it has two components:

  1. You adjust the chunk size by deducting overhead.
  2. You issue a warning if the chunk size exceeds the model_max_length attribute.

The first component is a no-go because:

  1. It is not possible to determine with absolute certainty the number of overhead tokens. Yes, I do try to do exactly that using len(tokenizer_or_token_counter.encode("")), which I regard as acceptable only when no chunk size is specified, because in that scenario, the user is ceding control to semchunk to make that decision for them.
  2. Your approach of suppressing exceptions when the overhead can't be deducted could lead to inconsistent behavior across tokenizers for which len(tokenizer_or_token_counter.encode("")) doesn't work.
  3. Most importantly, this would be a big, breaking change to existing code.
  4. What if someone actually doesn't want to deduct overhead?

I am open to adding an optional flag to auto-remove overhead that is set to false by default. Is that something you think would be helpful?

Regarding the second component, I think it's reasonable to issue such a warning, provided we only show it to the user once, and we introduce a flag to disable the warning.
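
Something along these lines could work for the warning (hypothetical names and flag, just a sketch of the shape rather than a commitment to an API):

```python
# Hypothetical sketch: warn at most once per process, with an opt-out flag.
import warnings

_chunk_size_warning_issued = False

def _maybe_warn_chunk_size(chunk_size, model_max_length, warn: bool = True) -> None:
    global _chunk_size_warning_issued
    if not warn or _chunk_size_warning_issued:
        return
    if model_max_length is not None and chunk_size > model_max_length:
        warnings.warn(
            f"Your chunk_size ({chunk_size}) exceeds the tokenizer's "
            f"model_max_length ({model_max_length}); the tokenizer may log "
            f"warnings or cause indexing errors downstream.",
            stacklevel=2,
        )
        _chunk_size_warning_issued = True
```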

Would you like to implement all this, or would you like for me to implement it? I don't mind, either way :)

By the way, thank you for making the effort to submit a PR; it's much appreciated!

@BBC-Esq (Author) commented Mar 6, 2025

Hey, thanks for your response. I wouldn't mind implementing it; it's just a lack of time. When I submitted the PR, I had just tested your library, liked it, and I have a habit of contributing something quick to a repo as a "thanks." If you like some of the ideas, feel free to run with them; I just have a lot on my plate right now. Or feel free to disregard them if you have second thoughts and don't see the need. Either way, I won't be offended. ;-)

@umarbutler (Collaborator)

No worries, totally understand, same deal here :) I'll get around to this but may put it on the backlog.
