
chunk size precautions #16


Open · wants to merge 1 commit into main
Conversation

@BBC-Esq commented Feb 16, 2025

I had fun learning about your library and thought I might contribute. While investigating a message I kept getting, "Token indices sequence length is longer than the specified maximum sequence length for this model (676 > 512). Running this sequence through the model will result in indexing errors", I discovered that it originates from the Transformers library and that it's normal insofar as semchunk tries to optimize the resulting chunks, to put it simplistically.
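
For anyone else hitting this, here's a minimal sketch (my own, not part of this PR) that reproduces the warning straight from a Hugging Face tokenizer, assuming transformers and the bert-base-uncased checkpoint are available; semchunk isn't involved at all:

```python
# Minimal sketch: the Hugging Face tokenizer itself emits the warning whenever
# it encodes a sequence longer than its model_max_length, before any chunking.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model_max_length is 512

long_text = "word " * 700  # encodes to well over 512 tokens
token_ids = tokenizer.encode(long_text)
# Logs a warning of the form: "Token indices sequence length is longer than
# the specified maximum sequence length for this model (... > 512). Running
# this sequence through the model will result in indexing errors"
print(len(token_ids))
```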

Upon further research, I discovered that the chunk_size parameter is repeatedly consulted in order to respect that limit, but I didn't see the tokenizer's inherent limit being consulted in the same way. In theory, a user could specify a chunk_size larger than the tokenizer's inherent limit. True, they'd "probably" get the aforementioned warning, provided they haven't turned off the logger or suppressed warnings, but I thought it'd be nice to add some handling within semchunk itself.

I've commented out three proposals and added explanations within them. One can be uncommented and used, or they can be a source of further discussion.

In all three proposals, there's code that accounts for the blank tokenization overhead, the same way semchunk handles the overhead when chunk_size is None.
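
Roughly, the idea behind all three looks something like this (hypothetical helper and names, not the actual diff): estimate the blank-tokenization overhead the way semchunk does when chunk_size is None, and flag a chunk_size that exceeds the tokenizer's model_max_length:

```python
# Hypothetical sketch of the proposals' core logic; not the code in this PR.
import warnings

def adjusted_chunk_size(tokenizer, chunk_size: int) -> int:
    # Estimate overhead the way semchunk does when chunk_size is None: the
    # number of tokens produced by encoding an empty string (e.g. special
    # tokens such as [CLS] and [SEP]).
    try:
        overhead = len(tokenizer.encode(""))
    except Exception:
        overhead = 0  # not every token counter supports encode("")

    model_limit = getattr(tokenizer, "model_max_length", None)
    if model_limit is not None and chunk_size > model_limit:
        warnings.warn(
            f"chunk_size ({chunk_size}) exceeds the tokenizer's "
            f"model_max_length ({model_limit})."
        )

    return chunk_size - overhead
```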

@umarbutler (Collaborator)

Hey @BBC-Esq,
Sorry, it's taken me a while to get to this.

Proposals 2 and 3 are a no-go because there are many good reasons why someone would want to chunk text into chunk sizes larger than a tokenizer's model_max_length attribute. For example, maybe someone is training a model using such a tokenizer to have a higher context window than the context window of the model that the tokenizer was originally intended for. Additionally, the model_max_length attribute can be inaccurate for some models, speaking from experience.

With regard to Proposal 1, it has two components:

  1. You adjust the chunk size by deducting overhead.
  2. You issue a warning if the chunk size exceeds the model_max_length attribute.

The first component is a no-go because:

  1. It is not possible to determine with absolute certainty the number of overhead tokens. Yes, I do try to do exactly that using len(tokenizer_or_token_counter.encode("")), which I regard as acceptable only when no chunk size is specified, because in that scenario, the user is ceding control to semchunk to make that decision for them.
  2. Your approach of suppressing exceptions when the overhead can't be deducted could lead to inconsistent behavior across tokenizers for which len(tokenizer_or_token_counter.encode("")) doesn't work.
  3. Most importantly, this would be a big, breaking change to existing code.
  4. What if someone actually doesn't want to deduct overhead?

I am open to adding an optional flag to auto-remove overhead that is set to false by default. Is that something you think would be helpful?

Regarding the second component, I think it's reasonable to issue such a warning, provided we only show it to the user once, and we introduce a flag to disable the warning.
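
Something along these lines could work for the warning (hypothetical names and flag, just a sketch of the shape rather than a commitment to an API):

```python
# Hypothetical sketch: warn at most once per process, with an opt-out flag.
import warnings

_chunk_size_warning_issued = False

def _maybe_warn_chunk_size(chunk_size, model_max_length, warn: bool = True) -> None:
    global _chunk_size_warning_issued
    if not warn or _chunk_size_warning_issued:
        return
    if model_max_length is not None and chunk_size > model_max_length:
        warnings.warn(
            f"Your chunk_size ({chunk_size}) exceeds the tokenizer's "
            f"model_max_length ({model_max_length}); the tokenizer may log "
            f"warnings or cause indexing errors downstream.",
            stacklevel=2,
        )
        _chunk_size_warning_issued = True
```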

Would you like to implement all this, or would you like for me to implement it? I don't mind, either way :)

By the way, thank you for making the effort to submit a PR; it's much appreciated!

@BBC-Esq (Author) commented Mar 6, 2025

Hey, thanks for your response. I wouldn't mind implementing it; it's just a lack of time. When I submitted the PR, I had just tested your library, liked it, and I have a habit of contributing something quick to a repo as a "thanks." If you like some of the ideas, feel free to run with them; I just have a lot on my plate right now. Or feel free to disregard them if you have second thoughts and don't see the need. Either way, I won't be offended. ;-)

@umarbutler (Collaborator)

No worries, totally understand, same deal here :) I'll get around to this but may put it on the backlog.
