Semantic Chunking Chunk Size Bug #11

seankim658 · 2024-07-08T01:26:53Z

LlamaIndex's SemanticSplitterNodeParser can sometimes produce chunks that are too large for the embedding model. Unfortunately, there is no max length option for semantic chunking to avoid this issue.

Will eventually have to subclass SemanticSplitterNodeParser and add a second-level safety net that naively splits oversized chunks into sub-chunks, so that every chunk stays under the embedding model's input token limit.
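A rough sketch of what that second-level split could look like, independent of LlamaIndex. Everything here is an assumption for illustration: `enforce_max_tokens` and `whitespace_tokenize` are hypothetical names, and whitespace splitting only stands in for the embedding model's real tokenizer, so the counts will not match the model's actual token counts.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def enforce_max_tokens(
    chunks: List[str],
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Pass small chunks through unchanged; naively split oversized ones."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    safe_chunks: List[str] = []
    for chunk in chunks:
        tokens = tokenize(chunk)
        if len(tokens) <= max_tokens:
            # Already within the limit, keep the semantic chunk as-is.
            safe_chunks.append(chunk)
            continue
        # Too large: cut into consecutive windows of at most max_tokens tokens.
        # Joining with spaces is only valid for the whitespace stand-in; a real
        # tokenizer would need its own decode step here.
        for start in range(0, len(tokens), max_tokens):
            safe_chunks.append(" ".join(tokens[start:start + max_tokens]))
    return safe_chunks


if __name__ == "__main__":
    semantic_chunks = ["a b c", "one two three four five"]
    print(enforce_max_tokens(semantic_chunks, max_tokens=2))
```

In a subclass, this would presumably run on the text of each chunk the semantic splitter returns, before the chunks are embedded. The naive split ignores sentence boundaries, which is the trade-off for a hard upper bound on chunk size.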

Reference:
run-llama/llama_index#12270

a-gorczew · 2024-07-16T12:34:28Z

I'm observing the same issue, and not just sometimes but for every library I'm trying to parse with it. Unless this is fixed, the node parser seems useless. The error I'm observing:

\venv\lib\site-packages\openai\_base_client.py", line 993, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8193 tokens (8193 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

seankim658 · 2024-07-17T04:09:18Z

@a-gorczew yeah, I haven't played around with it much since initially running into the chunk size issue. I think I tried it with a few different breakpoint_percentile_threshold values but not much else, as it's been low priority.

seankim658 added the bug (Something isn't working) label Jul 8, 2024

seankim658 self-assigned this Jul 8, 2024
