Improve handling of large documents #90

Open
daavoo opened this issue Jan 14, 2025 · 1 comment
Labels
data cleaning (Related to the data cleaning module), data loading (Related to the data loading module), help wanted (Extra attention is needed)

Comments


daavoo commented Jan 14, 2025

Motivation

The current implementation can't process documents that are larger than the model input context.

In the current implementation, the text is trimmed to only keep the tokens that fit the model context:

# ~4 characters per token is considered a reasonable default.
max_characters = text_model.n_ctx() * 4
if len(clean_text) > max_characters:
    st.warning(
        f"Input text is too big ({len(clean_text)})."
        f" Using only a subset of it ({max_characters})."
    )
    clean_text = clean_text[:max_characters]

This behavior is not ideal because:

  • Information is lost
    The final podcast only contains content related to the part that was kept.
  • (Occasionally) causes unexpected behavior
    It seems that the model tends to collapse when the input text is close to its context limit.

Alternatives

Since we want to keep the blueprint suitable for low-compute resources, simply using a model with a larger context is not a viable solution.

We can explore smarter ways of handling large documents.

For example, we can try to split the large document and iteratively feed it to the text-to-text model.
Prompts would need to be adjusted to make sure that the first and last chunks are processed differently, adding an introduction and closure.
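A minimal sketch of that approach, assuming a hypothetical generate(prompt) callable that wraps the text-to-text model and a simple character-based chunk size derived from the model context; the helper names and prompt wording below are illustrative, not the blueprint's actual API:

# Hypothetical sketch: split the cleaned text into context-sized chunks and
# process each one with a position-aware prompt.
def chunk_text(text: str, max_characters: int) -> list[str]:
    # Naive fixed-size split; a real implementation would prefer paragraph
    # or sentence boundaries to avoid cutting mid-thought.
    return [text[i : i + max_characters] for i in range(0, len(text), max_characters)]

def process_document(clean_text: str, max_characters: int, generate) -> str:
    chunks = chunk_text(clean_text, max_characters)
    script_parts = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Write the podcast introduction and cover this text:\n{chunk}"
        elif i == len(chunks) - 1:
            prompt = f"Continue the podcast, cover this text, and add a closing:\n{chunk}"
        else:
            prompt = f"Continue the podcast and cover this text:\n{chunk}"
        script_parts.append(generate(prompt))
    return "\n".join(script_parts)

Each call then only ever sees a single chunk, so the input stays below the context limit regardless of the document size.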

daavoo added the data cleaning, data loading, and help wanted labels on Jan 14, 2025
@HackyRoot

How about this? We can split the large documents into small chunks, as you suggested. Along with each chunk, we can pass the context of the previous chunk in the form of a summary to ensure that there's continuity between the chunks.

For the closure part, we can just reuse all of these chunk summaries. Not sure about the introduction, though.
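A rough sketch of that idea, reusing the same hypothetical generate(prompt) wrapper as above; after each chunk is processed, the model is asked for a short summary that is prepended to the next chunk's prompt, and the accumulated summaries drive the closing segment:

def process_with_summaries(clean_text: str, max_characters: int, generate) -> str:
    chunks = [clean_text[i : i + max_characters]
              for i in range(0, len(clean_text), max_characters)]
    script_parts = []
    summaries = []
    for chunk in chunks:
        # Carry forward only the latest summary so the prompt stays small.
        context = f"Summary of the podcast so far: {summaries[-1]}\n" if summaries else ""
        prompt = f"{context}Continue the podcast script covering this text:\n{chunk}"
        script_parts.append(generate(prompt))
        # Summarize the chunk just processed to give the next chunk continuity.
        summaries.append(generate(f"Summarize this in two sentences:\n{chunk}"))
    # Reuse all chunk summaries for the closing recap.
    closing = generate(
        "Write a short podcast closing that recaps these points:\n" + "\n".join(summaries)
    )
    return "\n".join(script_parts + [closing])

Keeping only the latest summary (rather than the whole running script) is one way to bound the prompt size per call; whether one or several summaries are carried forward is a tunable detail.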
