Improve handling of large documents #90

Open
daavoo opened this issue Jan 14, 2025 · 1 comment
Labels
data cleaning (Related to the data cleaning module), data loading (Related to the data loading module), help wanted (Extra attention is needed)

Comments


daavoo commented Jan 14, 2025

Motivation

The current implementation can't process documents that are larger than the model input context.

In the current implementation, the text is trimmed to only keep the tokens that fit the model context:

# ~4 characters per token is considered a reasonable default.
max_characters = text_model.n_ctx() * 4
if len(clean_text) > max_characters:
    st.warning(
        f"Input text is too big ({len(clean_text)})."
        f" Using only a subset of it ({max_characters})."
    )
    clean_text = clean_text[:max_characters]

This behavior is not ideal because:

  • Information is lost
    The final podcast only contains content related to the part that was kept.
  • (Occasionally) causes unexpected behavior
    It seems that the model tends to collapse when the input text is close to its context limit.

Alternatives

Since we want to keep the blueprint suitable for low-compute resources, simply using a model with a larger context is not a viable solution.

We can explore smarter ways of handling large documents.

For example, we can try to split the large document and iteratively feed it to the text-to-text model.
Prompts would need to be adjusted to make sure that the first and last chunks are processed differently, adding an introduction and closure.
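A minimal sketch of that approach, assuming a hypothetical generate(prompt) callable that wraps the text-to-text model and a simple character-based chunk size derived from the model context; the helper names and prompt wording below are illustrative, not the blueprint's actual API:

# Hypothetical sketch: split the cleaned text into context-sized chunks and
# process each one with a position-aware prompt.
def chunk_text(text: str, max_characters: int) -> list[str]:
    # Naive fixed-size split; a real implementation would prefer paragraph
    # or sentence boundaries to avoid cutting mid-thought.
    return [text[i : i + max_characters] for i in range(0, len(text), max_characters)]

def process_document(clean_text: str, max_characters: int, generate) -> str:
    chunks = chunk_text(clean_text, max_characters)
    script_parts = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Write the podcast introduction and cover this text:\n{chunk}"
        elif i == len(chunks) - 1:
            prompt = f"Continue the podcast, cover this text, and add a closing:\n{chunk}"
        else:
            prompt = f"Continue the podcast and cover this text:\n{chunk}"
        script_parts.append(generate(prompt))
    return "\n".join(script_parts)

Each call then only ever sees a single chunk, so the input stays below the context limit regardless of the document size.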

daavoo added the data cleaning, data loading, and help wanted labels on Jan 14, 2025
@HackyRoot

How about this? We can split the large documents into small chunks, as you suggested. Along with each chunk, we can pass the context of the previous chunk in the form of a summary to ensure that there's continuity between the chunks.

For the closure part, we can just reuse all of these chunk summaries. Not sure about the introduction, though.
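A rough sketch of that idea, reusing the same hypothetical generate(prompt) wrapper as above; after each chunk is processed, the model is asked for a short summary that is prepended to the next chunk's prompt, and the accumulated summaries drive the closing segment:

def process_with_summaries(clean_text: str, max_characters: int, generate) -> str:
    chunks = [clean_text[i : i + max_characters]
              for i in range(0, len(clean_text), max_characters)]
    script_parts = []
    summaries = []
    for chunk in chunks:
        # Carry forward only the latest summary so the prompt stays small.
        context = f"Summary of the podcast so far: {summaries[-1]}\n" if summaries else ""
        prompt = f"{context}Continue the podcast script covering this text:\n{chunk}"
        script_parts.append(generate(prompt))
        # Summarize the chunk just processed to give the next chunk continuity.
        summaries.append(generate(f"Summarize this in two sentences:\n{chunk}"))
    # Reuse all chunk summaries for the closing recap.
    closing = generate(
        "Write a short podcast closing that recaps these points:\n" + "\n".join(summaries)
    )
    return "\n".join(script_parts + [closing])

Keeping only the latest summary (rather than the whole running script) is one way to bound the prompt size per call; whether one or several summaries are carried forward is a tunable detail.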
