Improve handling of large documents #90
Labels: data cleaning (Related to the data cleaning module), data loading (Related to the data loading module), help wanted (Extra attention is needed)
Motivation
The current implementation can't process documents that are larger than the model's input context.
Instead, the text is trimmed to keep only the tokens that fit the model context:
document-to-podcast/demo/app.py, lines 128 to 135 in e26a3de
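For readers skimming the issue, the trimming at that location is roughly of this shape (a sketch only; the function name, parameters, and the chars-per-token heuristic are assumptions, not the repository's code):

```python
def trim_to_context(clean_text: str, context_size_tokens: int, chars_per_token: int = 4) -> str:
    """Hard-trim the cleaned text so it fits the model context (sketch of the current behavior)."""
    max_characters = context_size_tokens * chars_per_token
    # Everything past this point is silently dropped and never reaches the podcast.
    return clean_text[:max_characters]
```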
This behavior is not ideal because:
- The final podcast only contains content related to the part of the document that was kept.
- It seems that the model tends to collapse when the input text is close to its context limit.
Alternatives
Since we want to keep the blueprint suitable for low-compute environments, simply switching to a model with a larger context is not a viable solution.
We can explore smarter ways of handling large documents.
For example, we can try to split the large document into chunks and feed them iteratively to the text-to-text model.
Prompts would need to be adjusted so that the first and last chunks are processed differently, adding an introduction and a closing; a rough sketch follows.
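A minimal sketch of what this could look like, assuming a plain-Python splitting step and a generic `generate(prompt, chunk)` callable standing in for the blueprint's text-to-text call; all names and prompt strings below are hypothetical, not part of the current codebase:

```python
from typing import Callable, Iterator

# Hypothetical prompts: each chunk gets a role-specific instruction.
INTRO_PROMPT = "Start the podcast with a short introduction, then discuss this text:"
BODY_PROMPT = "Continue the ongoing podcast conversation, discussing this text:"
CLOSING_PROMPT = "Discuss this final text, then wrap up the podcast with a closing:"
SINGLE_PROMPT = "Write a full podcast episode, with introduction and closing, about this text:"


def split_into_chunks(text: str, max_chars: int) -> Iterator[str]:
    """Yield pieces of at most ~max_chars, preferring paragraph boundaries.

    A single paragraph longer than max_chars is yielded as-is and would
    still need further splitting (e.g. by sentence).
    """
    current = ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > max_chars:
            yield current
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        yield current


def generate_podcast_script(
    text: str,
    generate: Callable[[str, str], str],
    max_chars: int = 4 * 4096,  # rough chars-per-token estimate times context size
) -> str:
    """Feed a large document to the text-to-text model one chunk at a time.

    `generate(prompt, chunk)` stands in for whatever call the blueprint uses
    to turn text into a script section; the results are concatenated.
    """
    chunks = list(split_into_chunks(text, max_chars))
    script_parts = []
    for i, chunk in enumerate(chunks):
        is_first, is_last = i == 0, i == len(chunks) - 1
        if is_first and is_last:
            prompt = SINGLE_PROMPT
        elif is_first:
            prompt = INTRO_PROMPT
        elif is_last:
            prompt = CLOSING_PROMPT
        else:
            prompt = BODY_PROMPT
        script_parts.append(generate(prompt, chunk))
    return "\n".join(script_parts)
```

To keep the conversation coherent across chunks, `generate` could also be passed a short summary or the tail of the script produced so far, so each call knows where the previous one left off.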