New lines no longer included in extract_text() on 4.3 for a specific PDF file #2777
Labels
is-regression
Regression introduced as a side-effect of another change
workflow-text-extraction
From a user's perspective, text extraction is the affected feature/workflow
Hi! The ATP rankings are published as a PDF that I'm trying to parse, but since pypdf 4.3 calling extract_text() no longer includes newline characters. This worked fine on pypdf 4.2, so I did a git bisect. That suggests that this issue was introduced in commit 23a81ba.
Environment
This is with Python 3.12.4 in a venv on Debian testing.
Code + PDF
The following PDF is the first page of the published results for Jul 22, 2024:
singles_entry_numerical_2024_07_22_firstpage.pdf
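The original snippet is not reproduced here, so the following is only a minimal sketch of how the behaviour could be checked. It assumes the PDF above is saved in the current directory. The helper names `newline_stats` and `extract_first_page` are made up for illustration; `PdfReader` and `extract_text()` are the regular pypdf calls.

```python
from pathlib import Path


def newline_stats(text: str) -> dict:
    """Summarise how many newline characters a piece of extracted text contains."""
    return {
        "chars": len(text),
        "newlines": text.count("\n"),
        "lines": len(text.splitlines()),
    }


def extract_first_page(pdf_path: str) -> str:
    """Extract the text of the first page (hypothetical helper)."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[0].extract_text()


if __name__ == "__main__":
    # Assumption: the attached file sits next to this script.
    path = Path("singles_entry_numerical_2024_07_22_firstpage.pdf")
    if path.exists():
        import pypdf

        text = extract_first_page(str(path))
        print("pypdf", pypdf.__version__, newline_stats(text))
    else:
        print(f"{path} not found; download the attachment first")
```

Running the same script in two virtual environments, one with pypdf 4.2 and one with 4.3, should make the difference in the `newlines` count visible without having to compare the full output by eye.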
When running this with pypdf 4.2, the extracted text contains new line characters just fine:
But on 4.3, new lines are no longer included:
Traceback
N/A