Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont #2605

zailushang2006 · 2024-04-22T08:19:57Z

I need to extract text from a PDF document using page.extract_text(), but all the extracted Chinese characters are garbled. I suspect the cause is that this PDF uses special embedded Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. Stepping through the pypdf source code in a debugger, I found that the /Font->/Encoding->/Differences table maps character codes to non-standard glyph names, as follows:

{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
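To make the table above easier to read, here is a minimal sketch that expands a /Differences array into a code-to-glyph-name mapping and flags names that look like opaque placeholders. The helper names `parse_differences` and `opaque_glyph_names` are hypothetical and not part of pypdf, and the `G` plus hex pattern is only an assumption based on the names seen in this file. Names such as /G23 are not standard glyph names, so without a /ToUnicode map an extractor has little to go on when mapping them to real characters.

```python
import re


def parse_differences(differences):
    """Expand a PDF /Differences array into {character code: glyph name}.

    An integer sets the current character code; each following name is
    assigned to that code, which then advances by one.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = str(item).lstrip("/")
            code += 1
    return mapping


# Assumption: placeholder names in this file look like "G" + 2-4 hex digits.
OPAQUE_NAME = re.compile(r"G[0-9A-Fa-f]{2,4}")


def opaque_glyph_names(mapping):
    """Return the glyph names that match the placeholder pattern."""
    return sorted(name for name in mapping.values() if OPAQUE_NAME.fullmatch(name))


differences = [35, "/G23", 36, "/G24", 37, "/G25"]
mapping = parse_differences(differences)
print(mapping)
print(opaque_glyph_names(mapping))
```

If every name in the mapping is flagged, the encoding carries no usable character information by itself.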

The font file under /Font->/FontDescriptor->/FontFile3 is decoded with the specified /Filter (/FlateDecode), but the decoded font data still looks garbled.
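A /FontFile3 stream holds a binary font program (CFF or OpenType) rather than text, so the decoded bytes are expected to look garbled when printed. The sketch below is a rough heuristic, not a validator: the helper `classify_font_program` is hypothetical, and the synthetic bytes only mimic a font header. With pypdf, the decoded bytes could presumably be obtained from the stream object's `get_data()` method; that step is assumed here and not shown.

```python
import zlib


def classify_font_program(data: bytes) -> str:
    """Guess the kind of font program from its first bytes (heuristic only)."""
    if data[:4] == b"OTTO":
        return "OpenType font with CFF outlines"
    if data[:4] in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType font"
    # A bare CFF header is: major version, minor version, header size, offset size.
    if len(data) >= 4 and data[0] == 1 and 1 <= data[3] <= 4:
        return "bare CFF font program"
    return "unrecognised"


# Synthetic stand-in for a FlateDecode-compressed /FontFile3 stream.
compressed = zlib.compress(b"\x01\x00\x04\x02" + b"\x00" * 16)
decoded = zlib.decompress(compressed)  # what /FlateDecode yields
print(classify_font_program(decoded))
```

A recognised header suggests the stream decoded correctly and that the problem lies in the missing character mapping, not in the decompression.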

Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0

Code + PDFex

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Path to the attached sample file; adjust to your local copy.
pdf_path = "GB+15322.2-2019.pdf"

reader = PdfReader(pdf_path)

number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")

# Only page index 3 (the fourth page) is inspected.
page = reader.pages[3]
text = page.extract_text()
print(text[:5000])

Share here the PDF file(s) that cause the issue.
GB+15322.2-2019.pdf

Traceback

This is the complete traceback I see:

page 3 (start 0):

print result:


pubpub-zz · 2024-04-23T21:19:38Z

The fact that Adobe can display the glyphs (images or drawings) does not mean it can associate them with characters. Copy and paste from Acrobat Reader, pdf.js (Firefox), or PDFium (Chrome) does not produce usable results either. I strongly doubt there is an easy way to extract the data. The only approach I see is to render the pages to images and then run OCR on them to extract the text. This is beyond pypdf's capabilities.

stefan6419846 · 2024-04-24T05:27:11Z

As far as I could see yesterday, pdftotext/poppler does produce somewhat valid results for page 4.
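For anyone who wants to compare against poppler, a minimal sketch of calling pdftotext from Python follows. It assumes the `pdftotext` binary from poppler-utils is installed and on PATH; the helper names are hypothetical, and the file name is the one attached to this issue. Page numbers for pdftotext are 1-based, so page 4 here corresponds to index 3 in pypdf.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, page):
    """Build the pdftotext argument list for a single 1-based page."""
    # -f/-l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, "-"]


def extract_page_with_poppler(pdf_path, page):
    """Run pdftotext on one page and return its output as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed or not on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, page), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and shutil.which("pdftotext"):
    try:
        print(extract_page_with_poppler("GB+15322.2-2019.pdf", 4)[:500])
    except subprocess.CalledProcessError as exc:
        print(f"pdftotext failed: {exc}")
```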

stefan6419846 added the labels workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow) and is-cjk-issue (Issue related to CJK (Chinese-Japanese-Korean)) · Apr 23, 2024

stefan6419846 changed the title from ~~extract_text extract text Error. /BaseFont is /TJQCZS+FzBookMaker2DlFont20536874081.~~ to "Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont" · Apr 24, 2024
