Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont #2605
Labels
is-cjk-issue: Issue related to CJK (Chinese-Japanese-Korean)
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
I need to extract text from a PDF document using the `page.extract_text` function, but all the extracted Chinese characters are garbled. I suspect the cause is that this PDF uses special Chinese fonts such as `/TJQCZS+FzBookMaker2DlFont`. I stepped through the pypdf source code in a debugger and found that the `/Font` -> `/Encoding` -> `/Differences` table maps character codes to non-standard glyph names, as follows: `{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}`
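To make the table above easier to read, here is a minimal sketch in plain Python (no pypdf calls) that expands a `/Differences` array into a code-to-glyph-name mapping. The helper name `parse_differences` is hypothetical and only for illustration. In such an array, each integer sets the current character code and each following name is assigned to consecutive codes. The sketch also checks that every glyph name here is just `/G` plus the hex value of its own code, which suggests the names carry no Unicode information by themselves.

```python
def parse_differences(differences):
    """Expand a /Differences array into {character code: glyph name}.

    Hypothetical helper for illustration; not part of pypdf.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer resets the current character code.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            # A name is assigned to the current code, then the code advances.
            mapping[code] = item
            code += 1
    return mapping


# The array reported in this issue.
differences = [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
               40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
               45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31']

mapping = parse_differences(differences)

# True when every name is "/G" followed by the two-digit hex of its code.
names_are_synthetic = all(
    name == f"/G{code:02X}" for code, name in mapping.items()
)
print(mapping[35], mapping[42], names_are_synthetic)
```

If the names are synthetic like this, a standard glyph-name lookup cannot turn them into Chinese characters; the real mapping would have to come from elsewhere, for example a `/ToUnicode` CMap if the font has one (an assumption, not verified for this file).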
The font file under `/Font` -> `/FontDescriptor` -> `/FontFile3` is decoded using the specified `/Filter: /FlateDecode`, but the decoded font data also looks garbled. Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.
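One possible reason the decoded stream looks garbled is that `/FontFile3` holds a binary font program (bare CFF or OpenType), not text, so it is unreadable even after `/FlateDecode` has been undone correctly. The sketch below is a rough check under that assumption: it uses only the standard `zlib` module, the function name `sniff_font_program` is hypothetical, and the sample bytes are made up, not taken from the attached PDF.

```python
import zlib


def sniff_font_program(data: bytes) -> str:
    """Guess the container format of a decoded font stream from its first bytes.

    Hypothetical helper for illustration; a heuristic, not a validator.
    """
    if data[:4] == b"OTTO":
        return "OpenType (CFF outlines)"
    if data[:4] in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType/OpenType (glyf outlines)"
    # A bare CFF header is: major version 1, minor version, header size,
    # and an offset size between 1 and 4.
    if len(data) >= 4 and data[0] == 1 and 1 <= data[3] <= 4:
        return "bare CFF (Type1C / CIDFontType0C)"
    return "unknown"


# Made-up stand-in for a /FlateDecode stream: a zlib-compressed CFF-like header.
compressed = zlib.compress(b"\x01\x00\x04\x02" + b"\x00" * 16)
decoded = zlib.decompress(compressed)

kind = sniff_font_program(decoded)
print(kind)
```

If the real stream is recognized as a font program, it would need to be parsed by a font library rather than read as text; whether it contains any usable character mapping for this document is not known from the information given here.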
Environment
Which environment were you using when you encountered the problem?
Code + PDFex
This is a minimal, complete example that shows the issue:
Share here the PDF file(s) that cause the issue.
GB+15322.2-2019.pdf
Traceback
This is the complete traceback I see:
page 3 (start 0):
print result: