Error referencing a non-existent page destination when writing PDF #2842

Closed
ssjkamei opened this issue Sep 13, 2024 · 2 comments · Fixed by #2857
Labels
is-robustness-issue: From a user's perspective, this is about robustness
PdfWriter: The PdfWriter component is affected

Comments

@ssjkamei

PdfWriter.append raises an IndexError while merging a PDF whose named destinations refer to pages that do not exist.

Environment

Which environment were you using when you encountered the problem?

> python -m platform
Windows-10-10.0.22631-SP0

> python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('cryptography', '41.0.5'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter


def test_write_pdf():
    filepath = r"C:\test.pdf"
    with open(filepath, "rb") as f:
        pdf_writer = PdfWriter()
        # The second positional argument of PdfReader is `strict`.
        pdf_reader = PdfReader(f, strict=True)
        print(pdf_reader.metadata)
        print(pdf_reader.named_destinations)
        # Raises IndexError: sequence index out of range
        pdf_writer.append(pdf_reader)

Sorry, we cannot provide the PDF yet.
We are checking whether we can create a PDF that can be published without any problems.

Traceback

This is the complete traceback I see:

venv\venv\Lib\site-packages\pypdf\_writer.py:2365: in append
    self.merge(
venv\venv\Lib\site-packages\pypdf\_writer.py:2474: in merge
    p = reader.pages[dest["/Page"]]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pypdf._page._VirtualList object at 0x0000025E7C033650>, index = 1

    def __getitem__(
        self, index: Union[int, slice]
    ) -> Union[PageObject, Sequence[PageObject]]:
        if isinstance(index, slice):
            indices = range(*index.indices(len(self)))
            cls = type(self)
            return cls(indices.__len__, lambda idx: self[indices[idx]])
        if not isinstance(index, int):
            raise TypeError("sequence indices must be integers")
        len_self = len(self)
        if index < 0:
            # support negative indexes
            index = len_self + index
        if index < 0 or index >= len_self:
>           raise IndexError("sequence index out of range")
E           IndexError: sequence index out of range

The catalog below may be the cause: it references 46 0 obj (/AcroForm) and 20 0 obj (/Dests), neither of which exists in the file.
I tried to fix this in Adobe Acrobat, but could not figure out how to remove the references.

30 0 obj
<</AcroForm 46 0 R/Dests 20 0 R/Extensions<</ADBE<</BaseVersion/1.7/ExtensionLevel 8>>>>/Metadata 5 0 R/Names 47 0 R/OCProperties<</D<</OFF[]/Order[]/RBGroups[]>>/OCGs[48 0 R 49 0 R 50 0 R]>>/Pages 18 0 R/StructTreeRoot 14 0 R/Type/Catalog>>
endobj
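
One rough way to spot such dangling references is to compare the object numbers that are referenced with the ones that are actually defined. The sketch below is only an illustration and is not part of pypdf: `find_dangling_references` is a hypothetical helper, the sample text is an abbreviated stand-in for the real catalog, and the regex scan assumes an uncompressed, text-readable PDF (objects stored in object streams would be reported as missing).

```python
import re

# "46 0 R" style indirect references; the lookbehind avoids matching
# the tail of a longer token such as "1.7".
REF_PATTERN = re.compile(r"(?<![\w.])(\d+)\s+(\d+)\s+R\b")
# "30 0 obj" style object definitions at the start of a line.
OBJ_PATTERN = re.compile(r"^\s*(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)


def find_dangling_references(pdf_text: str) -> list[tuple[int, int]]:
    """Return (object number, generation) pairs that are referenced but never defined."""
    defined = {(int(n), int(g)) for n, g in OBJ_PATTERN.findall(pdf_text)}
    referenced = {(int(n), int(g)) for n, g in REF_PATTERN.findall(pdf_text)}
    return sorted(referenced - defined)


# Abbreviated stand-in for the catalog above (not the real file).
SAMPLE = """\
5 0 obj
<</Type/Metadata>>
endobj
30 0 obj
<</AcroForm 46 0 R/Dests 20 0 R/Metadata 5 0 R/Type/Catalog>>
endobj
"""

print(find_dangling_references(SAMPLE))
```

For the abbreviated sample, the helper reports objects 20 and 46 as referenced but undefined, matching the observation above.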

When the file is read with PdfReader, named_destinations contains the following. The PDF has only one page, so the entries with '/Page': 1 and '/Page': 2 point at pages that do not exist.

{'/__WKANCHOR_2': {'/Title': '/__WKANCHOR_2', '/Page': 0, '/Type': '/XYZ', '/Left': 36, '/Top': 754, '/Zoom': 0.0},
 '/__WKANCHOR_4': {'/Title': '/__WKANCHOR_4', '/Page': 0, '/Type': '/XYZ', '/Left': 305, '/Top': 754, '/Zoom': 0.0},
 '/__WKANCHOR_6': {'/Title': '/__WKANCHOR_6', '/Page': 0, '/Type': '/XYZ', '/Left': 36, '/Top': 454, '/Zoom': 0.0},
 '/__WKANCHOR_8': {'/Title': '/__WKANCHOR_8', '/Page': 1, '/Type': '/XYZ', '/Left': 61, '/Top': 802, '/Zoom': 0.0},
 '/__WKANCHOR_a': {'/Title': '/__WKANCHOR_a', '/Page': 1, '/Type': '/XYZ', '/Left': 36, '/Top': 425, '/Zoom': 0.0},
 '/__WKANCHOR_c': {'/Title': '/__WKANCHOR_c', '/Page': 2, '/Type': '/XYZ', '/Left': 36, '/Top': 814, '/Zoom': 0.0},
 '/__WKANCHOR_e': {'/Title': '/__WKANCHOR_e', '/Page': 2, '/Type': '/XYZ', '/Left': 36, '/Top': 703, '/Zoom': 0.0}}

I was able to avoid the error by adding the bounds check if len(reader.pages) > dest["/Page"]: on the PdfWriter side.

pypdf/pypdf/_writer.py

Lines 2471 to 2482 in 8f62120

            elif isinstance(dest["/Page"], int):
                # the page reference is a page number normally not a PDF Reference
                # page numbers as int are normally accepted only in external goto
                p = reader.pages[dest["/Page"]]
                assert p.indirect_reference is not None
                try:
                    arr[NumberObject(0)] = NumberObject(
                        srcpages[p.indirect_reference.idnum].page_number
                    )
                    self.add_named_destination_array(dest["/Title"], arr)
                except KeyError:
                    pass

            elif isinstance(dest["/Page"], int):
                # the page reference is a page number normally not a PDF Reference
                # page numbers as int are normally accepted only in external goto
                if len(reader.pages) > dest["/Page"]:
                    p = reader.pages[dest["/Page"]]
                    assert p.indirect_reference is not None
                    try:
                        arr[NumberObject(0)] = NumberObject(
                            srcpages[p.indirect_reference.idnum].page_number
                        )
                        self.add_named_destination_array(dest["/Title"], arr)
                    except KeyError:
                        pass
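
The effect of that guard can be modelled without pypdf. The sketch below is a simplified illustration and not the library's code: `keep_resolvable_destinations` is a hypothetical helper that works on a plain dict shaped like the named_destinations output above. It is slightly stricter than the check in the patch because it also rejects negative page numbers.

```python
from typing import Any, Dict


def keep_resolvable_destinations(
    named_destinations: Dict[str, Dict[str, Any]], page_count: int
) -> Dict[str, Dict[str, Any]]:
    """Drop destinations whose integer '/Page' lies outside 0..page_count-1."""
    kept = {}
    for title, dest in named_destinations.items():
        page = dest.get("/Page")
        # Only integer page numbers are range-checked here; other kinds
        # of page reference are passed through unchanged.
        if isinstance(page, int) and not 0 <= page < page_count:
            continue
        kept[title] = dest
    return kept


# Excerpt of the destinations reported above; the PDF has one page.
destinations = {
    "/__WKANCHOR_2": {"/Title": "/__WKANCHOR_2", "/Page": 0},
    "/__WKANCHOR_8": {"/Title": "/__WKANCHOR_8", "/Page": 1},
    "/__WKANCHOR_c": {"/Title": "/__WKANCHOR_c", "/Page": 2},
}

print(list(keep_resolvable_destinations(destinations, page_count=1)))
```

With a page count of 1, only the destination on page 0 survives, which mirrors the destinations that end up in the written file.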

With this check in place, the resulting PDF contains the following Dests entry. Only the three destinations on page 0 are written.

15 0 obj
<<
/Dests 16 0 R
>>
endobj
16 0 obj
<<
/Names [ (\057\137\137WKANCHOR\1372) [ 0 /XYZ 36 754 0.0 ] (\057\137\137WKANCHOR\1374) [ 0 /XYZ 305 754 0.0 ] (\057\137\137WKANCHOR\1376) [ 0 /XYZ 36 454 0.0 ] ]
>>
@pubpub-zz
Collaborator

I'm confused about your code: what do you intend to do? I would have passed the PDF stream, or better, the PDF file name, as the argument when creating the PdfWriter.
Can you check whether this works better?

@stefan6419846 added the labels PdfWriter (The PdfWriter component is affected), is-robustness-issue (From a user's perspective, this is about robustness), and needs-pdf (The issue needs a PDF file to show the problem) on Sep 13, 2024
@ssjkamei
Author

I could not delete the font data, but here is the relevant PDF.
test.pdf

> I'm confused about your code: what do you intend to do? I would have passed the PDF stream, or better, the PDF file name, as the argument when creating the PdfWriter.
> Can you check whether this works better?

That does not help.
Originally, I passed the file directly to PdfWriter. I traced the code and confirmed that it converts the input to a PdfReader internally.

The error surfaces when writing, but it is probably triggered by the content of named_destinations, so I think the root cause is in PdfReader. That is why the example creates the PdfReader explicitly.

3 participants