Handling of root objects without a Type #3164

Open
stefan6419846 opened this issue Feb 28, 2025 · 0 comments · May be fixed by #3175
Labels
is-robustness-issue (From a user's perspective, this is about robustness)
PdfReader (The PdfReader component is affected)

Comments

@stefan6419846
Collaborator

I am currently trying to handle some partially broken PDF files whose root objects do not carry a /Type entry and which therefore fail the check in

pypdf/pypdf/_reader.py

Lines 210 to 211 in b7f3811

cast(DictionaryObject, cast(PdfObject, root).get_object()).get("/Type")
== "/Catalog"
and finally the same check in the fallback search:
if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
and then running into the final check before the error is raised:
if self._validated_root is None:
Setting self._validated_root = root.get_object() as a fallback seems to work in this case, but it probably has other side effects.
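
To make the proposed fallback concrete, here is a minimal sketch of the lookup order using plain dictionaries instead of pypdf objects. The function name pick_catalog, the PdfDict alias and the requirement that the untyped root contains /Pages are assumptions for illustration only; this is not pypdf's actual implementation.

```python
from typing import Any, Dict, Optional

PdfDict = Dict[str, Any]  # plain-dict stand-in for pypdf's DictionaryObject


def pick_catalog(root: Optional[PdfDict], objects: Dict[int, Any]) -> PdfDict:
    """Return the document catalog, tolerating a root without /Type.

    Hypothetical helper: mirrors the order of checks described in the issue.
    """
    # 1. Strict check: the trailer's /Root is explicitly typed as a catalog.
    if isinstance(root, dict) and root.get("/Type") == "/Catalog":
        return root
    # 2. Search all known objects for one typed as a catalog.
    for obj in objects.values():
        if isinstance(obj, dict) and obj.get("/Type") == "/Catalog":
            return obj
    # 3. Lenient fallback (assumption): accept the untyped root as long as
    #    it at least references a page tree.
    if isinstance(root, dict) and "/Pages" in root:
        return root
    raise ValueError("Cannot find Root object in pdf")


# Root object as in the broken file: /Pages and /Metadata, but no /Type.
broken_root = {"/Pages": "3 0 R", "/Metadata": "4 0 R"}
objects = {2: broken_root, 3: {"/Type": "/Pages", "/Kids": [], "/Count": 0}}
catalog = pick_catalog(broken_root, objects)
```

Requiring /Pages in step 3 is one way to limit the side effects of blindly trusting the trailer's /Root; whether that is strict enough for real files is an open question.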

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.38-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]

The original file contains confidential information, so I cannot share it, but the issue should be easy enough to reproduce by tampering with a valid PDF file. The relevant root object:

2 0 obj
<<
/Pages 3 0 R
/Metadata 4 0 R
>>
endobj
{'/Pages': IndirectObject(3, 0, 140442733989840), '/Metadata': IndirectObject(4, 0, 140442733989840)}
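
Since the original file cannot be shared, a test file with the same defect can be generated instead of tampering by hand. The following is a sketch using only the standard library; the function name build_pdf_without_root_type, the object numbering and the single empty page are assumptions, and the result is a deliberately minimal file, not a copy of the original document's structure.

```python
def build_pdf_without_root_type() -> bytes:
    """Build a tiny one-page PDF whose root object lacks /Type /Catalog."""
    objects = [
        b"<< /Pages 2 0 R >>",  # root object, deliberately without /Type
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf_bytes = build_pdf_without_root_type()
```

Writing pdf_bytes to file.pdf and opening it with the example above should exercise the same code path; whether it produces exactly the traceback below may depend on the pypdf version.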

Traceback

This is the complete traceback I see (line numbers might be off):

WARNING:pypdf._reader:Invalid Root object in trailer
WARNING:pypdf._reader:Searching object with "/Catalog" key
WARNING:pypdf._reader:Object 44 0 found
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 12, in <module>
    page = reader.pages[0]
           ~~~~~~~~~~~~^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2524, in __getitem__
    len_self = len(self)
               ^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2505, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 357, in get_num_pages
    self._flatten(self._readonly)
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 1161, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_reader.py", line 234, in root_object
    raise PdfReadError("Cannot find Root object in pdf")
pypdf.errors.PdfReadError: Cannot find Root object in pdf
stefan6419846 added the PdfReader and is-robustness-issue labels Feb 28, 2025
stefan6419846 added a commit to stefan6419846/pypdf that referenced this issue Mar 12, 2025