Handling of root objects without a Type #3164

stefan6419846 · 2025-02-28T10:44:23Z

I am currently trying to handle some partially broken PDF files whose root objects do not carry a /Type, thus failing the check in

pypdf/pypdf/_reader.py

Lines 210 to 211 in b7f3811

    
cast(DictionaryObject, cast(PdfObject, root).get_object()).get("/Type")
== "/Catalog"

and finally

pypdf/pypdf/_reader.py

Line 226 in b7f3811

if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":

and running into

pypdf/pypdf/_reader.py

Line 230 in b7f3811

if self._validated_root is None:

Setting self._validated_root = root.get_object() as a fallback seems to work in this case, but it probably has other side effects.
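
The lookup order described above, together with the proposed fallback, can be modelled roughly as follows. This is a simplified sketch using plain dicts rather than pypdf's own objects; `pick_root` and its parameters are hypothetical names, and the extra `/Pages` check is an assumption that goes beyond the one-line fallback mentioned above.

```python
from typing import Any, Mapping, Optional, Sequence


def pick_root(
    trailer_root: Optional[Mapping[str, Any]],
    all_objects: Sequence[Any],
) -> Mapping[str, Any]:
    """Simplified model of the root lookup (hypothetical helper, not pypdf API)."""
    # 1. Prefer the trailer's /Root if it declares itself a catalog.
    if isinstance(trailer_root, Mapping) and trailer_root.get("/Type") == "/Catalog":
        return trailer_root

    # 2. Otherwise scan all objects for one that declares itself a catalog.
    for obj in all_objects:
        if isinstance(obj, Mapping) and obj.get("/Type") == "/Catalog":
            return obj

    # 3. Fallback: accept the trailer's /Root without /Type.
    #    Assumption: requiring /Pages limits the risk of accepting garbage.
    if isinstance(trailer_root, Mapping) and "/Pages" in trailer_root:
        return trailer_root

    raise ValueError("Cannot find Root object in pdf")
```

With the root object from this report, steps 1 and 2 find nothing, so step 3 returns the trailer root instead of raising.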

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.38-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]

The original file contains confidential information, but this should be easy enough to reproduce by tampering with a valid PDF file. The relevant root object:

2 0 obj
<<
/Pages 3 0 R
/Metadata 4 0 R
>>
endobj

{'/Pages': IndirectObject(3, 0, 140442733989840), '/Metadata': IndirectObject(4, 0, 140442733989840)}
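
One possible way to tamper with a valid file for reproduction is to blank out the catalog type in the raw bytes. The sketch below is only an illustration: `strip_catalog_type` is a hypothetical helper, and it assumes the catalog is stored uncompressed (not inside an object stream). It pads with spaces so that byte offsets in the cross-reference table stay valid.

```python
import re


def strip_catalog_type(data: bytes) -> bytes:
    """Blank out '/Type /Catalog' entries without changing the file length.

    Hypothetical helper; assumes the catalog dictionary is not compressed.
    """
    pattern = re.compile(rb"/Type\s*/Catalog")
    # Replace each match with the same number of spaces to keep offsets intact.
    return pattern.sub(lambda m: b" " * len(m.group(0)), data)


if __name__ == "__main__":
    # 'valid.pdf' and 'file.pdf' are placeholder paths.
    with open("valid.pdf", "rb") as src:
        tampered = strip_catalog_type(src.read())
    with open("file.pdf", "wb") as dst:
        dst.write(tampered)
```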

Traceback

This is the complete traceback I see (line numbers might be off):

WARNING:pypdf._reader:Invalid Root object in trailer
WARNING:pypdf._reader:Searching object with "/Catalog" key
WARNING:pypdf._reader:Object 44 0 found
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 12, in <module>
    page = reader.pages[0]
           ~~~~~~~~~~~~^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2524, in __getitem__
    len_self = len(self)
               ^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2505, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 357, in get_num_pages
    self._flatten(self._readonly)
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 1161, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_reader.py", line 234, in root_object
    raise PdfReadError("Cannot find Root object in pdf")
pypdf.errors.PdfReadError: Cannot find Root object in pdf



stefan6419846 added the PdfReader and is-robustness-issue labels Feb 28, 2025

stefan6419846 added a commit to stefan6419846/pypdf that referenced this issue Mar 12, 2025

ROB: Consider root objects with catalog type as fallback

c17f4e0

Closes py-pdf#3164.

stefan6419846 linked a pull request Mar 12, 2025 that will close this issue

ROB: Consider root objects without catalog type as fallback #3175

Open
