PyPDFParser does not take into account filters returned as arrays. #30098

haroldsnyers · 2025-03-04T13:37:55Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code is from langchain_community/document_loaders/parsers/pdf.py:

class PyPDFParser(BaseBlobParser):
    ...

    def extract_images_from_page(self, page: pypdf._page.PageObject) -> str:
        """Extract images from a PDF page and get the text using images_to_text.

        Args:
            page: The page object from which to extract images.

        Returns:
            str: The extracted text from the images on the page.
        """
        if not self.images_parser:
            return ""
        from PIL import Image

        if "/XObject" not in cast(dict, page["/Resources"]).keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()  # type: ignore[index]
        images = []
        for obj in xObject:
            print(f"pdf object {obj} - {xObject[obj]}")
            np_image: Any = None
            if xObject[obj]["/Subtype"] == "/Image":
                if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]

                    np_image = np.frombuffer(
                        xObject[obj].get_data(), dtype=np.uint8
                    ).reshape(height, width, -1)
                elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                    np_image = np.array(Image.open(io.BytesIO(xObject[obj].get_data())))

                else:
                    logger.warning("Unknown PDF Filter!")
                if np_image is not None:
                    image_bytes = io.BytesIO()
                    Image.fromarray(np_image).save(image_bytes, format="PNG")
                    blob = Blob.from_data(image_bytes.getvalue(), mime_type="image/png")
                    image_text = next(self.images_parser.lazy_parse(blob)).page_content
                    images.append(
                        _format_inner_image(blob, image_text, self.images_inner_format)
                    )
        return _FORMAT_IMAGE_STR.format(
            image_text=_JOIN_IMAGES.join(filter(None, images))
        )

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to parse scanned PDFs with PyPDFParser and PyTesseract. Image parsing fails because the /Filter entry of an extracted image object is sometimes an array and sometimes a string, while the parser only handles the string form.

Examples
pdf object /Im1 - {'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1653, '/Height': 2338, '/BitsPerComponent': 1, '/ColorSpace': '/DeviceGray', '/Filter': '/CCITTFaxDecode', '/DecodeParms': {'/K': -1, '/Columns': 1653, '/Rows': 2338}}
pdf object /I0 - {'/Type': '/XObject', '/Subtype': '/Image', '/Width': 2478, '/Height': 3488, '/BitsPerComponent': 8, '/ColorSpace': '/DeviceRGB', '/Filter': ['/DCTDecode'], '/DecodeParms': [{}]}
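To illustrate why the array form slips through, here is a minimal sketch in plain Python. It uses ordinary `str` and `list` values in place of the pypdf objects, and `LOSSY_FILTERS` is a hypothetical stand-in for the parser's lookup table (its real contents are assumed, not copied from the library).

```python
# Hypothetical stand-in for the parser's table of lossy filters.
LOSSY_FILTERS = ["DCTDecode", "DCT", "JPXDecode"]

string_filter = "/CCITTFaxDecode"  # same shape as the /Im1 example
array_filter = ["/DCTDecode"]      # same shape as the /I0 example

# Slicing a string drops the leading slash, which is what the parser expects.
stripped_name = string_filter[1:]

# Slicing a list drops its first element instead, leaving an empty list.
sliced_array = array_filter[1:]

# The membership test therefore fails for the array form...
array_matches = sliced_array in LOSSY_FILTERS

# ...but succeeds once the first element is taken before slicing.
first_entry_matches = array_filter[0][1:] in LOSSY_FILTERS

print(stripped_name, sliced_array, array_matches, first_entry_matches)
```

With a table like this one, the array form would fall through to the "Unknown PDF Filter!" branch and the image would be skipped.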

Suggested code fix:

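            if xObject[obj]["/Subtype"] == "/Image":
                # "/Filter" is either a single name ("/DCTDecode") or an array
                # of names (["/DCTDecode"]); for an array, use its first entry.
                # The variable is not called `filter` because that would shadow
                # the builtin used by the return statement of this method.
                raw_filter = xObject[obj]["/Filter"]
                if isinstance(raw_filter, list):
                    raw_filter = raw_filter[0]
                pdf_filter = raw_filter[1:]
                if pdf_filter in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]

                    np_image = np.frombuffer(
                        xObject[obj].get_data(), dtype=np.uint8
                    ).reshape(height, width, -1)
                elif pdf_filter in _PDF_FILTER_WITH_LOSS: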
                    np_image = np.array(Image.open(io.BytesIO(xObject[obj].get_data())))

                else:
                    logger.warning("Unknown PDF Filter!")
                if np_image is not None:
                    image_bytes = io.BytesIO()
                    Image.fromarray(np_image).save(image_bytes, format="PNG")
                    blob = Blob.from_data(image_bytes.getvalue(), mime_type="image/png")
                    image_text = next(self.images_parser.lazy_parse(blob)).page_content
                    images.append(
                        _format_inner_image(blob, image_text, self.images_inner_format)
                    )
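The fix above reads only the first entry of an array. A slightly more general sketch, under the assumption that pypdf's name and array objects behave like `str` and `list` (as the printed examples suggest), normalizes either shape into a list of names first. The helper name `normalize_filter_names` is hypothetical and not part of LangChain or pypdf.

```python
from typing import Any, List


def normalize_filter_names(raw_filter: Any) -> List[str]:
    """Return filter names without the leading slash, for either shape.

    Accepts None (no /Filter entry), a single name such as "/DCTDecode",
    or a sequence of names such as ["/DCTDecode"].
    """
    if raw_filter is None:
        return []
    # Check for str first: a string is itself iterable, character by character.
    if isinstance(raw_filter, str):
        names = [raw_filter]
    else:
        names = [str(name) for name in raw_filter]
    return [name[1:] if name.startswith("/") else name for name in names]


print(normalize_filter_names("/CCITTFaxDecode"))
print(normalize_filter_names(["/DCTDecode"]))
```

The caller could then test each returned name against the lossless and lossy tables, which also covers an image that has no /Filter entry at all.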

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8122
Python Version: 3.12.8 (main, Dec 3 2024, 18:42:41) [Clang 16.0.0 (clang-1600.0.26.4)]

Package Information

langchain_core: 0.3.40
langchain: 0.3.19
langchain_community: 0.3.18
langsmith: 0.1.129
langchain_google_community: 2.0.7
langchain_google_vertexai: 2.0.14
langchain_text_splitters: 0.3.6

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
anthropic[vertexai]: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
beautifulsoup4: 4.13.3
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
db-dtypes: Installed. No version info available.
gapic-google-longrunning: Installed. No version info available.
google-api-core: 2.24.1
google-api-python-client: 2.162.0
google-auth: 2.38.0
google-auth-httplib2: 0.2.0
google-auth-oauthlib: Installed. No version info available.
google-cloud-aiplatform: 1.82.0
google-cloud-bigquery: 3.30.0
google-cloud-bigquery-storage: Installed. No version info available.
google-cloud-contentwarehouse: Installed. No version info available.
google-cloud-core: 2.4.2
google-cloud-discoveryengine: Installed. No version info available.
google-cloud-documentai: Installed. No version info available.
google-cloud-documentai-toolbox: Installed. No version info available.
google-cloud-speech: Installed. No version info available.
google-cloud-storage: 2.19.0
google-cloud-texttospeech: Installed. No version info available.
google-cloud-translate: Installed. No version info available.
google-cloud-vision: Installed. No version info available.
googlemaps: Installed. No version info available.
grpcio: 1.70.0
httpx: 0.28.1
httpx-sse: 0.4.0
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.35: Installed. No version info available.
langchain-core<1.0.0,>=0.3.37: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.19: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2,>=1.26.4;: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pandas: 2.1.1
pyarrow: 19.0.1
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests<3,>=2: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.

eyurtsev · 2025-03-07T03:57:13Z

cc @pprados -- any idea?

pprados · 2025-03-07T09:14:36Z

@eyurtsev
I will take this ticket.

pprados · 2025-03-10T08:40:18Z

@haroldsnyers
Thank you for the suggested correction. Can you send me an example of a file that fails?

dosubot bot added the 🤖:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) on Mar 4, 2025
