I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
The following code is from langchain_community/document_loaders/parsers/pdf.py:
class PyPDFParser(BaseBlobParser):
    ...

    def extract_images_from_page(self, page: pypdf._page.PageObject) -> str:
        """Extract images from a PDF page and get the text using images_to_text.

        Args:
            page: The page object from which to extract images.

        Returns:
            str: The extracted text from the images on the page.
        """
        if not self.images_parser:
            return ""
        from PIL import Image

        if "/XObject" not in cast(dict, page["/Resources"]).keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()  # type: ignore[index]
        images = []
        for obj in xObject:
            print(f"pdf object {obj} - {xObject[obj]}")
            np_image: Any = None
            if xObject[obj]["/Subtype"] == "/Image":
                if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]
                    np_image = np.frombuffer(
                        xObject[obj].get_data(), dtype=np.uint8
                    ).reshape(height, width, -1)
                elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                    np_image = np.array(Image.open(io.BytesIO(xObject[obj].get_data())))
                else:
                    logger.warning("Unknown PDF Filter!")
                if np_image is not None:
                    image_bytes = io.BytesIO()
                    Image.fromarray(np_image).save(image_bytes, format="PNG")
                    blob = Blob.from_data(image_bytes.getvalue(), mime_type="image/png")
                    image_text = next(self.images_parser.lazy_parse(blob)).page_content
                    images.append(
                        _format_inner_image(blob, image_text, self.images_inner_format)
                    )
        return _FORMAT_IMAGE_STR.format(
            image_text=_JOIN_IMAGES.join(filter(None, images))
        )
Error Message and Stack Trace (if applicable)
No response
Description
I am trying to parse PDF scans with PyPDFParser and PyTesseract. Image parsing hits a bug because the /Filter entry of the extracted image objects is sometimes returned as an array and sometimes as a string.
System Info
Other Dependencies
aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
anthropic[vertexai]: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
beautifulsoup4: 4.13.3
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
db-dtypes: Installed. No version info available.
gapic-google-longrunning: Installed. No version info available.
google-api-core: 2.24.1
google-api-python-client: 2.162.0
google-auth: 2.38.0
google-auth-httplib2: 0.2.0
google-auth-oauthlib: Installed. No version info available.
google-cloud-aiplatform: 1.82.0
google-cloud-bigquery: 3.30.0
google-cloud-bigquery-storage: Installed. No version info available.
google-cloud-contentwarehouse: Installed. No version info available.
google-cloud-core: 2.4.2
google-cloud-discoveryengine: Installed. No version info available.
google-cloud-documentai: Installed. No version info available.
google-cloud-documentai-toolbox: Installed. No version info available.
google-cloud-speech: Installed. No version info available.
google-cloud-storage: 2.19.0
google-cloud-texttospeech: Installed. No version info available.
google-cloud-translate: Installed. No version info available.
google-cloud-vision: Installed. No version info available.
googlemaps: Installed. No version info available.
grpcio: 1.70.0
httpx: 0.28.1
httpx-sse: 0.4.0
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.35: Installed. No version info available.
langchain-core<1.0.0,>=0.3.37: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.19: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2,>=1.26.4;: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pandas: 2.1.1
pyarrow: 19.0.1
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests<3,>=2: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
dosubot bot added the 🤖:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) on Mar 4, 2025
Examples (output of the print statement in the code above, showing both /Filter shapes):
pdf object /Im1 - {'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1653, '/Height': 2338, '/BitsPerComponent': 1, '/ColorSpace': '/DeviceGray', '/Filter': '/CCITTFaxDecode', '/DecodeParms': {'/K': -1, '/Columns': 1653, '/Rows': 2338}}
pdf object /I0 - {'/Type': '/XObject', '/Subtype': '/Image', '/Width': 2478, '/Height': 3488, '/BitsPerComponent': 8, '/ColorSpace': '/DeviceRGB', '/Filter': ['/DCTDecode'], '/DecodeParms': [{}]}
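To make the failure mode concrete, here is a small sketch that uses plain Python values in place of the pypdf objects. The `known_filters` list is an illustrative stand-in for the parser's lookup tables; its contents and its type (a list of strings) are assumptions, not copied from the library.

```python
# The two /Filter shapes from the output above, as plain Python values.
filter_as_name = "/CCITTFaxDecode"  # single name, as in /Im1
filter_as_array = ["/DCTDecode"]    # one-element array, as in /I0

# Slicing a string drops the leading "/", which is what the parser relies on.
name_sliced = filter_as_name[1:]
print(name_sliced)  # CCITTFaxDecode

# Slicing a list drops its first element instead, leaving an empty list.
array_sliced = filter_as_array[1:]
print(array_sliced)  # []

# Illustrative stand-in for the parser's lookup table (contents assumed).
known_filters = ["DCTDecode", "DCT", "JPXDecode"]

# The empty list never matches a filter name, so the membership test is False
# even though the image really is DCT-encoded.
print(array_sliced in known_filters)  # False
```

Under these assumptions the array case falls through both branches and ends at the "Unknown PDF Filter!" warning, so the image is skipped silently instead of being passed to the images parser.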
Suggested code fix:
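One way to make the check independent of the shape is to normalize /Filter into a list of names before comparing. The following is a minimal sketch under that assumption; `filter_names` is a hypothetical helper, not an existing function in langchain_community or pypdf, and it is not necessarily the change adopted upstream.

```python
from typing import Any, List


def filter_names(raw_filter: Any) -> List[str]:
    """Return the filter names of an image XObject without the leading "/".

    Hypothetical helper, not part of langchain_community or pypdf.
    Accepts a single name ("/DCTDecode"), an array of names
    (["/DCTDecode"]), or None when the /Filter key is absent.
    """
    if raw_filter is None:
        return []
    if isinstance(raw_filter, str):
        # A single name: wrap it so both shapes are handled the same way.
        raw_filter = [raw_filter]
    return [str(name).lstrip("/") for name in raw_filter]


print(filter_names("/CCITTFaxDecode"))  # ['CCITTFaxDecode']
print(filter_names(["/DCTDecode"]))     # ['DCTDecode']
print(filter_names(None))               # []
```

With such a helper, the two branches in extract_images_from_page could test `any(name in _PDF_FILTER_WITHOUT_LOSS for name in names)` and the equivalent for `_PDF_FILTER_WITH_LOSS`, where `names` is the normalized list. How to treat images that chain several filters is left open here.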