Skip to content

fix: avoid native crash when extracting PDF images#409

Open
the-waste-land wants to merge 3 commits intoopendataloader-project:mainfrom
the-waste-land:fix-pdfbox-image-extraction-crash
Open

fix: avoid native crash when extracting PDF images#409
the-waste-land wants to merge 3 commits intoopendataloader-project:mainfrom
the-waste-land:fix-pdfbox-image-extraction-crash

Conversation

@the-waste-land
Copy link
Copy Markdown

@the-waste-land the-waste-land commented Apr 11, 2026

This replaces the old page sub-image extraction path with direct PDFBox image-object extraction, then decodes PDImageXObject instances with getOpaqueImage() to avoid native crashes on PDFs that contain transparency/soft masks.

It also closes the PDFBox document through try-with-resources and adds a regression test that verifies image extraction against a scanned PDF sample.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Verification:

  • mvn -B -Dtest=ImagesUtilsTest -Dmaven.compiler.source=11 -Dmaven.compiler.target=11 -Dmaven.repo.local=/Users/liweixin/code/opendataloader-pdf/.m2/repository -pl opendataloader-pdf-core -f java/pom.xml test
  • mvn -B package -DskipTests -Dmaven.compiler.source=11 -Dmaven.compiler.target=11 -Dmaven.repo.local=/Users/liweixin/code/opendataloader-pdf/.m2/repository -f java/pom.xml
  • java -jar java/opendataloader-pdf-cli/target/opendataloader-pdf-cli-0.0.0.jar -q -f markdown --image-output external -o /tmp/odl-ext-pr "docs/M24VAH-产品规格书_V2.0-20250722.pdf"
  • java -jar java/opendataloader-pdf-cli/target/opendataloader-pdf-cli-0.0.0.jar -q -f markdown --image-output embedded -o /tmp/odl-emb-pr "docs/M24VAH-产品规格书_V2.0-20250722.pdf"

Note: scripts/bench.sh currently exits before benchmark execution with "ARGS[@]: unbound variable" when no extra args are passed, so I could not complete that repository-level check from this branch.

Summary by CodeRabbit

  • Bug Fixes
    • Improved resource management for PDF image extraction (automatic closing and per-page cache release) to prevent leaks and lower memory use.
  • Enhancements
    • More accurate image selection for extraction using overlap-based matching and support for password-protected PDFs.
  • Tests
    • Updated tests to validate extracted image files and their dimensions rather than internal state.

Signed-off-by: liweixin <liweixin@cvte.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 11, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 11, 2026

Walkthrough

DocumentProcessor now uses try-with-resources to manage ImagesUtils. ImagesUtils implements AutoCloseable, replaces contrast-ratio consumer logic with a PDFBox-driven per-page image extraction/caching pipeline (PageImageCollector / PageImageReference) and adds explicit release/close methods. Tests updated to validate PNG extraction.

Changes

Cohort / File(s) Summary
Images utility core
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
Now implements AutoCloseable; removes ContrastRatioConsumer accessor; adds lazy-loaded PDDocument, per-page pageImagesByNumber cache, PageImageReference and PageImageCollector (PDFGraphicsStreamEngine); introduces extractPageImage(...), per-page and full cache release methods, closePdfDocument() and close() overriding AutoCloseable; image selection picks best candidate by intersection area with tie-breaker; resource cleanup around per-page processing.
Resource management in processing
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
Wraps ImagesUtils usage in try-with-resources within generateOutputs(...), moving image write calls into the try block so ImagesUtils is closed automatically.
Tests
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java
Renamed/rewritten test to use try-with-resources, switch sample to chinese_scan.pdf, set image format to PNG, derive bounding box from page crop box, and assert output PNG exists and loads with non-zero dimensions; removed contrast-ratio-related assertions and cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant DocumentProcessor
    participant ImagesUtils
    participant PDDocument
    participant PageImageCollector
    participant ImageCache

    DocumentProcessor->>ImagesUtils: new ImagesUtils()
    DocumentProcessor->>ImagesUtils: writeImage(chunk, pdfPath, password)
    Note right of ImagesUtils: extractPageImage(imageBox, pdfPath, password)
    ImagesUtils->>PDDocument: load or reuse PDDocument (lazy)
    ImagesUtils->>PageImageCollector: enumerate images on page
    PageImageCollector->>PDDocument: parse page content / drawImage events
    PageImageCollector-->>ImagesUtils: return PageImageReference list
    ImagesUtils->>ImageCache: select best candidate (intersection area + tie-breaker), mark used
    ImagesUtils-->>DocumentProcessor: return BufferedImage / write file
    DocumentProcessor->>ImagesUtils: close() (end of try-with-resources)
    ImagesUtils->>ImageCache: release page images / clear cache
    ImagesUtils->>PDDocument: close document
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: avoid native crash when extracting PDF images' accurately describes the main change: replacing the previous image extraction path with PDFBox image-object extraction to prevent native crashes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 53-55: The pageImagesByNumber map retains PageImageReference
(which can hold decoded BufferedImage) until close(), causing one
full-resolution raster per processed page to accumulate; update the logic in
ImagesUtils to bound the cache to a single page: change usages around
pageImagesByNumber, PageImageReference, write(), and close() so that you only
keep entries for the currently-processing page (evict/remove previous page
entries and call any image disposal/flush on the PageImageReference when moving
to the next page), and ensure close() and any error paths also clear and null
out the map and release buffered images; apply the same single-page eviction
pattern to the other related code blocks referenced (lines ~63-69, ~182-193,
~269-277).
- Around line 217-225: ensurePdfDocument currently returns early if
pdfBoxDocument is non-null, causing reuse of the first-loaded PDF for other
files; change it to track the currently loaded source (add fields like
currentPdfPath and currentPassword) and, in ensurePdfDocument, compare the
requested pdfFilePath/password to those fields—if they differ, call
pdfBoxDocument.close() (or safely release it), update the tracking fields, and
load the new document via Loader.loadPDF (respecting the password branch); if
they match, return early as before. Ensure IOException handling around
Loader.loadPDF and closing.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 80-97: The test relies on a mutable StaticLayoutContainers image
format but never sets it, causing brittle assumptions about "imageFile1.png";
update ImagesUtilsTest to explicitly set the image format before writing (call
StaticLayoutContainers.setImageFormat("png") prior to imagesUtils.writeImage or
otherwise derive the expected filename from
StaticLayoutContainers.getImageFormat()), so the assertion for
Path.of(StaticLayoutContainers.getImagesDirectory(), "imageFile1.png") always
matches the actual output format.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d9df0134-7038-44f4-ab33-61fcc469b2fe

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 5b262e6.

📒 Files selected for processing (3)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

Signed-off-by: liweixin <liweixin@cvte.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 277-285: The close() method in ImagesUtils currently swallows
IOException from closePdfDocument() which hides cleanup failures; change the
method to propagate the exception by updating the signature to "public void
close() throws IOException", remove the try/catch around closePdfDocument() so
any IOException from closePdfDocument() bubbles up (after still calling
releaseAllPageImages()), and ensure callers (e.g., DocumentProcessor) handle or
propagate the IOException accordingly; alternatively, if you prefer unchecked,
wrap the caught IOException in a RuntimeException and rethrow from close(), but
do not silently log-and-suppress.
- Around line 343-355: drawImage currently stores full transformed image bounds
without applying the page clipping path, and clip()/appendRectangle() are
no-ops; update clip() and appendRectangle() to track/maintain the current
clipping path used by getGraphicsState().getCurrentClippingPath(), then in
drawImage(PDImage pdImage) obtain that clipping shape, transform it into the
same coordinate space as transformedPoint(...) and intersect it with the
computed left/right/top/bottom box (or skip storing if the intersection is
empty); finally create the PageImageReference with the intersected BoundingBox
(falling back to the original transformed bounds only if no clipping path is
present) so findBestPageImage() matches the visible, clipped image region.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 97-98: The test currently only checks file existence for
imageFile1.png in ImagesUtilsTest; enhance it to verify the file is a
readable/valid image by opening it with ImageIO.read and asserting the returned
BufferedImage is non-null (and optionally has width/height > 0) to prevent
passing on zero-byte or corrupted files—locate the pngPath usage in
ImagesUtilsTest and replace or augment the Files.exists assertion with an
ImageIO.read-based assertion (e.g., assertNotNull(image) and
assertTrue(image.getWidth() > 0 && image.getHeight() > 0)).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7ace6cf-a7e1-43cf-babc-8fcb7e00436b

📥 Commits

Reviewing files that changed from the base of the PR and between 5b262e6 and 7c0f074.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

Comment on lines +343 to +355
public void drawImage(PDImage pdImage) {
Point2D lowerLeft = transformedPoint(0, 0);
Point2D lowerRight = transformedPoint(1, 0);
Point2D upperLeft = transformedPoint(0, 1);
Point2D upperRight = transformedPoint(1, 1);

double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX()));
double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX()));
double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY()));
double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY()));

pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top)));
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f -name "ImagesUtils.java"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 315


🏁 Script executed:

fd "ImagesUtils.java"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 173


🏁 Script executed:

wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 177


🏁 Script executed:

sed -n '330,380p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2497


🏁 Script executed:

sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 5261


🏁 Script executed:

sed -n '300,370p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 3141


🏁 Script executed:

sed -n '330,409p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 3229


🏁 Script executed:

rg "getClippingPath\|getGraphicsState\|clipping" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 67


🏁 Script executed:

rg "PDFGraphicsStreamEngine" --type java -A 20 | head -50

Repository: opendataloader-project/opendataloader-pdf

Length of output: 6165


🏁 Script executed:

web_search
PDFGraphicsStreamEngine clipping state API

Repository: opendataloader-project/opendataloader-pdf

Length of output: 197


🌐 Web query:

PDFGraphicsStreamEngine clipping path getClippingPath method PDFBox

💡 Result:

No, the PDFGraphicsStreamEngine class in Apache PDFBox does not have a direct getClippingPath method. PDFGraphicsStreamEngine is an abstract subclass of PDFStreamEngine designed for advanced graphics processing, where subclasses must implement abstract methods like clip(int windingRule), endPath, etc., to handle path construction and clipping operations. To access the current clipping path: - Call getGraphicsState (inherited from PDFStreamEngine), which returns a PDGraphicsState object. - On PDGraphicsState, call getCurrentClippingPath, which returns an java.awt.geom.Area representing the current clipping path. Note: Do not modify the returned Area directly. This is confirmed across multiple official PDFBox API documentations (versions 2.0.7 through 3.0-alpha3) and source code analysis. Subclasses like CustomGraphicsStreamEngine (an example in PDFBox) do not override or add a getClippingPath method but inherit access via getGraphicsState.

Citations:


🏁 Script executed:

rg "pageImages" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | head -20

Repository: opendataloader-project/opendataloader-pdf

Length of output: 1189


🏁 Script executed:

sed -n '100,200p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | grep -A 15 "getPageImages"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 1486


🏁 Script executed:

sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 7659


🏁 Script executed:

sed -n '260,330p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2518


Account for PDF clipping when caching image bounds.

The drawImage() method calculates full transformed bounds but ignores any clipping state. The clip() and appendRectangle() methods that should track clipping paths are empty no-ops. When PDFs contain clipped or masked images, the stored bounds will be larger than the visible region, causing findBestPageImage() to match incorrectly and potentially export pixels outside the clipped area. Use getGraphicsState().getCurrentClippingPath() to intersect the transformed bounds with the active clipping region before storing the BoundingBox.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 343 - 355, drawImage currently stores full transformed image bounds
without applying the page clipping path, and clip()/appendRectangle() are
no-ops; update clip() and appendRectangle() to track/maintain the current
clipping path used by getGraphicsState().getCurrentClippingPath(), then in
drawImage(PDImage pdImage) obtain that clipping shape, transform it into the
same coordinate space as transformedPoint(...) and intersect it with the
computed left/right/top/bottom box (or skip storing if the intersection is
empty); finally create the PageImageReference with the intersected BoundingBox
(falling back to the original transformed bounds only if no clipping path is
present) so findBestPageImage() matches the visible, clipped image region.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is scoped to the native crash during image extraction. The current change replaces the old sub-image path with direct PDFBox image-object extraction and keeps the matching logic minimal for that bugfix.

Clipping-aware bounds would change the image-matching semantics and needs representative clipped-image samples to validate behavior across PDFs, so I am leaving that out of this PR. If maintainers want it, I can follow up separately with targeted samples and tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@the-waste-land, that's a fair and well-reasoned scope decision. The native-crash fix stands on its own and the clipping-aware matching is a separate concern that warrants dedicated samples and validation.

Would you like me to open a follow-up GitHub issue to track the clipping-aware bounds improvement so it isn't lost? I can include a backlink to this PR and the relevant context.


✏️ Learnings added
Learnt from: the-waste-land
Repo: opendataloader-project/opendataloader-pdf PR: 409
File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351
Timestamp: 2026-04-11T14:50:32.262Z
Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: hamid17amu
Repo: opendataloader-project/opendataloader-pdf PR: 0
File: :0-0
Timestamp: 2026-04-11T10:19:08.964Z
Learning: In `DocumentProcessor.java` (Java, opendataloader-pdf), `StaticLayoutContainers.closeContrastRatioConsumer()` does not expose verification APIs (e.g., `getContrastRatioConsumer()` or `hasContrastRatioConsumer()`). Verifying the closed state of the consumer would require changes to `StaticLayoutContainers` internals, which is out of scope for resource-cleanup-only changes.

Signed-off-by: liweixin <liweixin@cvte.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 119-127: The writePicture method depends on
picture.getPictureIndex() == 1 to create the images directory which is brittle;
instead, always ensure the directory exists before writing: call
createImagesDirectory(StaticLayoutContainers.getImagesDirectory())
unconditionally at the start of writePicture (or explicitly check File dir = new
File(StaticLayoutContainers.getImagesDirectory()) and create it if
!dir.exists()) so createImageFile won’t fail when indices are non-sequential;
update the writePicture method (referenced symbols: writePicture,
createImagesDirectory, StaticLayoutContainers.getImagesDirectory,
createImageFile, SemanticPicture.getPictureIndex) accordingly.
- Around line 338-359: drawImage currently records the full transformed image
bbox via PageImageReference and BoundingBox without considering any clipping;
implement tracking of the current clipping region (update state in
appendRectangle and clip) and intersect that clip with the transformed bbox
before adding to pageImages so stored bounds reflect visible area (use methods
around drawImage, appendRectangle, clip, pageImages, PageImageReference,
BoundingBox and pageIndex to locate code); if you decide not to support
clipping, add a clear TODO/comment in drawImage explaining that
appendRectangle/clip are no-ops and that stored bounds ignore clipping so
callers must handle clipped images.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cfc44069-d81c-4625-ad2d-2b69efeaf401

📥 Commits

Reviewing files that changed from the base of the PR and between 7c0f074 and 2949c5f.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

Comment on lines 119 to 127
protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) {
int pictureIndex = picture.getPictureIndex();
if (contrastRatioConsumer == null) {
if (pictureIndex == 1) {
createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
contrastRatioConsumer = StaticLayoutContainers.getContrastRatioConsumer(pdfFilePath, password, false, null);
}
String imageFormat = StaticLayoutContainers.getImageFormat();
String fileName = String.format(MarkdownSyntax.IMAGE_FILE_NAME_FORMAT, StaticLayoutContainers.getImagesDirectory(), File.separator, pictureIndex, imageFormat);
createImageFile(picture.getBoundingBox(), fileName, imageFormat);
createImageFile(picture.getBoundingBox(), fileName, imageFormat, pdfFilePath, password);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Directory creation relies on external picture indexing.

The writePicture method assumes picture.getPictureIndex() returns 1 for the first picture to trigger directory creation. If SemanticPicture indices don't start at 1 or are not sequential, the images directory may not be created before attempting to write the first picture file.

Consider using a flag or checking if the directory exists rather than relying on index values.

Suggested improvement
 protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) {
     int pictureIndex = picture.getPictureIndex();
-    if (pictureIndex == 1) {
-        createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
-    }
+    createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
     String imageFormat = StaticLayoutContainers.getImageFormat();

Note: createImagesDirectory already checks !directory.exists() so calling it unconditionally is safe and more robust.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 119 - 127, The writePicture method depends on
picture.getPictureIndex() == 1 to create the images directory which is brittle;
instead, always ensure the directory exists before writing: call
createImagesDirectory(StaticLayoutContainers.getImagesDirectory())
unconditionally at the start of writePicture (or explicitly check File dir = new
File(StaticLayoutContainers.getImagesDirectory()) and create it if
!dir.exists()) so createImageFile won’t fail when indices are non-sequential;
update the writePicture method (referenced symbols: writePicture,
createImagesDirectory, StaticLayoutContainers.getImagesDirectory,
createImageFile, SemanticPicture.getPictureIndex) accordingly.

Comment on lines +338 to +359
@Override
public void drawImage(PDImage pdImage) {
Point2D lowerLeft = transformedPoint(0, 0);
Point2D lowerRight = transformedPoint(1, 0);
Point2D upperLeft = transformedPoint(0, 1);
Point2D upperRight = transformedPoint(1, 1);

double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX()));
double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX()));
double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY()));
double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY()));

pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top)));
}

@Override
public void appendRectangle(Point2D point0, Point2D point1, Point2D point2, Point2D point3) {
}

@Override
public void clip(int windingRule) {
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any test PDFs or documentation about clipping support
rg -i "clip" --type java -C3 java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/

Repository: opendataloader-project/opendataloader-pdf

Length of output: 809


🏁 Script executed:

wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 177


🏁 Script executed:

cat -n java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 18564


🏁 Script executed:

rg "ForkJoinPool" java/opendataloader-pdf-core/

Repository: opendataloader-project/opendataloader-pdf

Length of output: 385


🏁 Script executed:

sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 4578


🏁 Script executed:

sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 7885


🏁 Script executed:

sed -n '250,350p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 5030


🏁 Script executed:

rg -A5 -B5 "new PageImageCollector" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 516


Clipping path not applied to stored image bounds.

The drawImage() method stores the full transformed bounding box without intersecting it with the active clipping path. The clip() and appendRectangle() methods are no-ops, so clipping state is never tracked. For PDFs with clipped or masked images, the stored bounds will be larger than the visible region, which may cause findBestPageImage() to select incorrect candidates when multiple images overlap the target area.

If this is an acceptable limitation for the current use case (e.g., scanned PDFs typically don't use clipping on images), consider adding a code comment documenting this constraint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 338 - 359, drawImage currently records the full transformed image
bbox via PageImageReference and BoundingBox without considering any clipping;
implement tracking of the current clipping region (update state in
appendRectangle and clip) and intersect that clip with the transformed bbox
before adding to pageImages so stored bounds reflect visible area (use methods
around drawImage, appendRectangle, clip, pageImages, PageImageReference,
BoundingBox and pageIndex to locate code); if you decide not to support
clipping, add a clear TODO/comment in drawImage explaining that
appendRectangle/clip are no-ops and that stored bounds ignore clipping so
callers must handle clipped images.

@bundolee
Copy link
Copy Markdown
Contributor

Here's how the review process will go from here:

  1. CodeRabbit — all review threads resolved with a fix commit
  2. CI — all checks pass (codecov excluded)
  3. Component owner approval

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants