fix: avoid native crash when extracting PDF images by the-waste-land · Pull Request #409 · opendataloader-project/opendataloader-pdf

the-waste-land · 2026-04-11T02:58:08Z

This replaces the old page sub-image extraction path with direct PDFBox image-object extraction, then decodes PDImageXObject instances with getOpaqueImage() to avoid native crashes on PDFs that contain transparency/soft masks.

It also closes the PDFBox document through try-with-resources and adds a regression test that verifies image extraction against a scanned PDF sample.

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

Verification:

mvn -B -Dtest=ImagesUtilsTest -Dmaven.compiler.source=11 -Dmaven.compiler.target=11 -Dmaven.repo.local=/Users/liweixin/code/opendataloader-pdf/.m2/repository -pl opendataloader-pdf-core -f java/pom.xml test
mvn -B package -DskipTests -Dmaven.compiler.source=11 -Dmaven.compiler.target=11 -Dmaven.repo.local=/Users/liweixin/code/opendataloader-pdf/.m2/repository -f java/pom.xml
java -jar java/opendataloader-pdf-cli/target/opendataloader-pdf-cli-0.0.0.jar -q -f markdown --image-output external -o /tmp/odl-ext-pr "docs/M24VAH-产品规格书_V2.0-20250722.pdf"
java -jar java/opendataloader-pdf-cli/target/opendataloader-pdf-cli-0.0.0.jar -q -f markdown --image-output embedded -o /tmp/odl-emb-pr "docs/M24VAH-产品规格书_V2.0-20250722.pdf"

Note: scripts/bench.sh currently exits before benchmark execution with "ARGS[@]: unbound variable" when no extra args are passed, so I could not complete that repository-level check from this branch.

Summary by CodeRabbit

Bug Fixes
- Improved resource management for PDF image extraction (automatic closing and per-page cache release) to prevent leaks and lower memory use.
Enhancements
- More accurate image selection for extraction using overlap-based matching and support for password-protected PDFs.
Tests
- Updated tests to validate extracted image files and their dimensions rather than internal state.

Signed-off-by: liweixin <liweixin@cvte.com>

CLAassistant · 2026-04-11T02:58:14Z

All committers have signed the CLA.

coderabbitai · 2026-04-11T02:58:21Z

Walkthrough

DocumentProcessor now uses try-with-resources to manage ImagesUtils. ImagesUtils implements AutoCloseable, replaces contrast-ratio consumer logic with a PDFBox-driven per-page image extraction/caching pipeline (PageImageCollector / PageImageReference) and adds explicit release/close methods. Tests updated to validate PNG extraction.

Changes

Cohort / File(s)	Summary
Images utility core `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`	Now implements `AutoCloseable`; removes `ContrastRatioConsumer` accessor; adds lazy-loaded `PDDocument`, per-page `pageImagesByNumber` cache, `PageImageReference` and `PageImageCollector` (PDFGraphicsStreamEngine); introduces `extractPageImage(...)`, per-page and full cache release methods, `closePdfDocument()` and `close()` overriding `AutoCloseable`; image selection picks best candidate by intersection area with tie-breaker; resource cleanup around per-page processing.
Resource management in processing `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`	Wraps `ImagesUtils` usage in try-with-resources within `generateOutputs(...)`, moving image write calls into the try block so `ImagesUtils` is closed automatically.
Tests `java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`	Renamed/rewritten test to use try-with-resources, switch sample to `chinese_scan.pdf`, set image format to PNG, derive bounding box from page crop box, and assert output PNG exists and loads with non-zero dimensions; removed contrast-ratio-related assertions and cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant DocumentProcessor
    participant ImagesUtils
    participant PDDocument
    participant PageImageCollector
    participant ImageCache

    DocumentProcessor->>ImagesUtils: new ImagesUtils()
    DocumentProcessor->>ImagesUtils: writeImage(chunk, pdfPath, password)
    Note right of ImagesUtils: extractPageImage(imageBox, pdfPath, password)
    ImagesUtils->>PDDocument: load or reuse PDDocument (lazy)
    ImagesUtils->>PageImageCollector: enumerate images on page
    PageImageCollector->>PDDocument: parse page content / drawImage events
    PageImageCollector-->>ImagesUtils: return PageImageReference list
    ImagesUtils->>ImageCache: select best candidate (intersection area + tie-breaker), mark used
    ImagesUtils-->>DocumentProcessor: return BufferedImage / write file
    DocumentProcessor->>ImagesUtils: close() (end of try-with-resources)
    ImagesUtils->>ImageCache: release page images / clear cache
    ImagesUtils->>PDDocument: close document

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: avoid native crash when extracting PDF images' accurately describes the main change: replacing the previous image extraction path with PDFBox image-object extraction to prevent native crashes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 53-55: The pageImagesByNumber map retains PageImageReference
(which can hold decoded BufferedImage) until close(), causing one
full-resolution raster per processed page to accumulate; update the logic in
ImagesUtils to bound the cache to a single page: change usages around
pageImagesByNumber, PageImageReference, write(), and close() so that you only
keep entries for the currently-processing page (evict/remove previous page
entries and call any image disposal/flush on the PageImageReference when moving
to the next page), and ensure close() and any error paths also clear and null
out the map and release buffered images; apply the same single-page eviction
pattern to the other related code blocks referenced (lines ~63-69, ~182-193,
~269-277).
- Around line 217-225: ensurePdfDocument currently returns early if
pdfBoxDocument is non-null, causing reuse of the first-loaded PDF for other
files; change it to track the currently loaded source (add fields like
currentPdfPath and currentPassword) and, in ensurePdfDocument, compare the
requested pdfFilePath/password to those fields—if they differ, call
pdfBoxDocument.close() (or safely release it), update the tracking fields, and
load the new document via Loader.loadPDF (respecting the password branch); if
they match, return early as before. Ensure IOException handling around
Loader.loadPDF and closing.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 80-97: The test relies on a mutable StaticLayoutContainers image
format but never sets it, causing brittle assumptions about "imageFile1.png";
update ImagesUtilsTest to explicitly set the image format before writing (call
StaticLayoutContainers.setImageFormat("png") prior to imagesUtils.writeImage or
otherwise derive the expected filename from
StaticLayoutContainers.getImageFormat()), so the assertion for
Path.of(StaticLayoutContainers.getImagesDirectory(), "imageFile1.png") always
matches the actual output format.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d9df0134-7038-44f4-ab33-61fcc469b2fe

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 5b262e6.

📒 Files selected for processing (3)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

Signed-off-by: liweixin <liweixin@cvte.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 277-285: The close() method in ImagesUtils currently swallows
IOException from closePdfDocument() which hides cleanup failures; change the
method to propagate the exception by updating the signature to "public void
close() throws IOException", remove the try/catch around closePdfDocument() so
any IOException from closePdfDocument() bubbles up (after still calling
releaseAllPageImages()), and ensure callers (e.g., DocumentProcessor) handle or
propagate the IOException accordingly; alternatively, if you prefer unchecked,
wrap the caught IOException in a RuntimeException and rethrow from close(), but
do not silently log-and-suppress.
- Around line 343-355: drawImage currently stores full transformed image bounds
without applying the page clipping path, and clip()/appendRectangle() are
no-ops; update clip() and appendRectangle() to track/maintain the current
clipping path used by getGraphicsState().getCurrentClippingPath(), then in
drawImage(PDImage pdImage) obtain that clipping shape, transform it into the
same coordinate space as transformedPoint(...) and intersect it with the
computed left/right/top/bottom box (or skip storing if the intersection is
empty); finally create the PageImageReference with the intersected BoundingBox
(falling back to the original transformed bounds only if no clipping path is
present) so findBestPageImage() matches the visible, clipped image region.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 97-98: The test currently only checks file existence for
imageFile1.png in ImagesUtilsTest; enhance it to verify the file is a
readable/valid image by opening it with ImageIO.read and asserting the returned
BufferedImage is non-null (and optionally has width/height > 0) to prevent
passing on zero-byte or corrupted files—locate the pngPath usage in
ImagesUtilsTest and replace or augment the Files.exists assertion with an
ImageIO.read-based assertion (e.g., assertNotNull(image) and
assertTrue(image.getWidth() > 0 && image.getHeight() > 0)).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7ace6cf-a7e1-43cf-babc-8fcb7e00436b

📥 Commits

Reviewing files that changed from the base of the PR and between 5b262e6 and 7c0f074.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

coderabbitai · 2026-04-11T03:31:07Z

+        public void drawImage(PDImage pdImage) {
+            Point2D lowerLeft = transformedPoint(0, 0);
+            Point2D lowerRight = transformedPoint(1, 0);
+            Point2D upperLeft = transformedPoint(0, 1);
+            Point2D upperRight = transformedPoint(1, 1);
+
+            double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX()));
+            double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX()));
+            double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY()));
+            double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY()));
+
+            pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top)));
+        }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f -name "ImagesUtils.java"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 315

🏁 Script executed:

fd "ImagesUtils.java"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 173

🏁 Script executed:

wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 177

🏁 Script executed:

sed -n '330,380p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2497

🏁 Script executed:

sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 5261

🏁 Script executed:

sed -n '300,370p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 3141

🏁 Script executed:

sed -n '330,409p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 3229

🏁 Script executed:

rg "getClippingPath\|getGraphicsState\|clipping" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 67

🏁 Script executed:

rg "PDFGraphicsStreamEngine" --type java -A 20 | head -50

Repository: opendataloader-project/opendataloader-pdf

Length of output: 6165

🏁 Script executed:

web_search PDFGraphicsStreamEngine clipping state API

Repository: opendataloader-project/opendataloader-pdf

Length of output: 197

🌐 Web query:

PDFGraphicsStreamEngine clipping path getClippingPath method PDFBox

💡 Result:

No, the PDFGraphicsStreamEngine class in Apache PDFBox does not have a direct getClippingPath method. PDFGraphicsStreamEngine is an abstract subclass of PDFStreamEngine designed for advanced graphics processing, where subclasses must implement abstract methods like clip(int windingRule), endPath, etc., to handle path construction and clipping operations. To access the current clipping path: - Call getGraphicsState (inherited from PDFStreamEngine), which returns a PDGraphicsState object. - On PDGraphicsState, call getCurrentClippingPath, which returns an java.awt.geom.Area representing the current clipping path. Note: Do not modify the returned Area directly. This is confirmed across multiple official PDFBox API documentations (versions 2.0.7 through 3.0-alpha3) and source code analysis. Subclasses like CustomGraphicsStreamEngine (an example in PDFBox) do not override or add a getClippingPath method but inherit access via getGraphicsState.

Citations:

1: https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html

2: https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html

3: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.8/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html

4: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.24/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html

5: https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.html

6: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.8/org/apache/pdfbox/pdmodel/graphics/state/PDGraphicsState.html

7: https://pdfbox.apache.org/docs/2.0.1/javadocs/org/apache/pdfbox/pdmodel/graphics/state/class-use/PDGraphicsState.html

8: https://javadoc.io/doc/org.apache.pdfbox/pdfbox/3.0.0-alpha3/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html

9: https://pdfbox.apache.org/

🏁 Script executed:

rg "pageImages" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | head -20

Repository: opendataloader-project/opendataloader-pdf

Length of output: 1189

🏁 Script executed:

sed -n '100,200p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | grep -A 15 "getPageImages"

Repository: opendataloader-project/opendataloader-pdf

Length of output: 1486

🏁 Script executed:

sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 7659

🏁 Script executed:

sed -n '260,330p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -n

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2518

Account for PDF clipping when caching image bounds.

The drawImage() method calculates full transformed bounds but ignores any clipping state. The clip() and appendRectangle() methods that should track clipping paths are empty no-ops. When PDFs contain clipped or masked images, the stored bounds will be larger than the visible region, causing findBestPageImage() to match incorrectly and potentially export pixels outside the clipped area. Use getGraphicsState().getCurrentClippingPath() to intersect the transformed bounds with the active clipping region before storing the BoundingBox.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java` around lines 343 - 355, drawImage currently stores full transformed image bounds without applying the page clipping path, and clip()/appendRectangle() are no-ops; update clip() and appendRectangle() to track/maintain the current clipping path used by getGraphicsState().getCurrentClippingPath(), then in drawImage(PDImage pdImage) obtain that clipping shape, transform it into the same coordinate space as transformedPoint(...) and intersect it with the computed left/right/top/bottom box (or skip storing if the intersection is empty); finally create the PageImageReference with the intersected BoundingBox (falling back to the original transformed bounds only if no clipping path is present) so findBestPageImage() matches the visible, clipped image region.

This PR is scoped to the native crash during image extraction. The current change replaces the old sub-image path with direct PDFBox image-object extraction and keeps the matching logic minimal for that bugfix.

Clipping-aware bounds would change the image-matching semantics and needs representative clipped-image samples to validate behavior across PDFs, so I am leaving that out of this PR. If maintainers want it, I can follow up separately with targeted samples and tests.

@the-waste-land, that's a fair and well-reasoned scope decision. The native-crash fix stands on its own and the clipping-aware matching is a separate concern that warrants dedicated samples and validation.

Would you like me to open a follow-up GitHub issue to track the clipping-aware bounds improvement so it isn't lost? I can include a backlink to this PR and the relevant context.

✏️ Learnings added

Learnt from: the-waste-land Repo: opendataloader-project/opendataloader-pdf PR: 409 File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351 Timestamp: 2026-04-11T14:50:32.262Z Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: hamid17amu Repo: opendataloader-project/opendataloader-pdf PR: 0 File: :0-0 Timestamp: 2026-04-11T10:19:08.964Z Learning: In `DocumentProcessor.java` (Java, opendataloader-pdf), `StaticLayoutContainers.closeContrastRatioConsumer()` does not expose verification APIs (e.g., `getContrastRatioConsumer()` or `hasContrastRatioConsumer()`). Verifying the closed state of the consumer would require changes to `StaticLayoutContainers` internals, which is out of scope for resource-cleanup-only changes.

Signed-off-by: liweixin <liweixin@cvte.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 119-127: The writePicture method depends on
picture.getPictureIndex() == 1 to create the images directory which is brittle;
instead, always ensure the directory exists before writing: call
createImagesDirectory(StaticLayoutContainers.getImagesDirectory())
unconditionally at the start of writePicture (or explicitly check File dir = new
File(StaticLayoutContainers.getImagesDirectory()) and create it if
!dir.exists()) so createImageFile won’t fail when indices are non-sequential;
update the writePicture method (referenced symbols: writePicture,
createImagesDirectory, StaticLayoutContainers.getImagesDirectory,
createImageFile, SemanticPicture.getPictureIndex) accordingly.
- Around line 338-359: drawImage currently records the full transformed image
bbox via PageImageReference and BoundingBox without considering any clipping;
implement tracking of the current clipping region (update state in
appendRectangle and clip) and intersect that clip with the transformed bbox
before adding to pageImages so stored bounds reflect visible area (use methods
around drawImage, appendRectangle, clip, pageImages, PageImageReference,
BoundingBox and pageIndex to locate code); if you decide not to support
clipping, add a clear TODO/comment in drawImage explaining that
appendRectangle/clip are no-ops and that stored bounds ignore clipping so
callers must handle clipped images.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cfc44069-d81c-4625-ad2d-2b69efeaf401

📥 Commits

Reviewing files that changed from the base of the PR and between 7c0f074 and 2949c5f.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java

coderabbitai · 2026-04-11T04:52:26Z

    protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) {
        int pictureIndex = picture.getPictureIndex();
-        if (contrastRatioConsumer == null) {
+        if (pictureIndex == 1) {
            createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
-            contrastRatioConsumer = StaticLayoutContainers.getContrastRatioConsumer(pdfFilePath, password, false, null);
        }
        String imageFormat = StaticLayoutContainers.getImageFormat();
        String fileName = String.format(MarkdownSyntax.IMAGE_FILE_NAME_FORMAT, StaticLayoutContainers.getImagesDirectory(), File.separator, pictureIndex, imageFormat);
-        createImageFile(picture.getBoundingBox(), fileName, imageFormat);
+        createImageFile(picture.getBoundingBox(), fileName, imageFormat, pdfFilePath, password);
    }


⚠️ Potential issue | 🟡 Minor

Directory creation relies on external picture indexing.

The writePicture method assumes picture.getPictureIndex() returns 1 for the first picture to trigger directory creation. If SemanticPicture indices don't start at 1 or are not sequential, the images directory may not be created before attempting to write the first picture file.

Consider using a flag or checking if the directory exists rather than relying on index values.

Suggested improvement

protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) { int pictureIndex = picture.getPictureIndex(); - if (pictureIndex == 1) { - createImagesDirectory(StaticLayoutContainers.getImagesDirectory()); - } + createImagesDirectory(StaticLayoutContainers.getImagesDirectory()); String imageFormat = StaticLayoutContainers.getImageFormat();

Note: createImagesDirectory already checks !directory.exists() so calling it unconditionally is safe and more robust.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java` around lines 119 - 127, The writePicture method depends on picture.getPictureIndex() == 1 to create the images directory which is brittle; instead, always ensure the directory exists before writing: call createImagesDirectory(StaticLayoutContainers.getImagesDirectory()) unconditionally at the start of writePicture (or explicitly check File dir = new File(StaticLayoutContainers.getImagesDirectory()) and create it if !dir.exists()) so createImageFile won’t fail when indices are non-sequential; update the writePicture method (referenced symbols: writePicture, createImagesDirectory, StaticLayoutContainers.getImagesDirectory, createImageFile, SemanticPicture.getPictureIndex) accordingly.

coderabbitai · 2026-04-11T04:52:26Z

+        @Override
+        public void drawImage(PDImage pdImage) {
+            Point2D lowerLeft = transformedPoint(0, 0);
+            Point2D lowerRight = transformedPoint(1, 0);
+            Point2D upperLeft = transformedPoint(0, 1);
+            Point2D upperRight = transformedPoint(1, 1);
+
+            double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX()));
+            double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX()));
+            double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY()));
+            double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY()));
+
+            pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top)));
+        }
+
+        @Override
+        public void appendRectangle(Point2D point0, Point2D point1, Point2D point2, Point2D point3) {
+        }
+
+        @Override
+        public void clip(int windingRule) {
+        }


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if there are any test PDFs or documentation about clipping support rg -i "clip" --type java -C3 java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/

Repository: opendataloader-project/opendataloader-pdf

Length of output: 809

🏁 Script executed:

wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 177

🏁 Script executed:

cat -n java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 18564

🏁 Script executed:

rg "ForkJoinPool" java/opendataloader-pdf-core/

Repository: opendataloader-project/opendataloader-pdf

Length of output: 385

🏁 Script executed:

sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 4578

🏁 Script executed:

sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 7885

🏁 Script executed:

sed -n '250,350p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 5030

🏁 Script executed:

rg -A5 -B5 "new PageImageCollector" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java

Repository: opendataloader-project/opendataloader-pdf

Length of output: 516

Clipping path not applied to stored image bounds.

The drawImage() method stores the full transformed bounding box without intersecting it with the active clipping path. The clip() and appendRectangle() methods are no-ops, so clipping state is never tracked. For PDFs with clipped or masked images, the stored bounds will be larger than the visible region, which may cause findBestPageImage() to select incorrect candidates when multiple images overlap the target area.

If this is an acceptable limitation for the current use case (e.g., scanned PDFs typically don't use clipping on images), consider adding a code comment documenting this constraint.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java` around lines 338 - 359, drawImage currently records the full transformed image bbox via PageImageReference and BoundingBox without considering any clipping; implement tracking of the current clipping region (update state in appendRectangle and clip) and intersect that clip with the transformed bbox before adding to pageImages so stored bounds reflect visible area (use methods around drawImage, appendRectangle, clip, pageImages, PageImageReference, BoundingBox and pageIndex to locate code); if you decide not to support clipping, add a clear TODO/comment in drawImage explaining that appendRectangle/clip are no-ops and that stored bounds ignore clipping so callers must handle clipped images.

bundolee · 2026-04-17T13:37:12Z

Here's how the review process will go from here:

CodeRabbit — all review threads resolved with a fix commit
CI — all checks pass (codecov excluded)
Component owner approval
- Java (java/) → @MaximPlusov
- Others → @bundolee

Fix avoid native crash when extracting PDF images

5b262e6

Signed-off-by: liweixin <liweixin@cvte.com>

the-waste-land requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 11, 2026 02:58

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Fix review follow-ups for image extraction

7c0f074

Signed-off-by: liweixin <liweixin@cvte.com>

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Fix follow-up review comments

2949c5f

Signed-off-by: liweixin <liweixin@cvte.com>

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Conversation

the-waste-land commented Apr 11, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

CLAassistant commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

the-waste-land Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

bundolee commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

the-waste-land commented Apr 11, 2026 •

edited by coderabbitai bot

Loading

CLAassistant commented Apr 11, 2026 •

edited

Loading

coderabbitai bot commented Apr 11, 2026 •

edited

Loading

coderabbitai bot Apr 11, 2026 •

edited

Loading