fix: avoid native crash when extracting PDF images#409
fix: avoid native crash when extracting PDF images#409the-waste-land wants to merge 3 commits intoopendataloader-project:mainfrom
Conversation
Signed-off-by: liweixin <liweixin@cvte.com>
WalkthroughDocumentProcessor now uses try-with-resources to manage ImagesUtils. ImagesUtils implements AutoCloseable, replaces contrast-ratio consumer logic with a PDFBox-driven per-page image extraction/caching pipeline (PageImageCollector / PageImageReference) and adds explicit release/close methods. Tests updated to validate PNG extraction. Changes
Sequence Diagram(s)sequenceDiagram
participant DocumentProcessor
participant ImagesUtils
participant PDDocument
participant PageImageCollector
participant ImageCache
DocumentProcessor->>ImagesUtils: new ImagesUtils()
DocumentProcessor->>ImagesUtils: writeImage(chunk, pdfPath, password)
Note right of ImagesUtils: extractPageImage(imageBox, pdfPath, password)
ImagesUtils->>PDDocument: load or reuse PDDocument (lazy)
ImagesUtils->>PageImageCollector: enumerate images on page
PageImageCollector->>PDDocument: parse page content / drawImage events
PageImageCollector-->>ImagesUtils: return PageImageReference list
ImagesUtils->>ImageCache: select best candidate (intersection area + tie-breaker), mark used
ImagesUtils-->>DocumentProcessor: return BufferedImage / write file
DocumentProcessor->>ImagesUtils: close() (end of try-with-resources)
ImagesUtils->>ImageCache: release page images / clear cache
ImagesUtils->>PDDocument: close document
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 53-55: The pageImagesByNumber map retains PageImageReference
(which can hold decoded BufferedImage) until close(), causing one
full-resolution raster per processed page to accumulate; update the logic in
ImagesUtils to bound the cache to a single page: change usages around
pageImagesByNumber, PageImageReference, write(), and close() so that you only
keep entries for the currently-processing page (evict/remove previous page
entries and call any image disposal/flush on the PageImageReference when moving
to the next page), and ensure close() and any error paths also clear and null
out the map and release buffered images; apply the same single-page eviction
pattern to the other related code blocks referenced (lines ~63-69, ~182-193,
~269-277).
- Around line 217-225: ensurePdfDocument currently returns early if
pdfBoxDocument is non-null, causing reuse of the first-loaded PDF for other
files; change it to track the currently loaded source (add fields like
currentPdfPath and currentPassword) and, in ensurePdfDocument, compare the
requested pdfFilePath/password to those fields—if they differ, call
pdfBoxDocument.close() (or safely release it), update the tracking fields, and
load the new document via Loader.loadPDF (respecting the password branch); if
they match, return early as before. Ensure IOException handling around
Loader.loadPDF and closing.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 80-97: The test relies on a mutable StaticLayoutContainers image
format but never sets it, causing brittle assumptions about "imageFile1.png";
update ImagesUtilsTest to explicitly set the image format before writing (call
StaticLayoutContainers.setImageFormat("png") prior to imagesUtils.writeImage or
otherwise derive the expected filename from
StaticLayoutContainers.getImageFormat()), so the assertion for
Path.of(StaticLayoutContainers.getImagesDirectory(), "imageFile1.png") always
matches the actual output format.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d9df0134-7038-44f4-ab33-61fcc469b2fe
📒 Files selected for processing (3)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java
Signed-off-by: liweixin <liweixin@cvte.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 277-285: The close() method in ImagesUtils currently swallows
IOException from closePdfDocument() which hides cleanup failures; change the
method to propagate the exception by updating the signature to "public void
close() throws IOException", remove the try/catch around closePdfDocument() so
any IOException from closePdfDocument() bubbles up (after still calling
releaseAllPageImages()), and ensure callers (e.g., DocumentProcessor) handle or
propagate the IOException accordingly; alternatively, if you prefer unchecked,
wrap the caught IOException in a RuntimeException and rethrow from close(), but
do not silently log-and-suppress.
- Around line 343-355: drawImage currently stores full transformed image bounds
without applying the page clipping path, and clip()/appendRectangle() are
no-ops; update clip() and appendRectangle() to track/maintain the current
clipping path used by getGraphicsState().getCurrentClippingPath(), then in
drawImage(PDImage pdImage) obtain that clipping shape, transform it into the
same coordinate space as transformedPoint(...) and intersect it with the
computed left/right/top/bottom box (or skip storing if the intersection is
empty); finally create the PageImageReference with the intersected BoundingBox
(falling back to the original transformed bounds only if no clipping path is
present) so findBestPageImage() matches the visible, clipped image region.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java`:
- Around line 97-98: The test currently only checks file existence for
imageFile1.png in ImagesUtilsTest; enhance it to verify the file is a
readable/valid image by opening it with ImageIO.read and asserting the returned
BufferedImage is non-null (and optionally has width/height > 0) to prevent
passing on zero-byte or corrupted files—locate the pngPath usage in
ImagesUtilsTest and replace or augment the Files.exists assertion with an
ImageIO.read-based assertion (e.g., assertNotNull(image) and
assertTrue(image.getWidth() > 0 && image.getHeight() > 0)).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d7ace6cf-a7e1-43cf-babc-8fcb7e00436b
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java
| public void drawImage(PDImage pdImage) { | ||
| Point2D lowerLeft = transformedPoint(0, 0); | ||
| Point2D lowerRight = transformedPoint(1, 0); | ||
| Point2D upperLeft = transformedPoint(0, 1); | ||
| Point2D upperRight = transformedPoint(1, 1); | ||
|
|
||
| double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX())); | ||
| double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX())); | ||
| double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY())); | ||
| double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY())); | ||
|
|
||
| pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top))); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd -t f -name "ImagesUtils.java"Repository: opendataloader-project/opendataloader-pdf
Length of output: 315
🏁 Script executed:
fd "ImagesUtils.java"Repository: opendataloader-project/opendataloader-pdf
Length of output: 173
🏁 Script executed:
wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 177
🏁 Script executed:
sed -n '330,380p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 2497
🏁 Script executed:
sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 5261
🏁 Script executed:
sed -n '300,370p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 3141
🏁 Script executed:
sed -n '330,409p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 3229
🏁 Script executed:
rg "getClippingPath\|getGraphicsState\|clipping" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 67
🏁 Script executed:
rg "PDFGraphicsStreamEngine" --type java -A 20 | head -50Repository: opendataloader-project/opendataloader-pdf
Length of output: 6165
🏁 Script executed:
web_search
PDFGraphicsStreamEngine clipping state APIRepository: opendataloader-project/opendataloader-pdf
Length of output: 197
🌐 Web query:
PDFGraphicsStreamEngine clipping path getClippingPath method PDFBox
💡 Result:
No, the PDFGraphicsStreamEngine class in Apache PDFBox does not have a direct getClippingPath method. PDFGraphicsStreamEngine is an abstract subclass of PDFStreamEngine designed for advanced graphics processing, where subclasses must implement abstract methods like clip(int windingRule), endPath, etc., to handle path construction and clipping operations. To access the current clipping path: - Call getGraphicsState (inherited from PDFStreamEngine), which returns a PDGraphicsState object. - On PDGraphicsState, call getCurrentClippingPath, which returns an java.awt.geom.Area representing the current clipping path. Note: Do not modify the returned Area directly. This is confirmed across multiple official PDFBox API documentations (versions 2.0.7 through 3.0-alpha3) and source code analysis. Subclasses like CustomGraphicsStreamEngine (an example in PDFBox) do not override or add a getClippingPath method but inherit access via getGraphicsState.
Citations:
- 1: https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html
- 2: https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html
- 3: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.8/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html
- 4: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.24/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html
- 5: https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.html
- 6: https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.8/org/apache/pdfbox/pdmodel/graphics/state/PDGraphicsState.html
- 7: https://pdfbox.apache.org/docs/2.0.1/javadocs/org/apache/pdfbox/pdmodel/graphics/state/class-use/PDGraphicsState.html
- 8: https://javadoc.io/doc/org.apache.pdfbox/pdfbox/3.0.0-alpha3/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html
- 9: https://pdfbox.apache.org/
🏁 Script executed:
rg "pageImages" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | head -20Repository: opendataloader-project/opendataloader-pdf
Length of output: 1189
🏁 Script executed:
sed -n '100,200p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | grep -A 15 "getPageImages"Repository: opendataloader-project/opendataloader-pdf
Length of output: 1486
🏁 Script executed:
sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 7659
🏁 Script executed:
sed -n '260,330p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java | cat -nRepository: opendataloader-project/opendataloader-pdf
Length of output: 2518
Account for PDF clipping when caching image bounds.
The drawImage() method calculates full transformed bounds but ignores any clipping state. The clip() and appendRectangle() methods that should track clipping paths are empty no-ops. When PDFs contain clipped or masked images, the stored bounds will be larger than the visible region, causing findBestPageImage() to match incorrectly and potentially export pixels outside the clipped area. Use getGraphicsState().getCurrentClippingPath() to intersect the transformed bounds with the active clipping region before storing the BoundingBox.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 343 - 355, drawImage currently stores full transformed image bounds
without applying the page clipping path, and clip()/appendRectangle() are
no-ops; update clip() and appendRectangle() to track/maintain the current
clipping path used by getGraphicsState().getCurrentClippingPath(), then in
drawImage(PDImage pdImage) obtain that clipping shape, transform it into the
same coordinate space as transformedPoint(...) and intersect it with the
computed left/right/top/bottom box (or skip storing if the intersection is
empty); finally create the PageImageReference with the intersected BoundingBox
(falling back to the original transformed bounds only if no clipping path is
present) so findBestPageImage() matches the visible, clipped image region.
There was a problem hiding this comment.
This PR is scoped to the native crash during image extraction. The current change replaces the old sub-image path with direct PDFBox image-object extraction and keeps the matching logic minimal for that bugfix.
Clipping-aware bounds would change the image-matching semantics and needs representative clipped-image samples to validate behavior across PDFs, so I am leaving that out of this PR. If maintainers want it, I can follow up separately with targeted samples and tests.
There was a problem hiding this comment.
@the-waste-land, that's a fair and well-reasoned scope decision. The native-crash fix stands on its own and the clipping-aware matching is a separate concern that warrants dedicated samples and validation.
Would you like me to open a follow-up GitHub issue to track the clipping-aware bounds improvement so it isn't lost? I can include a backlink to this PR and the relevant context.
✏️ Learnings added
Learnt from: the-waste-land
Repo: opendataloader-project/opendataloader-pdf PR: 409
File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351
Timestamp: 2026-04-11T14:50:32.262Z
Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: hamid17amu
Repo: opendataloader-project/opendataloader-pdf PR: 0
File: :0-0
Timestamp: 2026-04-11T10:19:08.964Z
Learning: In `DocumentProcessor.java` (Java, opendataloader-pdf), `StaticLayoutContainers.closeContrastRatioConsumer()` does not expose verification APIs (e.g., `getContrastRatioConsumer()` or `hasContrastRatioConsumer()`). Verifying the closed state of the consumer would require changes to `StaticLayoutContainers` internals, which is out of scope for resource-cleanup-only changes.
Signed-off-by: liweixin <liweixin@cvte.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`:
- Around line 119-127: The writePicture method depends on
picture.getPictureIndex() == 1 to create the images directory which is brittle;
instead, always ensure the directory exists before writing: call
createImagesDirectory(StaticLayoutContainers.getImagesDirectory())
unconditionally at the start of writePicture (or explicitly check File dir = new
File(StaticLayoutContainers.getImagesDirectory()) and create it if
!dir.exists()) so createImageFile won’t fail when indices are non-sequential;
update the writePicture method (referenced symbols: writePicture,
createImagesDirectory, StaticLayoutContainers.getImagesDirectory,
createImageFile, SemanticPicture.getPictureIndex) accordingly.
- Around line 338-359: drawImage currently records the full transformed image
bbox via PageImageReference and BoundingBox without considering any clipping;
implement tracking of the current clipping region (update state in
appendRectangle and clip) and intersect that clip with the transformed bbox
before adding to pageImages so stored bounds reflect visible area (use methods
around drawImage, appendRectangle, clip, pageImages, PageImageReference,
BoundingBox and pageIndex to locate code); if you decide not to support
clipping, add a clear TODO/comment in drawImage explaining that
appendRectangle/clip are no-ops and that stored bounds ignore clipping so
callers must handle clipped images.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cfc44069-d81c-4625-ad2d-2b69efeaf401
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/utils/ImagesUtilsTest.java
| protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) { | ||
| int pictureIndex = picture.getPictureIndex(); | ||
| if (contrastRatioConsumer == null) { | ||
| if (pictureIndex == 1) { | ||
| createImagesDirectory(StaticLayoutContainers.getImagesDirectory()); | ||
| contrastRatioConsumer = StaticLayoutContainers.getContrastRatioConsumer(pdfFilePath, password, false, null); | ||
| } | ||
| String imageFormat = StaticLayoutContainers.getImageFormat(); | ||
| String fileName = String.format(MarkdownSyntax.IMAGE_FILE_NAME_FORMAT, StaticLayoutContainers.getImagesDirectory(), File.separator, pictureIndex, imageFormat); | ||
| createImageFile(picture.getBoundingBox(), fileName, imageFormat); | ||
| createImageFile(picture.getBoundingBox(), fileName, imageFormat, pdfFilePath, password); | ||
| } |
There was a problem hiding this comment.
Directory creation relies on external picture indexing.
The writePicture method assumes picture.getPictureIndex() returns 1 for the first picture to trigger directory creation. If SemanticPicture indices don't start at 1 or are not sequential, the images directory may not be created before attempting to write the first picture file.
Consider using a flag or checking if the directory exists rather than relying on index values.
Suggested improvement
protected void writePicture(SemanticPicture picture, String pdfFilePath, String password) {
int pictureIndex = picture.getPictureIndex();
- if (pictureIndex == 1) {
- createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
- }
+ createImagesDirectory(StaticLayoutContainers.getImagesDirectory());
String imageFormat = StaticLayoutContainers.getImageFormat();Note: createImagesDirectory already checks !directory.exists() so calling it unconditionally is safe and more robust.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 119 - 127, The writePicture method depends on
picture.getPictureIndex() == 1 to create the images directory which is brittle;
instead, always ensure the directory exists before writing: call
createImagesDirectory(StaticLayoutContainers.getImagesDirectory())
unconditionally at the start of writePicture (or explicitly check File dir = new
File(StaticLayoutContainers.getImagesDirectory()) and create it if
!dir.exists()) so createImageFile won’t fail when indices are non-sequential;
update the writePicture method (referenced symbols: writePicture,
createImagesDirectory, StaticLayoutContainers.getImagesDirectory,
createImageFile, SemanticPicture.getPictureIndex) accordingly.
| @Override | ||
| public void drawImage(PDImage pdImage) { | ||
| Point2D lowerLeft = transformedPoint(0, 0); | ||
| Point2D lowerRight = transformedPoint(1, 0); | ||
| Point2D upperLeft = transformedPoint(0, 1); | ||
| Point2D upperRight = transformedPoint(1, 1); | ||
|
|
||
| double left = Math.min(Math.min(lowerLeft.getX(), lowerRight.getX()), Math.min(upperLeft.getX(), upperRight.getX())); | ||
| double right = Math.max(Math.max(lowerLeft.getX(), lowerRight.getX()), Math.max(upperLeft.getX(), upperRight.getX())); | ||
| double bottom = Math.min(Math.min(lowerLeft.getY(), lowerRight.getY()), Math.min(upperLeft.getY(), upperRight.getY())); | ||
| double top = Math.max(Math.max(lowerLeft.getY(), lowerRight.getY()), Math.max(upperLeft.getY(), upperRight.getY())); | ||
|
|
||
| pageImages.add(new PageImageReference(pdImage, new BoundingBox(pageIndex, left, bottom, right, top))); | ||
| } | ||
|
|
||
| @Override | ||
| public void appendRectangle(Point2D point0, Point2D point1, Point2D point2, Point2D point3) { | ||
| } | ||
|
|
||
| @Override | ||
| public void clip(int windingRule) { | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if there are any test PDFs or documentation about clipping support
rg -i "clip" --type java -C3 java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/Repository: opendataloader-project/opendataloader-pdf
Length of output: 809
🏁 Script executed:
wc -l java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 177
🏁 Script executed:
cat -n java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 18564
🏁 Script executed:
rg "ForkJoinPool" java/opendataloader-pdf-core/Repository: opendataloader-project/opendataloader-pdf
Length of output: 385
🏁 Script executed:
sed -n '1,100p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 4578
🏁 Script executed:
sed -n '100,250p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 7885
🏁 Script executed:
sed -n '250,350p' java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 5030
🏁 Script executed:
rg -A5 -B5 "new PageImageCollector" java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.javaRepository: opendataloader-project/opendataloader-pdf
Length of output: 516
Clipping path not applied to stored image bounds.
The drawImage() method stores the full transformed bounding box without intersecting it with the active clipping path. The clip() and appendRectangle() methods are no-ops, so clipping state is never tracked. For PDFs with clipped or masked images, the stored bounds will be larger than the visible region, which may cause findBestPageImage() to select incorrect candidates when multiple images overlap the target area.
If this is an acceptable limitation for the current use case (e.g., scanned PDFs typically don't use clipping on images), consider adding a code comment documenting this constraint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java`
around lines 338 - 359, drawImage currently records the full transformed image
bbox via PageImageReference and BoundingBox without considering any clipping;
implement tracking of the current clipping region (update state in
appendRectangle and clip) and intersect that clip with the transformed bbox
before adding to pageImages so stored bounds reflect visible area (use methods
around drawImage, appendRectangle, clip, pageImages, PageImageReference,
BoundingBox and pageIndex to locate code); if you decide not to support
clipping, add a clear TODO/comment in drawImage explaining that
appendRectangle/clip are no-ops and that stored bounds ignore clipping so
callers must handle clipped images.
|
Here's how the review process will go from here:
|
This replaces the old page sub-image extraction path with direct PDFBox image-object extraction, then decodes PDImageXObject instances with getOpaqueImage() to avoid native crashes on PDFs that contain transparency/soft masks.
It also closes the PDFBox document through try-with-resources and adds a regression test that verifies image extraction against a scanned PDF sample.
Checklist:
Verification:
Note: scripts/bench.sh currently exits before benchmark execution with "ARGS[@]: unbound variable" when no extra args are passed, so I could not complete that repository-level check from this branch.
Summary by CodeRabbit