Conversation
WalkthroughDefault image output directory naming now uses the filename portion before the last dot and replaces spaces with underscores; verapdf Maven property upper bound changed from exclusive to inclusive ( Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Line 328: DocumentProcessor currently uses
fileName.substring(fileName.length() - 4) to strip an extension into baseName
which fails for names with different extension lengths or no extension; change
the logic to find the last dot (fileName.lastIndexOf('.')) and take substring up
to that index (with a safe fallback to the whole fileName when no dot exists),
also guard for null/empty fileName, then apply the replace(" ", "_") on the
resulting baseName so you don't risk StringIndexOutOfBoundsException or
incorrect names.
In `@java/pom.xml`:
- Line 61: The verapdf.version Maven property currently uses an invalid/unstable
range "[1.31.0,1.32.0-RC]"; update the verapdf.version property to a stable,
published release (e.g., set verapdf.version to the exact stable "1.28.2") or
explicitly document the range if these are internal dev releases, and avoid
allowing pre-release qualifiers (RC) in release builds—alternatively constrain
the range to exclude prereleases or pin to an exact version to ensure
reproducible builds.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8db3829a-a1f8-4265-bbf7-1e9fdfce45fd
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.javajava/pom.xml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 327-331: Add an integration test in ImageDirIntegrationTest to
cover filenames with spaces and multiple dots (e.g., "my paper.v1.pdf") and
assert that DocumentProcessor constructs imagesDirectory using the sanitized
base name (DocumentProcessor: variable baseName/imagesDirectory) as
"my_paper.v1_images"; update or add tests to instantiate
DocumentProcessor.process (or the relevant method that sets imagesDirectory)
with such inputPdfName and verify the output folder path matches expected
MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX behavior for both spaced and multi-dot
names.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4d36c3dd-1cca-4b33-b02b-b622b48812fd
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
| String fileName = Paths.get(inputPdfName).getFileName().toString(); | ||
| String baseName = fileName.substring(0, fileName.length() - 4); | ||
| + int dotIndex = fileName.lastIndexOf('.'); | ||
| + String rawBaseName = dotIndex > 0 ? fileName.substring(0, dotIndex) : fileName; | ||
| + String baseName = rawBaseName.replace(" ", "_"); | ||
| imagesDirectory = config.getOutputFolder() + File.separator + baseName + MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add a regression test for spaced filenames (and multi-dot names).
Please add/update integration coverage to validate default image directory naming for inputs like "my paper.v1.pdf" (expect "my_paper.v1_images"), since current ImageDirIntegrationTest only covers 1901.03003.pdf.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`
around lines 327 - 331, Add an integration test in ImageDirIntegrationTest to
cover filenames with spaces and multiple dots (e.g., "my paper.v1.pdf") and
assert that DocumentProcessor constructs imagesDirectory using the sanitized
base name (DocumentProcessor: variable baseName/imagesDirectory) as
"my_paper.v1_images"; update or add tests to instantiate
DocumentProcessor.process (or the relevant method that sets imagesDirectory)
with such inputPdfName and verify the output folder path matches expected
MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX behavior for both spaced and multi-dot
names.
I have addressed the issue where images extracted from PDFs with filenames containing spaces fail to render in Markdown format.
Following your suggestion, I updated DocumentProcessor.java to replace spaces with underscores in the image directory name before it is assigned to StaticLayoutContainers.
This ensures both the output directory and referenced paths use consistent naming (for example, my_paper_images instead of my paper_images), which resolves the broken links without requiring separate URL encoding logic for Markdown or HTML generators.
I also verified that the project compiles successfully after the change.
Please let me know if any additional changes or testing are needed.
Summary by CodeRabbit
Bug Fixes
Chores