Skip to content

Bug Fix#411

Open
ankitsingh0913 wants to merge 5 commits intoopendataloader-project:mainfrom
ankitsingh0913:bugFix
Open

Bug Fix#411
ankitsingh0913 wants to merge 5 commits intoopendataloader-project:mainfrom
ankitsingh0913:bugFix

Conversation

@ankitsingh0913
Copy link
Copy Markdown

@ankitsingh0913 ankitsingh0913 commented Apr 11, 2026

I have addressed the issue where images extracted from PDFs with filenames containing spaces fail to render in Markdown format.

Following your suggestion, I updated DocumentProcessor.java to replace spaces with underscores in the image directory name before it is assigned to StaticLayoutContainers.

This ensures both the output directory and referenced paths use consistent naming (for example, my_paper_images instead of my paper_images), which resolves the broken links without requiring separate URL encoding logic for Markdown or HTML generators.

I also verified that the project compiles successfully after the change.

Please let me know if any additional changes or testing are needed.

Summary by CodeRabbit

  • Bug Fixes

    • Image output directory names now preserve filename stems (handles names without extensions) and replace spaces with underscores to avoid malformed image folders when no custom image directory is set.
  • Chores

    • Adjusted a dependency version range for improved resolution and compatibility.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 11, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 11, 2026

Walkthrough

Default image output directory naming now uses the filename portion before the last dot and replaces spaces with underscores; verapdf Maven property upper bound changed from exclusive to inclusive (1.32.0-RC).

Changes

Cohort / File(s) Summary
Image Directory Naming
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
Changed logic that derives default imagesDirectory: extract part before last . (handles missing extensions) and replace spaces with underscores before appending MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX.
Dependency Version Update
java/pom.xml
Updated ${verapdf.version} range upper bound from [1.31.0,1.32.0-RC) to [1.31.0,1.32.0-RC] (inclusive).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hnc-jglee
  • hyunhee-jo
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Bug Fix' is generic and non-descriptive; it does not convey meaningful information about the specific changes (filename extraction and space handling in image directory naming). Use a more descriptive title that summarizes the main change, such as 'Fix PDF image rendering by replacing spaces with underscores in directory names' or 'Handle filename extraction safely for image directories'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Line 328: DocumentProcessor currently uses
fileName.substring(fileName.length() - 4) to strip an extension into baseName
which fails for names with different extension lengths or no extension; change
the logic to find the last dot (fileName.lastIndexOf('.')) and take substring up
to that index (with a safe fallback to the whole fileName when no dot exists),
also guard for null/empty fileName, then apply the replace(" ", "_") on the
resulting baseName so you don't risk StringIndexOutOfBoundsException or
incorrect names.

In `@java/pom.xml`:
- Line 61: The verapdf.version Maven property currently uses an invalid/unstable
range "[1.31.0,1.32.0-RC]"; update the verapdf.version property to a stable,
published release (e.g., set verapdf.version to the exact stable "1.28.2") or
explicitly document the range if these are internal dev releases, and avoid
allowing pre-release qualifiers (RC) in release builds—alternatively constrain
the range to exclude prereleases or pin to an exact version to ensure
reproducible builds.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8db3829a-a1f8-4265-bbf7-1e9fdfce45fd

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 334f58b.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
  • java/pom.xml

Comment thread java/pom.xml Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 327-331: Add an integration test in ImageDirIntegrationTest to
cover filenames with spaces and multiple dots (e.g., "my paper.v1.pdf") and
assert that DocumentProcessor constructs imagesDirectory using the sanitized
base name (DocumentProcessor: variable baseName/imagesDirectory) as
"my_paper.v1_images"; update or add tests to instantiate
DocumentProcessor.process (or the relevant method that sets imagesDirectory)
with such inputPdfName and verify the output folder path matches expected
MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX behavior for both spaced and multi-dot
names.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4d36c3dd-1cca-4b33-b02b-b622b48812fd

📥 Commits

Reviewing files that changed from the base of the PR and between 334f58b and c50e40e.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Comment on lines 327 to 331
String fileName = Paths.get(inputPdfName).getFileName().toString();
String baseName = fileName.substring(0, fileName.length() - 4);
+ int dotIndex = fileName.lastIndexOf('.');
+ String rawBaseName = dotIndex > 0 ? fileName.substring(0, dotIndex) : fileName;
+ String baseName = rawBaseName.replace(" ", "_");
imagesDirectory = config.getOutputFolder() + File.separator + baseName + MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add a regression test for spaced filenames (and multi-dot names).

Please add/update integration coverage to validate default image directory naming for inputs like "my paper.v1.pdf" (expect "my_paper.v1_images"), since current ImageDirIntegrationTest only covers 1901.03003.pdf.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`
around lines 327 - 331, Add an integration test in ImageDirIntegrationTest to
cover filenames with spaces and multiple dots (e.g., "my paper.v1.pdf") and
assert that DocumentProcessor constructs imagesDirectory using the sanitized
base name (DocumentProcessor: variable baseName/imagesDirectory) as
"my_paper.v1_images"; update or add tests to instantiate
DocumentProcessor.process (or the relevant method that sets imagesDirectory)
with such inputPdfName and verify the output folder path matches expected
MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX behavior for both spaced and multi-dot
names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants