Bug Fix by ankitsingh0913 · Pull Request #411 · opendataloader-project/opendataloader-pdf

ankitsingh0913 · 2026-04-11T09:54:56Z

I have addressed the issue where images extracted from PDFs with filenames containing spaces fail to render in Markdown format.

Following your suggestion, I updated DocumentProcessor.java to replace spaces with underscores in the image directory name before it is assigned to StaticLayoutContainers.

This ensures both the output directory and referenced paths use consistent naming (for example, my_paper_images instead of my paper_images), which resolves the broken links without requiring separate URL encoding logic for Markdown or HTML generators.
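To illustrate why the space matters, here is a minimal sketch comparing the Markdown image reference for a spaced and an underscored directory name. The class `ImageLinkSketch`, the helper `imageReference`, and the file name `imageFile1.png` are hypothetical and not taken from the project. The `_images` suffix is an assumption based on the `my_paper_images` example above.

```java
public class ImageLinkSketch {
    // Assumed value; in the project the suffix comes from MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX.
    static final String SUFFIX = "_images";

    // Hypothetical helper: builds a Markdown image reference for a file in the images directory.
    static String imageReference(String baseName, String imageFile) {
        return "![image](" + baseName + SUFFIX + "/" + imageFile + ")";
    }

    public static void main(String[] args) {
        // A raw space is not allowed in an unbracketed CommonMark link destination,
        // so this reference is not parsed as an image.
        System.out.println(imageReference("my paper", "imageFile1.png"));

        // After replacing spaces with underscores the destination is a single token.
        System.out.println(imageReference("my paper".replace(" ", "_"), "imageFile1.png"));
    }
}
```

Renaming the directory at the source keeps the on-disk folder and the emitted reference identical, which is why no generator-specific URL encoding is needed.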

I also verified that the project compiles successfully after the change.

Please let me know if any additional changes or testing are needed.

Summary by CodeRabbit

Bug Fixes
- Image output directory names now preserve filename stems (handles names without extensions) and replace spaces with underscores to avoid malformed image folders when no custom image directory is set.
Chores
- Adjusted a dependency version range for improved resolution and compatibility.

CLAassistant · 2026-04-11T09:55:08Z

All committers have signed the CLA.

coderabbitai · 2026-04-11T09:55:18Z

Walkthrough

Default image output directory naming now uses the filename portion before the last dot and replaces spaces with underscores; verapdf Maven property upper bound changed from exclusive to inclusive (1.32.0-RC).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Image Directory Naming `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java` | Changed logic that derives default `imagesDirectory`: extract part before last `.` (handles missing extensions) and replace spaces with underscores before appending `MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX`. |
| Dependency Version Update `java/pom.xml` | Updated `${verapdf.version}` range upper bound from `[1.31.0,1.32.0-RC)` to `[1.31.0,1.32.0-RC]` (inclusive). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Image paths with spaces break markdown preview #405 — Replaces spaces with underscores in generated images directory names, matching the suggested fix to avoid spaces in image paths.

Suggested reviewers

MaximPlusov
LonelyMidoriya
hnc-jglee
hyunhee-jo

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Bug Fix' is generic and non-descriptive; it does not convey meaningful information about the specific changes (filename extraction and space handling in image directory naming). | Use a more descriptive title that summarizes the main change, such as 'Fix PDF image rendering by replacing spaces with underscores in directory names' or 'Handle filename extraction safely for image directories'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Line 328: DocumentProcessor currently uses
fileName.substring(0, fileName.length() - 4) to strip an extension into baseName
which fails for names with different extension lengths or no extension; change
the logic to find the last dot (fileName.lastIndexOf('.')) and take substring up
to that index (with a safe fallback to the whole fileName when no dot exists),
also guard for null/empty fileName, then apply the replace(" ", "_") on the
resulting baseName so you don't risk StringIndexOutOfBoundsException or
incorrect names.

In `@java/pom.xml`:
- Line 61: The verapdf.version Maven property currently uses an invalid/unstable
range "[1.31.0,1.32.0-RC]"; update the verapdf.version property to a stable,
published release (e.g., set verapdf.version to the exact stable "1.28.2") or
explicitly document the range if these are internal dev releases, and avoid
allowing pre-release qualifiers (RC) in release builds—alternatively constrain
the range to exclude prereleases or pin to an exact version to ensure
reproducible builds.
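The first finding above can be sketched as a small helper. This is a minimal sketch, not the project's code: the class `BaseNameSketch` and the method `deriveBaseName` are hypothetical names, and rejecting null or empty input with an exception is an assumption, since the review only asks for a guard without saying what it should do.

```java
public class BaseNameSketch {
    /**
     * Strips the extension (the text after the last dot) and replaces spaces
     * with underscores. Throwing on null/empty input is an assumed policy.
     */
    static String deriveBaseName(String fileName) {
        if (fileName == null || fileName.isEmpty()) {
            throw new IllegalArgumentException("fileName must not be null or empty");
        }
        int dotIndex = fileName.lastIndexOf('.');
        // dotIndex > 0 leaves names without a dot, and names that only start
        // with a dot, unchanged instead of producing an empty base name.
        String rawBaseName = dotIndex > 0 ? fileName.substring(0, dotIndex) : fileName;
        return rawBaseName.replace(" ", "_");
    }

    public static void main(String[] args) {
        System.out.println(deriveBaseName("my paper.pdf")); // extension of length 3
        System.out.println(deriveBaseName("scan.jpeg"));    // extension of length 4
        System.out.println(deriveBaseName("ab"));           // no extension at all
    }
}
```

For comparison, a fixed-width `substring(0, fileName.length() - 4)` would turn `scan.jpeg` into `scan.` and would throw `StringIndexOutOfBoundsException` for a two-character name such as `ab`.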

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8db3829a-a1f8-4265-bbf7-1e9fdfce45fd

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 334f58b.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
java/pom.xml

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 327-331: Add an integration test in ImageDirIntegrationTest to
cover filenames with spaces and multiple dots (e.g., "my paper.v1.pdf") and
assert that DocumentProcessor constructs imagesDirectory using the sanitized
base name (DocumentProcessor: variable baseName/imagesDirectory) as
"my_paper.v1_images"; update or add tests to instantiate
DocumentProcessor.process (or the relevant method that sets imagesDirectory)
with such inputPdfName and verify the output folder path matches expected
MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX behavior for both spaced and multi-dot
names.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4d36c3dd-1cca-4b33-b02b-b622b48812fd

📥 Commits

Reviewing files that changed from the base of the PR and between 334f58b and c50e40e.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

coderabbitai · 2026-04-11T10:13:23Z

                 String fileName = Paths.get(inputPdfName).getFileName().toString();
-                String baseName = fileName.substring(0, fileName.length() - 4);
+                int dotIndex = fileName.lastIndexOf('.');
+                String rawBaseName = dotIndex > 0 ? fileName.substring(0, dotIndex) : fileName;
+                String baseName = rawBaseName.replace(" ", "_");
                 imagesDirectory = config.getOutputFolder() + File.separator + baseName + MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX;


🧹 Nitpick | 🔵 Trivial

Add a regression test for spaced filenames (and multi-dot names).

Please add/update integration coverage to validate default image directory naming for inputs like "my paper.v1.pdf" (expect "my_paper.v1_images"), since current ImageDirIntegrationTest only covers 1901.03003.pdf.
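A regression check along these lines might look like the following. It is a plain-Java sketch rather than a JUnit test inside `ImageDirIntegrationTest`: `defaultImagesDirName` is a hypothetical helper that mirrors the naming logic from the diff, and the `_images` suffix is an assumed value for `MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX`, inferred from the expected name `my_paper.v1_images`.

```java
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class ImagesDirNameCheck {
    // Assumed value of MarkdownSyntax.IMAGES_DIRECTORY_SUFFIX.
    static final String IMAGES_DIRECTORY_SUFFIX = "_images";

    // Hypothetical helper mirroring the naming logic in the diff (without the output folder prefix).
    static String defaultImagesDirName(String inputPdfName) {
        String fileName = Paths.get(inputPdfName).getFileName().toString();
        int dotIndex = fileName.lastIndexOf('.');
        String rawBaseName = dotIndex > 0 ? fileName.substring(0, dotIndex) : fileName;
        return rawBaseName.replace(" ", "_") + IMAGES_DIRECTORY_SUFFIX;
    }

    public static void main(String[] args) {
        // Input name -> expected directory name.
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("my paper.v1.pdf", "my_paper.v1_images");   // spaces and multiple dots
        cases.put("docs/my paper.pdf", "my_paper_images");    // parent folder is ignored
        cases.put("1901.03003.pdf", "1901.03003_images");     // the existing test input
        cases.put("no extension", "no_extension_images");     // no dot in the name

        for (Map.Entry<String, String> c : cases.entrySet()) {
            String actual = defaultImagesDirName(c.getKey());
            if (!actual.equals(c.getValue())) {
                throw new AssertionError(c.getKey() + ": expected " + c.getValue() + " but got " + actual);
            }
        }
        System.out.println("all cases passed");
    }
}
```

In the real test suite the same cases would be asserted against the directory that the processor actually creates, not against a copy of the logic.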

Bug Fix

334f58b

ankitsingh0913 requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 11, 2026 09:55

coderabbitai bot reviewed Apr 11, 2026

ankitsingh0913 added 2 commits April 11, 2026 15:36

Fix: handle filename extraction safely using lastIndexOf

10b0bf9

Fix: Handle filename extraction safely using lastIndexOf

c50e40e

coderabbitai bot reviewed Apr 11, 2026

ankitsingh0913 and others added 2 commits April 11, 2026 15:48

Fix: Undone the verapdf version range change

1d1cae1

Merge branch 'main' into bugFix

21b78dd

ankitsingh0913 wants to merge 5 commits into opendataloader-project:main from ankitsingh0913:bugFix

2 participants
