feat: add --dry-run flag to preview extraction without writing files by saheersk · Pull Request #413 · opendataloader-project/opendataloader-pdf

saheersk · 2026-04-12T03:14:34Z

Adds a --dry-run CLI option that runs the full PDF parsing and extraction
pipeline but skips all file I/O. Instead it prints a summary to stdout:
page count, element count, reading order, table method, output formats,
and the output directory that would have been used.

Useful for validating extraction settings and debugging without
polluting the filesystem or needing to clean up afterward.

- Config: dryRun field + isDryRun() / setDryRun()
- CLIOptions: --dry-run option definition (exported), wired into createConfigFromCommandLine()
- DocumentProcessor: early return in generateOutputs() + printDryRunSummary()
- Synced options.json and generated Python / Node.js bindings
- CLIOptionsTest: three unit tests covering default, flag present, option registered
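For readers who want the shape of the change without opening the diff, here is a minimal sketch of the early-return pattern described above. It is not the PR's code: `SketchConfig`, `OutputSink`, and `DryRunSketch` are hypothetical stand-ins, pages are modeled as lists of strings, and the summary is returned as a string rather than printed so the behavior is easy to check. Only the names `isDryRun()`, `setDryRun()`, and `generateOutputs()` come from the description.

```java
import java.util.List;

// Hypothetical stand-in for DocumentProcessor; not the project's real class.
final class DryRunSketch {
    static String generateOutputs(List<List<String>> pages, SketchConfig config, OutputSink sink) {
        if (config.isDryRun()) {
            // Early return: parsing has already happened, but the sink is never touched.
            return buildDryRunSummary(pages);
        }
        for (int i = 0; i < pages.size(); i++) {
            sink.write("page-" + (i + 1) + ".md", String.join("\n", pages.get(i)));
        }
        return "Wrote " + pages.size() + " file(s).";
    }

    // Assumed summary format; the real summary also lists reading order, formats, and so on.
    static String buildDryRunSummary(List<List<String>> pages) {
        int elements = pages.stream().mapToInt(List::size).sum();
        return "Pages: " + pages.size() + ", elements: " + elements + ". No files written.";
    }

    public static void main(String[] args) {
        SketchConfig config = new SketchConfig();
        config.setDryRun(true);
        List<List<String>> pages = List.of(List.of("heading", "paragraph"), List.of("table"));
        // The sink throws to demonstrate that a dry run never reaches it.
        System.out.println(generateOutputs(pages, config, (name, text) -> {
            throw new IllegalStateException("dry run must not write " + name);
        }));
    }
}

// Hypothetical stand-in for Config: a boolean field with accessors, false by default.
final class SketchConfig {
    private boolean dryRun;

    boolean isDryRun() { return dryRun; }

    void setDryRun(boolean dryRun) { this.dryRun = dryRun; }
}

// Hypothetical abstraction over file writing so the sketch needs no real I/O.
interface OutputSink {
    void write(String fileName, String content);
}
```

Because the flag defaults to false, callers that never set it keep the existing write path.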

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

Summary by CodeRabbit

New Features
- Added --dry-run CLI option to preview PDF analysis without writing output files.
- Dry-run displays a summary: page count, element count, reading order, table method, output formats, and output folder.
Documentation
- CLI reference updated with --dry-run description and behavior.
Tests
- Added unit tests validating --dry-run option presence and Config behavior.

CLAassistant · 2026-04-12T03:14:45Z

All committers have signed the CLA.

coderabbitai · 2026-04-12T03:14:53Z

Walkthrough

Adds a new --dry-run CLI option across Java, TypeScript, and Python; exposes it on Config; and makes DocumentProcessor generate a summary and return early when dry-run is enabled (no output files written).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Java CLI Layer `java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java`, `java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIOptionsTest.java` | Registered `--dry-run` option and mapped it in `applyHybridOptions()` to `config.setDryRun(true)`; added tests verifying option presence and Config behavior with/without the flag. |
| Java Core Configuration `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java` | Added `dryRun` boolean field with `isDryRun()` and `setDryRun(boolean)` accessors. |
| Java Document Processor `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java` | Extended `generateOutputs(...)` to accept `pagesToProcess` and, when `config.isDryRun()` is true, call `printDryRunSummary(...)` and return early (skip all file/image/JSON/MD/HTML writing). |
| TypeScript/Node Options `node/opendataloader-pdf/src/cli-options.generated.ts`, `node/opendataloader-pdf/src/convert-options.generated.ts` | Added `--dry-run` Commander option; added optional `dryRun?: boolean` to `ConvertOptions`/`CliOptions`; propagated flag through `buildConvertOptions()` and `buildArgs()`. |
| Python Options `python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py` | Added `--dry-run` (python_name: `dry_run`) boolean CLI option entry to generated `CLI_OPTIONS` (default False). |
| Config Schema / Docs `options.json`, `content/docs/cli-options-reference.mdx` | Added `dry-run` option definition to `options.json` and documented `--dry-run` in CLI options reference (prints summary, no file writes). |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User
  participant CLI as CLI Parser
  participant Config as Config
  participant Processor as DocumentProcessor
  participant FS as FileSystem

  User->>CLI: invoke tool with --dry-run
  CLI->>Config: build and set options (dryRun=true)
  CLI->>Processor: call generateOutputs(pagesToProcess, config)
  Processor->>Config: read isDryRun()
  alt dry-run == true
    Processor->>Processor: compute totals (pages, elements, formats)
    Processor->>CLI: printDryRunSummary(...)
    Processor-->>User: "No files written."
  else dry-run == false
    Processor->>FS: create dirs / open writers
    Processor->>FS: write files (images, pdf, json, md, html)
    Processor-->>User: outputs written
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related issues

Add --dry-run flag to preview PDF processing without writing output #412: Implements the same --dry-run feature (CLI flag, Config wiring, DocumentProcessor summary without writing outputs).

Possibly related PRs

perf: parallelize page processing — 6.5x faster #362: Similar changes adding boolean CLI flags and early-return behavior in DocumentProcessor.generateOutputs().

Suggested reviewers

bundolee
MaximPlusov
LonelyMidoriya
hyunhee-jo

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main feature: adding a --dry-run flag to preview extraction without writing files, which accurately reflects the primary purpose of all changes across the codebase. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 366-380: printDryRunSummary currently computes processedPages
using contents.size(), which is wrong for filtered runs; change the method
signature of printDryRunSummary(String inputPdfName, List<List<IObject>>
contents, Config config) to accept the validated pagesToProcess (int
pagesToProcess) and use that value for processedPages instead of
contents.size(); update all callers (where printDryRunSummary is invoked) to
pass the validated pagesToProcess value produced during page validation so the
summary reflects the actual pages to be processed.
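To illustrate the finding, here is a small sketch of why `contents.size()` can overstate the page count on a filtered run. It rests on assumptions not confirmed by the PR: that `contents` holds one entry per page of the document regardless of the page filter, and that validation means keeping distinct requested pages inside the valid range. `PageCountSketch`, `validatedPageCount`, and `summaryLine` are hypothetical names.

```java
import java.util.List;

// Hypothetical helper; the real validation logic in DocumentProcessor may differ.
final class PageCountSketch {
    // Counts the requested pages that actually exist; an empty request means "all pages".
    static int validatedPageCount(int totalPages, List<Integer> requestedPages) {
        if (requestedPages == null || requestedPages.isEmpty()) {
            return totalPages;
        }
        return (int) requestedPages.stream()
                .filter(p -> p >= 1 && p <= totalPages)
                .distinct()
                .count();
    }

    // The summary takes the validated count as a parameter instead of deriving it from contents.
    static String summaryLine(List<List<String>> contents, int pagesToProcess) {
        return "Pages: " + pagesToProcess + " of " + contents.size();
    }

    public static void main(String[] args) {
        List<List<String>> contents =
                List.of(List.of("a"), List.of("b"), List.of("c"), List.of("d"), List.of("e"));
        // Page 9 does not exist in a five-page document, so only pages 2 and 3 count.
        int pagesToProcess = validatedPageCount(contents.size(), List.of(2, 3, 9));
        System.out.println(summaryLine(contents, pagesToProcess));
    }
}
```

Passing the count in keeps the summary method free of any knowledge about how page selection was validated.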

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5e87f82d-e73b-47b2-9c8b-d6ad2984e8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 1fd539e.

📒 Files selected for processing (8)

java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java
java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIOptionsTest.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
node/opendataloader-pdf/src/cli-options.generated.ts
node/opendataloader-pdf/src/convert-options.generated.ts
options.json
python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@content/docs/cli-options-reference.mdx`:
- Line 42: The docs row for `--dry-run` diverges from the source-of-truth in
options.json; update the `dry-run` entry in options.json so its description
matches the intended text shown in the docs (the sentence beginning "Parse and
analyze the PDF without writing any output files..."), then regenerate the
documentation artifacts (run the project's docs generation/build script) so
content/docs/cli-options-reference.mdx is re-synced; verify the updated
description appears in the regenerated file and commit both the options.json
change and the regenerated docs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a82ef120-6bd7-48b4-8bbb-45e861e6035f

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd539e and 4a1457a.

📒 Files selected for processing (2)

content/docs/cli-options-reference.mdx
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

bundolee · 2026-04-15T10:59:20Z

Thanks for the thorough work — bindings synced, tests included, docs updated. Appreciated.

We've posted our thoughts on the alternative approach in #412. Closing this PR in favor of that discussion, but if the community identifies a workflow where the temp-directory workaround falls short, we're open to revisiting.

Thanks again for the contribution!

bundolee · 2026-04-15T10:59:40Z

For reference, the alternative approach we recommend today:

opendataloader-pdf input.pdf -o /tmp/odl-preview -f markdown
cat /tmp/odl-preview/*.md
rm -rf /tmp/odl-preview

More context in #412.
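For callers using the Java API rather than a shell, the same temp-directory idea can be sketched as below. This is an illustration under assumptions, not project code: `TempPreviewSketch` and its `Extractor` callback are hypothetical, and the callback stands in for whatever actually runs the extraction into the given directory. It reads back any `.md` files and always removes the directory afterward.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper mirroring the shell workaround: extract, read, clean up.
final class TempPreviewSketch {
    // Hypothetical callback that writes extraction output into outputDir.
    interface Extractor {
        void extractTo(Path outputDir) throws IOException;
    }

    static String preview(Extractor extractor) {
        try {
            Path dir = Files.createTempDirectory("odl-preview");
            try {
                extractor.extractTo(dir);
                return readMarkdown(dir);
            } finally {
                deleteRecursively(dir); // runs even if extraction or reading fails
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Concatenates the contents of all .md files in dir, in sorted path order.
    private static String readMarkdown(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(p -> p.toString().endsWith(".md"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        StringBuilder sb = new StringBuilder();
        for (Path f : files) {
            sb.append(Files.readString(f));
        }
        return sb.toString();
    }

    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so children are deleted before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) {
        // A fake extractor that writes one Markdown file.
        System.out.println(preview(dir -> Files.writeString(dir.resolve("input.md"), "# Preview")));
    }
}
```

Unlike a dry run, this still performs the writes; it only guarantees they land somewhere disposable.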

saheersk requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 12, 2026 03:14

coderabbitai bot reviewed Apr 12, 2026

Comment thread ...endataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java Outdated

fix: use validated pagesToProcess count in dry-run summary instead of contents.size()

4a1457a

coderabbitai bot reviewed Apr 12, 2026

Comment thread content/docs/cli-options-reference.mdx

bundolee closed this Apr 15, 2026

saheersk wants to merge 2 commits into opendataloader-project:main from saheersk:feat/412

3 participants
