feat: add --dry-run flag to preview extraction without writing files #413

Closed
saheersk wants to merge 2 commits into opendataloader-project:main from saheersk:feat/412

@saheersk saheersk commented Apr 12, 2026

Adds a --dry-run CLI option that runs the full PDF parsing and extraction
pipeline but skips all file I/O. Instead it prints a summary to stdout:
page count, element count, reading order, table method, output formats,
and the output directory that would have been used.

Useful for validating extraction settings and debugging without
polluting the filesystem or needing to clean up afterward.

  • Config: dryRun field + isDryRun() / setDryRun()
  • CLIOptions: --dry-run option definition (exported), wired into createConfigFromCommandLine()
  • DocumentProcessor: early return in generateOutputs() + printDryRunSummary()
  • Synced options.json and generated Python / Node.js bindings
  • CLIOptionsTest: three unit tests covering default, flag present, option registered
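
The early-return behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SketchConfig` and `SketchProcessor` are hypothetical stand-ins for `Config` and `DocumentProcessor`, strings stand in for the extracted elements, the summary wording is invented, and the file writes are simulated by recording names in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the project's Config: only the dry-run accessors are modeled.
class SketchConfig {
    private boolean dryRun; // defaults to false, so files are written as usual

    boolean isDryRun() { return dryRun; }

    void setDryRun(boolean dryRun) { this.dryRun = dryRun; }
}

// Hypothetical processor: "writes" are recorded in a list instead of hitting disk.
class SketchProcessor {
    final List<String> writtenFiles = new ArrayList<>();

    void generateOutputs(String inputPdfName, int pagesToProcess,
                         List<List<String>> contents, SketchConfig config) {
        if (config.isDryRun()) {
            System.out.print(buildDryRunSummary(inputPdfName, pagesToProcess, contents));
            return; // early return: none of the simulated writers below run
        }
        writtenFiles.add(inputPdfName + ".json");
        writtenFiles.add(inputPdfName + ".md");
    }

    // Each inner list holds the elements of one page; the page count is
    // supplied by the caller rather than derived from the list.
    static String buildDryRunSummary(String inputPdfName, int pagesToProcess,
                                     List<List<String>> contents) {
        int elementCount = 0;
        for (List<String> page : contents) {
            elementCount += page.size();
        }
        return "Dry run: " + inputPdfName + "\n"
                + "Pages: " + pagesToProcess + "\n"
                + "Elements: " + elementCount + "\n"
                + "No files written.\n";
    }
}
```

Keeping the summary in a method that returns a string, rather than printing directly, makes the dry-run path easy to assert on in a unit test.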

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Summary by CodeRabbit

  • New Features

    • Added --dry-run CLI option to preview PDF analysis without writing output files.
    • Dry-run displays a summary: page count, element count, reading order, table method, output formats, and output folder.
  • Documentation

    • CLI reference updated with --dry-run description and behavior.
  • Tests

    • Added unit tests validating --dry-run option presence and Config behavior.


CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Apr 12, 2026

Walkthrough

Adds a new --dry-run CLI option across Java, TypeScript, and Python; exposes it on Config; and makes DocumentProcessor generate a summary and return early when dry-run is enabled (no output files written).

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Java CLI Layer | java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java, java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIOptionsTest.java | Registered --dry-run option and mapped it in applyHybridOptions() to config.setDryRun(true); added tests verifying option presence and Config behavior with/without the flag. |
| Java Core Configuration | java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java | Added dryRun boolean field with isDryRun() and setDryRun(boolean) accessors. |
| Java Document Processor | java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java | Extended generateOutputs(...) to accept pagesToProcess and, when config.isDryRun() is true, call printDryRunSummary(...) and return early (skip all file/image/JSON/MD/HTML writing). |
| TypeScript/Node Options | node/opendataloader-pdf/src/cli-options.generated.ts, node/opendataloader-pdf/src/convert-options.generated.ts | Added --dry-run Commander option; added optional dryRun?: boolean to ConvertOptions/CliOptions; propagated flag through buildConvertOptions() and buildArgs(). |
| Python Options | python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py | Added --dry-run (python_name: dry_run) boolean CLI option entry to generated CLI_OPTIONS (default False). |
| Config Schema / Docs | options.json, content/docs/cli-options-reference.mdx | Added dry-run option definition to options.json and documented --dry-run in CLI options reference (prints summary, no file writes). |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User
  participant CLI as CLI Parser
  participant Config as Config
  participant Processor as DocumentProcessor
  participant FS as FileSystem

  User->>CLI: invoke tool with --dry-run
  CLI->>Config: build and set options (dryRun=true)
  CLI->>Processor: call generateOutputs(pagesToProcess, config)
  Processor->>Config: read isDryRun()
  alt dry-run == true
    Processor->>Processor: compute totals (pages, elements, formats)
    Processor->>CLI: printDryRunSummary(...)
    Processor-->>User: "No files written."
  else dry-run == false
    Processor->>FS: create dirs / open writers
    Processor->>FS: write files (images, pdf, json, md, html)
    Processor-->>User: outputs written
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • bundolee
  • MaximPlusov
  • LonelyMidoriya
  • hyunhee-jo
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main feature: adding a --dry-run flag to preview extraction without writing files, which accurately reflects the primary purpose of all changes across the codebase. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 366-380: printDryRunSummary currently computes processedPages
using contents.size(), which is wrong for filtered runs; change the method
signature of printDryRunSummary(String inputPdfName, List<List<IObject>>
contents, Config config) to accept the validated pagesToProcess (int
pagesToProcess) and use that value for processedPages instead of
contents.size(); update all callers (where printDryRunSummary is invoked) to
pass the validated pagesToProcess value produced during page validation so the
summary reflects the actual pages to be processed.
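
A small sketch of why the review comment matters, under the assumption (implied by the comment, not confirmed here) that `contents` holds one entry per page of the document even when only a subset of pages was requested. `PageCountSketch` and both method names are hypothetical, and strings stand in for `IObject`.

```java
import java.util.List;

class PageCountSketch {
    // Before the suggested change: the count is derived from the content list alone,
    // so a filtered run would still report every page in the document.
    static int countFromContents(List<List<String>> contents) {
        return contents.size();
    }

    // After the suggested change: the caller passes the validated number of pages
    // to process, and the summary reports that value.
    static String pagesLine(int pagesToProcess, List<List<String>> contents) {
        return "Pages: " + pagesToProcess + " (of " + contents.size() + " in document)";
    }
}
```

For a five-page document where two pages were selected, `countFromContents` would report 5 while `pagesLine(2, contents)` reports 2.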

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5e87f82d-e73b-47b2-9c8b-d6ad2984e8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 85fa506 and 1fd539e.

📒 Files selected for processing (8)
  • java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java
  • java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIOptionsTest.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
  • node/opendataloader-pdf/src/cli-options.generated.ts
  • node/opendataloader-pdf/src/convert-options.generated.ts
  • options.json
  • python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@content/docs/cli-options-reference.mdx`:
- Line 42: The docs row for `--dry-run` diverges from the source-of-truth in
options.json; update the `dry-run` entry in options.json so its description
matches the intended text shown in the docs (the sentence beginning "Parse and
analyze the PDF without writing any output files..."), then regenerate the
documentation artifacts (run the project's docs generation/build script) so
content/docs/cli-options-reference.mdx is re-synced; verify the updated
description appears in the regenerated file and commit both the options.json
change and the regenerated docs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a82ef120-6bd7-48b4-8bbb-45e861e6035f

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd539e and 4a1457a.

📒 Files selected for processing (2)
  • content/docs/cli-options-reference.mdx
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Comment thread content/docs/cli-options-reference.mdx
@bundolee
Contributor

Thanks for the thorough work — bindings synced, tests included, docs updated. Appreciated.

We've posted our thoughts on the alternative approach in #412. Closing this PR in favor of that discussion, but if the community identifies a workflow where the temp-directory workaround falls short, we're open to revisiting.

Thanks again for the contribution!

@bundolee bundolee closed this Apr 15, 2026
@bundolee
Contributor

For reference, the alternative approach we recommend today:

opendataloader-pdf input.pdf -o /tmp/odl-preview -f markdown
cat /tmp/odl-preview/*.md
rm -rf /tmp/odl-preview

More context in #412.
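
For callers driving the tool from Java, the same preview-then-clean-up steps could be wrapped roughly as below. This is a sketch under assumptions: `TempPreviewSketch` and its methods are hypothetical, the command line simply mirrors the shell example above, and it assumes `opendataloader-pdf` is on the `PATH` and writes `.md` files directly into the output folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TempPreviewSketch {
    // Same arguments as the shell example, with the output folder made a parameter.
    static List<String> buildCommand(Path inputPdf, Path outputDir) {
        return List.of("opendataloader-pdf", inputPdf.toString(),
                "-o", outputDir.toString(), "-f", "markdown");
    }

    // Deletes a directory tree; reverse ordering visits children before parents.
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    // Runs the CLI into a fresh temp directory, prints any Markdown output,
    // and removes the directory even if the run fails.
    static void preview(Path inputPdf) throws IOException, InterruptedException {
        Path outputDir = Files.createTempDirectory("odl-preview");
        try {
            Process process = new ProcessBuilder(buildCommand(inputPdf, outputDir))
                    .inheritIO()
                    .start();
            process.waitFor();
            List<Path> outputs;
            try (Stream<Path> files = Files.list(outputDir)) {
                outputs = files.filter(p -> p.toString().endsWith(".md"))
                        .collect(Collectors.toList());
            }
            for (Path markdown : outputs) {
                System.out.println(Files.readString(markdown));
            }
        } finally {
            deleteRecursively(outputDir);
        }
    }
}
```

Using `Files.createTempDirectory` instead of a fixed `/tmp/odl-preview` path avoids collisions between concurrent runs.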
