feat: add --dry-run flag to preview extraction without writing files #413
saheersk wants to merge 2 commits into opendataloader-project:main
Adds a --dry-run CLI option that runs the full PDF parsing and extraction pipeline but skips all file I/O. Instead it prints a summary to stdout: page count, element count, reading order, table method, output formats, and the output directory that would have been used.

Useful for validating extraction settings and debugging without polluting the filesystem or needing to clean up afterward.

- Config: dryRun field + isDryRun() / setDryRun()
- CLIOptions: --dry-run option definition (exported), wired into createConfigFromCommandLine()
- DocumentProcessor: early return in generateOutputs() + printDryRunSummary()
- Synced options.json and generated Python / Node.js bindings
- CLIOptionsTest: three unit tests covering default, flag present, option registered
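To make the described control flow concrete, here is a minimal sketch of a boolean config flag gating an early return before any writer runs. It is not the PR's actual code: `SketchConfig`, `SketchProcessor`, the `formats` parameter, and the in-memory `writtenFiles` list are hypothetical stand-ins for the project's `Config` and `DocumentProcessor`, chosen so the example runs without touching the filesystem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the project's Config; only the dry-run flag is modeled.
class SketchConfig {
    private boolean dryRun = false; // off by default, so existing runs are unchanged

    boolean isDryRun() { return dryRun; }

    void setDryRun(boolean dryRun) { this.dryRun = dryRun; }
}

// Hypothetical stand-in for DocumentProcessor and its generateOutputs() step.
class SketchProcessor {
    // "Writes" are recorded in memory so the sketch itself creates no files.
    final List<String> writtenFiles = new ArrayList<>();

    String generateOutputs(String inputPdfName, List<String> formats, SketchConfig config) {
        if (config.isDryRun()) {
            // Early return: report what would happen, then skip every writer.
            return "Dry run for " + inputPdfName + ": formats=" + formats + ". No files written.";
        }
        for (String format : formats) {
            writtenFiles.add(inputPdfName + "." + format);
        }
        return "Wrote " + writtenFiles.size() + " file(s).";
    }
}

public class DryRunSketch {
    public static void main(String[] args) {
        SketchConfig config = new SketchConfig();
        config.setDryRun(true);
        SketchProcessor processor = new SketchProcessor();
        System.out.println(processor.generateOutputs("input.pdf", List.of("json", "markdown"), config));
    }
}
```

Placing the check at the top of the output step, rather than inside each writer, keeps the parsing pipeline identical in both modes, which is what makes the summary a faithful preview.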
Walkthrough
Adds a new --dry-run CLI option across Java, TypeScript, and Python; exposes it on Config; and makes DocumentProcessor generate a summary and return early when dry-run is enabled (no output files written).
Sequence Diagram(s)

sequenceDiagram
autonumber
participant User as User
participant CLI as CLI Parser
participant Config as Config
participant Processor as DocumentProcessor
participant FS as FileSystem
User->>CLI: invoke tool with --dry-run
CLI->>Config: build and set options (dryRun=true)
CLI->>Processor: call generateOutputs(pagesToProcess, config)
Processor->>Config: read isDryRun()
alt dry-run == true
Processor->>Processor: compute totals (pages, elements, formats)
Processor->>CLI: printDryRunSummary(...)
Processor-->>User: "No files written."
else dry-run == false
Processor->>FS: create dirs / open writers
Processor->>FS: write files (images, pdf, json, md, html)
Processor-->>User: outputs written
end
Estimated code review effort: 3 (Moderate) | ~22 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java`:
- Around line 366-380: printDryRunSummary currently computes processedPages
using contents.size(), which is wrong for filtered runs; change the method
signature of printDryRunSummary(String inputPdfName, List<List<IObject>>
contents, Config config) to accept the validated pagesToProcess (int
pagesToProcess) and use that value for processedPages instead of
contents.size(); update all callers (where printDryRunSummary is invoked) to
pass the validated pagesToProcess value produced during page validation so the
summary reflects the actual pages to be processed.
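The suggested signature change can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `buildSummary` is a hypothetical helper that returns the text instead of printing it, `String` stands in for `IObject`, and `contents` is assumed to hold one entry per page of the document (including pages a filter skipped), which is why its size can overstate the processed page count.

```java
import java.util.List;

public class DryRunSummarySketch {
    // pagesToProcess is the validated page count passed in by the caller;
    // contents.size() is deliberately NOT used for the page figure.
    static String buildSummary(String inputPdfName, List<List<String>> contents, int pagesToProcess) {
        int elementCount = 0;
        for (List<String> page : contents) {
            elementCount += page.size(); // skipped pages are assumed to be empty lists
        }
        return String.format("[dry-run] %s: pages=%d, elements=%d. No files written.",
                inputPdfName, pagesToProcess, elementCount);
    }

    public static void main(String[] args) {
        // Three pages in the document, but only two selected by a page filter.
        List<List<String>> contents = List.of(
                List.of("heading", "paragraph"),
                List.of(),
                List.of("table"));
        System.out.println(buildSummary("input.pdf", contents, 2));
    }
}
```

Passing the validated count in as a parameter keeps the summary method free of page-filter logic, so every caller reports the same number that validation produced.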
📒 Files selected for processing (8)
- java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java
- java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIOptionsTest.java
- java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java
- java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
- node/opendataloader-pdf/src/cli-options.generated.ts
- node/opendataloader-pdf/src/convert-options.generated.ts
- options.json
- python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py
…f contents.size()
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@content/docs/cli-options-reference.mdx`:
- Line 42: The docs row for `--dry-run` diverges from the source-of-truth in
options.json; update the `dry-run` entry in options.json so its description
matches the intended text shown in the docs (the sentence beginning "Parse and
analyze the PDF without writing any output files..."), then regenerate the
documentation artifacts (run the project's docs generation/build script) so
content/docs/cli-options-reference.mdx is re-synced; verify the updated
description appears in the regenerated file and commit both the options.json
change and the regenerated docs.
📒 Files selected for processing (2)
- content/docs/cli-options-reference.mdx
- java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
Thanks for the thorough work: bindings synced, tests included, docs updated. Appreciated. We've posted our thoughts on the alternative approach in #412. Closing this PR in favor of that discussion, but if the community identifies a workflow where the temp-directory workaround falls short, we're open to revisiting. Thanks again for the contribution!
For reference, the alternative approach we recommend today:

    opendataloader-pdf input.pdf -o /tmp/odl-preview -f markdown
    cat /tmp/odl-preview/*.md
    rm -rf /tmp/odl-preview

More context in #412.
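For callers driving extraction from Java rather than a shell, the same write-preview-delete pattern might look roughly like the sketch below. It assumes nothing about the library's API: the `extractor` callback is a hypothetical placeholder for whatever writes Markdown into the directory, and `previewInTempDir` is an illustrative name. Only standard `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TempPreviewSketch {
    // Runs the extractor against a fresh temp directory, returns the contents of
    // the .md files it produced, and always deletes the directory afterwards.
    static List<String> previewInTempDir(Consumer<Path> extractor) {
        try {
            Path dir = Files.createTempDirectory("odl-preview");
            try {
                extractor.accept(dir);
                try (Stream<Path> files = Files.list(dir)) {
                    return files.filter(p -> p.toString().endsWith(".md"))
                            .sorted()
                            .map(TempPreviewSketch::read)
                            .collect(Collectors.toList());
                }
            } finally {
                deleteRecursively(dir); // cleanup runs even if extraction fails
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String read(Path file) {
        try {
            return Files.readString(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) {
        // The lambda is a fake extractor that writes one Markdown file.
        List<String> previews = previewInTempDir(dir -> {
            try {
                Files.writeString(dir.resolve("input.md"), "# Extracted text\n");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        previews.forEach(System.out::print);
    }
}
```

Unlike a dry run, this still performs real writes; the `finally` block is what guarantees nothing is left behind.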
Summary by CodeRabbit
- New Features: --dry-run CLI option to preview PDF analysis without writing output files.
- Documentation: --dry-run description and behavior.
- Tests: --dry-run option presence and Config behavior.