After changing CLI options in Java, you must run npm run sync, which regenerates options.json and all Python/Node.js bindings. Forgetting this step silently breaks the wrappers.
When using --enrich-formula or --enrich-picture-description on the hybrid server, the client must use --hybrid-mode full. Otherwise enrichments are silently skipped (they only run on the backend, not in Java).
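Because the skip is silent, a wrapper script could guard against it before invoking the CLI. The following is only a minimal sketch of such a pre-flight check: the class and method names are hypothetical, and it assumes the enrichment flags are passed as bare switches and that --hybrid-mode accepts its value either as a separate token or in --hybrid-mode=full form.

```java
import java.util.Arrays;
import java.util.List;

class HybridEnrichmentCheck {
    // Flags named in the note above; treating them as bare switches is an assumption.
    static final List<String> ENRICH_FLAGS =
            Arrays.asList("--enrich-formula", "--enrich-picture-description");

    // Returns true when an enrichment flag is present but --hybrid-mode is not "full",
    // i.e. the case where the enrichment would be silently skipped.
    static boolean enrichmentWouldBeSkipped(List<String> args) {
        boolean wantsEnrichment = false;
        String hybridMode = null;
        for (int i = 0; i < args.size(); i++) {
            String arg = args.get(i);
            if (ENRICH_FLAGS.contains(arg)) {
                wantsEnrichment = true;
            } else if (arg.equals("--hybrid-mode") && i + 1 < args.size()) {
                hybridMode = args.get(++i); // value given as the next token
            } else if (arg.startsWith("--hybrid-mode=")) {
                hybridMode = arg.substring("--hybrid-mode=".length());
            }
        }
        return wantsEnrichment && !"full".equals(hybridMode);
    }
}
```

A caller would typically fail fast or print a warning when this returns true, rather than let the run finish without enrichments.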
Processing uses ForkJoinPool(availableProcessors) for per-page parallelism. All StaticContainers and StaticLayoutContainers ThreadLocal state must be propagated to worker threads via propagateState.run() — missing a ThreadLocal causes silent data loss or NPE in parallel mode.
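The propagation pattern can be illustrated with a small, self-contained sketch. It does not reproduce the real StaticContainers API: DOCUMENT_KEY, captureState, and processPages are hypothetical stand-ins, and only the propagateState.run() call shape is taken from the note above. The idea is to snapshot the caller's ThreadLocal values once, then re-install them at the start of each page task on the worker thread.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ThreadLocalPropagationSketch {
    // Hypothetical stand-in for one ThreadLocal field held by StaticContainers.
    static final ThreadLocal<String> DOCUMENT_KEY = new ThreadLocal<>();

    // Snapshot the caller's value; the returned Runnable re-installs it on
    // whichever thread calls run().
    static Runnable captureState() {
        final String snapshot = DOCUMENT_KEY.get();
        return () -> DOCUMENT_KEY.set(snapshot);
    }

    static List<String> processPages(int pageCount) {
        final Runnable propagateState = captureState();
        ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
        try {
            // Submitting the parallel stream from inside the pool makes it run on
            // that pool's workers, which start with no ThreadLocal values set.
            Callable<List<String>> job = () -> IntStream.range(0, pageCount)
                    .parallel()
                    .mapToObj(page -> {
                        propagateState.run(); // without this, workers read null
                        return page + ":" + DOCUMENT_KEY.get();
                    })
                    .collect(Collectors.toList());
            return pool.submit(job).join();
        } finally {
            pool.shutdown();
        }
    }
}
```

Every ThreadLocal added later has to be included in the snapshot as well. A field that is left out simply reads as null on worker threads, which matches the data loss or NPE described above.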
Hidden text detection (--filter-hidden-text) is off by default because it requires per-page PDF rendering via ContrastRatioConsumer, which cannot be parallelized safely.
--format values name output file kinds only (json, text, html, pdf, markdown, tagged-pdf). Markdown rendering modifiers and image extraction are separate flags: --markdown-with-html for HTML-in-Markdown, --image-output off|embedded|external for image control. markdown-with-html and markdown-with-images are still accepted as --format values for one major release but emit a deprecation warning and will be removed.
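For scripts that still pass the deprecated values, a small lookup can suggest the replacement flags. This is a hedged sketch rather than the CLI's own migration logic: the class name is hypothetical, and the hint for markdown-with-images deliberately leaves the --image-output value open because the note above does not say which mode the old value corresponds to.

```java
import java.util.Optional;

class DeprecatedFormatHint {
    // Maps a deprecated --format value to a suggested replacement, if any.
    // The suggestions are assumptions derived from the note above.
    static Optional<String> hintFor(String formatValue) {
        switch (formatValue) {
            case "markdown-with-html":
                return Optional.of("--format markdown --markdown-with-html");
            case "markdown-with-images":
                // The matching --image-output mode is not stated; the caller must choose.
                return Optional.of("--format markdown --image-output embedded|external");
            default:
                return Optional.empty(); // not a deprecated value
        }
    }
}
```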
Manual docs live in the opendataloader.org repo. Reference docs (CLI options, JSON schema) are auto-generated by CI at release time and pushed to opendataloader.org. No MDX files are tracked in this repo.
- ./scripts/bench.sh: Run benchmark (auto-clones opendataloader-bench for PDFs and evaluation logic)
- ./scripts/bench.sh --doc-id <id>: Debug specific document
- ./scripts/bench.sh --check-regression: CI mode with threshold check
- Benchmark code lives in opendataloader-bench
- Metrics: NID (reading order), TEDS (table structure), MHS (heading structure), Table Detection F1, Speed