CLAUDE.md

Gotchas

After changing CLI options in Java, you must run npm run sync. This regenerates options.json and all Python/Node.js bindings. Forgetting this step silently breaks the wrappers.

When using --enrich-formula or --enrich-picture-description on the hybrid server, the client must use --hybrid-mode full. Otherwise enrichments are silently skipped (they only run on the backend, not in Java).

Processing uses ForkJoinPool(availableProcessors) for per-page parallelism. All ThreadLocal state in StaticContainers and StaticLayoutContainers must be propagated to worker threads via propagateState.run(). A ThreadLocal that is not propagated causes silent data loss or an NPE in parallel mode.
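Why a missing ThreadLocal fails silently can be seen in a minimal, self-contained sketch. It does not use the project's StaticContainers or propagateState; the class name, the DOCUMENT_NAME field, and processPages are hypothetical stand-ins. The caller's value is captured before work is submitted and is re-set inside each worker task; without that step the workers simply see null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class ThreadLocalPropagationSketch {
    // Hypothetical stand-in for one piece of ThreadLocal state (not the project's real field).
    static final ThreadLocal<String> DOCUMENT_NAME = new ThreadLocal<>();

    // Runs one task per page on a dedicated pool; each task reports the state it can see.
    static List<String> processPages(int pageCount, boolean propagate) throws Exception {
        // Capture the caller's value on the calling thread, before any work is submitted.
        final String captured = DOCUMENT_NAME.get();
        ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int page = 0; page < pageCount; page++) {
                final int pageNumber = page;
                futures.add(pool.submit(() -> {
                    if (propagate) {
                        // Re-establish the captured value on the worker thread.
                        DOCUMENT_NAME.set(captured);
                    }
                    try {
                        return pageNumber + ":" + DOCUMENT_NAME.get();
                    } finally {
                        // Avoid leaking state into the next task that reuses this worker.
                        DOCUMENT_NAME.remove();
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        DOCUMENT_NAME.set("report.pdf");
        System.out.println(processPages(3, false)); // [0:null, 1:null, 2:null]
        System.out.println(processPages(3, true));  // [0:report.pdf, 1:report.pdf, 2:report.pdf]
    }
}
```

The unpropagated run throws nothing; it just produces null values. This matches the "silent data loss" failure mode, and an NPE appears only once some code dereferences the missing value.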

Hidden text detection (--filter-hidden-text) is off by default because it requires per-page PDF rendering via ContrastRatioConsumer, which cannot be parallelized safely.

--format values name output file kinds only (json, text, html, pdf, markdown, tagged-pdf). Markdown rendering modifiers and image extraction are separate flags: --markdown-with-html for HTML-in-Markdown, --image-output off|embedded|external for image control. markdown-with-html and markdown-with-images are still accepted as --format values for one major release but emit a deprecation warning and will be removed.
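The three classes of --format value described above can be sketched as a small classifier. This is an illustration only, not the project's option parser: the class name, the Status enum, and classify are hypothetical, and only the value lists come from the text.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class FormatValueSketch {
    // Output file kinds named in the text above.
    static final Set<String> FILE_KINDS = new LinkedHashSet<>(
            Arrays.asList("json", "text", "html", "pdf", "markdown", "tagged-pdf"));

    // Legacy values that are still accepted for now but trigger a deprecation warning.
    static final Set<String> DEPRECATED_VALUES = new LinkedHashSet<>(
            Arrays.asList("markdown-with-html", "markdown-with-images"));

    // Hypothetical result type for this sketch.
    enum Status { ACCEPTED, DEPRECATED, REJECTED }

    static Status classify(String value) {
        if (FILE_KINDS.contains(value)) {
            return Status.ACCEPTED;
        }
        if (DEPRECATED_VALUES.contains(value)) {
            return Status.DEPRECATED;
        }
        return Status.REJECTED;
    }

    public static void main(String[] args) {
        for (String value : new String[] {"markdown", "markdown-with-html", "docx"}) {
            System.out.println(value + " -> " + classify(value));
        }
    }
}
```

Keeping the deprecated values in their own set means removing them after the transition release is a one-line change.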

Conventions

Manual docs live in the opendataloader.org repo. Reference docs (CLI options, JSON schema) are auto-generated by CI at release time and pushed to opendataloader.org. No MDX files are tracked in this repo.

Benchmark

  • ./scripts/bench.sh — Run benchmark (auto-clones opendataloader-bench for PDFs and evaluation logic)
  • ./scripts/bench.sh --doc-id <id> — Debug specific document
  • ./scripts/bench.sh --check-regression — CI mode with threshold check
  • Benchmark code lives in opendataloader-bench
  • Metrics: NID (reading order), TEDS (table structure), MHS (heading structure), Table Detection F1, Speed