Add markdown table output modes#317
Add markdown table output modes#317StevenVincentOne wants to merge 2 commits intoopendataloader-project:mainfrom
Conversation
3a3431b to
cf8829e
Compare
|
I’ve rewritten the branch commits to use my GitHub-linked email ( |
bundolee
left a comment
There was a problem hiding this comment.
Code Review
Critical (must fix before merge)
1. off mode semantics are contradictory
The PR description says off should "omit both table bodies and captions," but the test testOffTableOutputKeepsCaptionAndOmitsTableBody asserts that captions ARE kept. Meanwhile, collectTableArtifactIndices DOES skip captions when isTableOutputOff() is true. This is a logic conflict — either the test or the implementation is wrong.
2. Verify markdownTableOutput is initialized from Config in the constructor
The constructor must include this.markdownTableOutput = config.getMarkdownTableOutput();. If not wired, the field stays "full" forever and the entire feature is dead code in production.
3. Verify MarkdownHTMLGenerator respects the new modes end-to-end
The writeTable() guard is added, but the cross-page artifact cleanup in writeToMarkdown is inherited. Confirm the HTML generator path works correctly with all three modes.
Important (should fix)
4. Hardcoded domain-specific patterns are a maintainability risk
BENCHMARK_PATTERN, MODEL_NAME_PATTERN, TABLE_HEADER_TEXT_PATTERN, and MODELISH_LINE_PATTERN are all tuned for AI benchmark papers (DeepSeek R1). These will:
- False-positive on PDFs that discuss AI models in prose
- False-negative on any other table type (financial, medical, scientific)
Consider generalizing to structural patterns only, or at minimum document the narrow scope prominently.
5. looksTableArtifactText heuristics are aggressive
- "4+ numbers that aren't narrative" would match legitimate financial/statistical content
shouldSkipDanglingNarrativeFragmentskipping lowercase-starting text is risky (e.g., "iPhone", "eBay", continuation paragraphs after formulas)
6. Thread safety
markdownTableOutputOptions is a mutable HashSet in a static field. Consider Set.of() or Collections.unmodifiableSet(). (Inherited pattern from existing code, but worth noting.)
Suggestions
- Verify
npm run syncwas run per CLAUDE.md to ensure generated TS/Python bindings are correct - Consider separating detection from mutation in
walkTableArtifactRangefor testability - Tests mirror implementation logic rather than testing through the public API — consider exercising
MarkdownGeneratorthrough its public methods instead
Overall
The three-mode architecture is clean and follows project patterns well. full mode being unchanged is the right conservative default. The critical issues (especially the off mode contradiction) must be resolved before merge, and the hardcoded AI-benchmark patterns should be generalized or documented.
🤖 Generated with Claude Code
|
Thanks for this well-structured PR! The Config/CLI plumbing is excellent — follows project conventions perfectly, and the test coverage for Config and CLI layers is thorough. However, I have concerns about the artifact detection heuristics in What we'd like to accept:
What needs rework — domain-specific heuristics (~240 lines): The artifact detection patterns are hardcoded for AI/ML benchmark papers:
As a general-purpose PDF parser, these will cause problems:
Also:
Suggested path forward: Would you be open to this scoped-down approach? |
|
Closing this due to inactivity — it's been 3 weeks since the review with no response, and the branch now has merge conflicts. The Config/CLI plumbing and three-mode architecture were well done. If you'd like to revisit this, feel free to reopen or submit a new PR with the scoped-down approach (removing the domain-specific heuristics). Happy to re-review. Thanks for the contribution! |
Summary
Add native markdown table output modes:
full(current behavior)caption_onlyoffThis lets downstream consumers preserve table captions while omitting table bodies from markdown output when they care more about reading order and prose continuity than inline table fidelity.
Motivation
Complex table regions can degrade nearby markdown output. In voice-first or lightweight reading workflows, caption-only output is often preferable to full table reconstruction.
This PR keeps
fullunchanged and adds safer alternatives for consumers that want to suppress table content during markdown generation rather than stripping it later.What changed
markdownTableOutputconfig withfull,caption_only, andoff--markdown-table-outputfullkeeps current outputcaption_onlykeeps captions and omits table bodiesoffomits both table bodies and captionsReproduction
Source PDF:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement LearningCurrent raw markdown shape in
fullOn
DeepSeek_R1.pdf, the default markdown output aroundTable 4includes flattened benchmark/table spillover between real subsection headings:Expected shape in
caption_onlyResult with this PR
Using
--markdown-table-output caption_onlyon the same PDF yields:It also preserves later captions like
Table 5 | ...while omitting the table body.Notes
fullis intentionally unchanged by design.Validation
Tested with:
mvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am -DskipTests -DfailIfNoTests=false packagemvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am test -DfailIfNoTests=falseDeepSeek_R1.pdfwith:--markdown-table-output full--markdown-table-output caption_only