
Add markdown table output modes #317

Closed
StevenVincentOne wants to merge 2 commits into opendataloader-project:main from StevenVincentOne:codex/table-output-modes

Conversation

@StevenVincentOne
Contributor

Summary

Add native markdown table output modes:

  • full (current behavior)
  • caption_only
  • off

This lets downstream consumers preserve table captions while omitting table bodies from markdown output when they care more about reading order and prose continuity than inline table fidelity.

Motivation

Complex table regions can degrade nearby markdown output. In voice-first or lightweight reading workflows, caption-only output is often preferable to full table reconstruction.

This PR keeps full unchanged and adds safer alternatives for consumers that want to suppress table content during markdown generation rather than stripping it later.

What changed

  • Added markdownTableOutput config with full, caption_only, and off
  • Added CLI support for --markdown-table-output
  • Updated markdown generation so:
    • full keeps current output
    • caption_only keeps captions and omits table bodies
    • off omits both table bodies and captions
  • Added tests for all three modes
  • Added caption-only cleanup for table-adjacent flattened artifacts that can spill across page boundaries in markdown output
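
The three modes listed above could be modeled roughly as follows. This is a minimal sketch rather than the PR's actual code: the enum `TableOutputMode`, its `parse` helper, and the `writesBody`/`writesCaption` accessors are hypothetical names chosen for illustration. Only the option values `full`, `caption_only`, and `off`, and their described behavior, come from the summary.

```java
import java.util.Locale;

// Hypothetical enum; only the three option values come from the PR description.
enum TableOutputMode {
    FULL("full", true, true),
    CAPTION_ONLY("caption_only", false, true),
    OFF("off", false, false);

    private final String cliValue;
    private final boolean writesBody;
    private final boolean writesCaption;

    TableOutputMode(String cliValue, boolean writesBody, boolean writesCaption) {
        this.cliValue = cliValue;
        this.writesBody = writesBody;
        this.writesCaption = writesCaption;
    }

    boolean writesBody() { return writesBody; }

    boolean writesCaption() { return writesCaption; }

    // Parse the value passed to --markdown-table-output; reject unknown values.
    static TableOutputMode parse(String value) {
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        for (TableOutputMode mode : values()) {
            if (mode.cliValue.equals(normalized)) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown markdown table output mode: " + value);
    }
}
```

Keeping both flags on the enum constant puts the semantics of each mode in one place, so a writer only needs to ask the mode whether to emit a body or a caption.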

Reproduction

Source PDF:

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • DeepSeek-AI et al.

Current raw markdown shape in full

On DeepSeek_R1.pdf, the default markdown output around Table 4 includes flattened benchmark/table spillover between real subsection headings:

##### 3.1. DeepSeek-R1 Evaluation

- 1https://example.com
- 2https://example.com
- 3https://example.com

Claude-3.5- GPT-4o DeepSeek OpenAI OpenAI DeepSeek
...
Table 4 | Comparison between DeepSeek-R1 and other representative models.
...
performance of DeepSeek-R1 will improve in the next version...

##### 3.2. Distilled Model Evaluation

Expected shape in caption_only

##### 3.1. DeepSeek-R1 Evaluation

Table 4 | Comparison between DeepSeek-R1 and other representative models.

##### 3.2. Distilled Model Evaluation

Result with this PR

Using --markdown-table-output caption_only on the same PDF yields:

##### 3.1. DeepSeek-R1 Evaluation

Table 4 | Comparison between DeepSeek-R1 and other representative models.

##### 3.2. Distilled Model Evaluation

It also preserves later captions like Table 5 | ... while omitting the table body.

Notes

  • full is intentionally unchanged.
  • This PR focuses on native output control for markdown consumers.
  • It does not try to fully solve default full-fidelity table reconstruction quality.

Validation

Tested with:

  • mvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am -DskipTests -DfailIfNoTests=false package
  • mvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am test -DfailIfNoTests=false
  • a real-document smoke test on DeepSeek_R1.pdf with:
    • --markdown-table-output full
    • --markdown-table-output caption_only

@CLAassistant

CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.

@StevenVincentOne force-pushed the codex/table-output-modes branch from 3a3431b to cf8829e on March 19, 2026 22:21
@StevenVincentOne
Contributor Author

I’ve rewritten the branch commits to use my GitHub-linked email (stevenvincentone@icloud.com) instead of the wrong address that was on the earlier commits, so the CLA check should be able to re-evaluate correctly.

Contributor

@bundolee left a comment


Code Review

Critical (must fix before merge)

1. off mode semantics are contradictory

The PR description says off should "omit both table bodies and captions," but the test testOffTableOutputKeepsCaptionAndOmitsTableBody asserts that captions ARE kept. Meanwhile, collectTableArtifactIndices DOES skip captions when isTableOutputOff() is true. This is a logic conflict — either the test or the implementation is wrong.

2. Verify markdownTableOutput is initialized from Config in the constructor

The constructor must include this.markdownTableOutput = config.getMarkdownTableOutput();. If not wired, the field stays "full" forever and the entire feature is dead code in production.
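
A minimal sketch of the wiring in question, assuming simplified stand-ins for the real classes. `SketchConfig` and `SketchMarkdownGenerator` are hypothetical; only `getMarkdownTableOutput()`, the `markdownTableOutput` field, the `"full"` default, and the `shouldWriteTableBody()` guard are named in this thread.

```java
// Hypothetical stand-in for the project's Config class.
class SketchConfig {
    private String markdownTableOutput = "full"; // default mode per the review comment

    String getMarkdownTableOutput() { return markdownTableOutput; }

    void setMarkdownTableOutput(String value) { this.markdownTableOutput = value; }
}

// Hypothetical stand-in for the markdown generator.
class SketchMarkdownGenerator {
    private String markdownTableOutput = "full";

    SketchMarkdownGenerator(SketchConfig config) {
        // The line the review asks to verify: without it the field keeps
        // its "full" default and the CLI option has no effect.
        this.markdownTableOutput = config.getMarkdownTableOutput();
    }

    // Only "full" emits table bodies; caption_only and off suppress them.
    boolean shouldWriteTableBody() {
        return "full".equals(markdownTableOutput);
    }
}
```

A test that constructs the generator from a non-default config and checks the guard would catch a missing assignment directly.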

3. Verify MarkdownHTMLGenerator respects the new modes end-to-end

The writeTable() guard is added, but the cross-page artifact cleanup in writeToMarkdown is inherited. Confirm the HTML generator path works correctly with all three modes.


Important (should fix)

4. Hardcoded domain-specific patterns are a maintainability risk

BENCHMARK_PATTERN, MODEL_NAME_PATTERN, TABLE_HEADER_TEXT_PATTERN, and MODELISH_LINE_PATTERN are all tuned for AI benchmark papers (DeepSeek R1). These will:

  • False-positive on PDFs that discuss AI models in prose
  • False-negative on any other table type (financial, medical, scientific)

Consider generalizing to structural patterns only, or at minimum document the narrow scope prominently.

5. looksTableArtifactText heuristics are aggressive

  • "4+ numbers that aren't narrative" would match legitimate financial/statistical content
  • shouldSkipDanglingNarrativeFragment skipping lowercase-starting text is risky (e.g., "iPhone", "eBay", continuation paragraphs after formulas)
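
To make the lowercase concern concrete, here is a rough sketch of what a "starts with a lowercase letter" rule would flag. `startsLowercase` is a hypothetical approximation, not the body of `shouldSkipDanglingNarrativeFragment`, which is not quoted in this thread.

```java
class LowercaseHeuristicDemo {
    // Hypothetical approximation of a "lowercase start means dangling fragment" rule.
    static boolean startsLowercase(String text) {
        String trimmed = text.trim();
        return !trimmed.isEmpty() && Character.isLowerCase(trimmed.codePointAt(0));
    }
}
```

Under this rule, "iPhone shipments rose in the third quarter.", "eBay reported higher fees.", and "where x denotes the learning rate." would all be treated as fragments, even though each is legitimate prose.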

6. Thread safety

markdownTableOutputOptions is a mutable HashSet in a static field. Consider Set.of() or Collections.unmodifiableSet(). (Inherited pattern from existing code, but worth noting.)
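
One possible shape for the immutable variant, assuming Java 9 or later for `Set.of()`. The class `TableOutputOptions`, the constant name, and the `isValid` helper are hypothetical; the three values are the ones this PR introduces.

```java
import java.util.Set;

class TableOutputOptions {
    // Immutable set: add/remove throw UnsupportedOperationException,
    // so the shared static state cannot change after class initialization.
    static final Set<String> MARKDOWN_TABLE_OUTPUT_OPTIONS =
            Set.of("full", "caption_only", "off");

    // Set.of() sets throw NullPointerException on contains(null), hence the guard.
    static boolean isValid(String value) {
        return value != null && MARKDOWN_TABLE_OUTPUT_OPTIONS.contains(value);
    }
}
```

On Java 8, wrapping a `HashSet` in `Collections.unmodifiableSet()` gives a similar guarantee as long as no reference to the underlying set escapes.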


Suggestions

  • Verify npm run sync was run per CLAUDE.md to ensure generated TS/Python bindings are correct
  • Consider separating detection from mutation in walkTableArtifactRange for testability
  • Tests mirror implementation logic rather than testing through the public API — consider exercising MarkdownGenerator through its public methods instead

Overall

The three-mode architecture is clean and follows project patterns well. full mode being unchanged is the right conservative default. The critical issues (especially the off mode contradiction) must be resolved before merge, and the hardcoded AI-benchmark patterns should be generalized or documented.

🤖 Generated with Claude Code

@bundolee
Contributor

Thanks for this well-structured PR! The Config/CLI plumbing is excellent — follows project conventions perfectly, and the test coverage for Config and CLI layers is thorough.

However, I have concerns about the artifact detection heuristics in MarkdownGenerator that I'd like to discuss before merging.

What we'd like to accept:

  • Config/CLI option (--markdown-table-output full|caption_only|off) ✅
  • shouldWriteTableBody() guard in MarkdownGenerator and MarkdownHTMLGenerator ✅
  • All tests for Config and CLIOptions ✅
  • Documentation and cross-language binding updates ✅

What needs rework — domain-specific heuristics (~240 lines):

The artifact detection patterns are hardcoded for AI/ML benchmark papers:

  • BENCHMARK_PATTERN: "AIME 2024", "MATH-500", "GPQA", "LiveCodeBench", etc.
  • MODEL_NAME_PATTERN: "GPT-4o", "Claude", "QwQ-32B", etc.
  • TABLE_HEADER_TEXT_PATTERN: "model", "english", "math", "chinese"

As a general-purpose PDF parser, these will cause problems:

  1. False positives — "Model" as a standalone word matches in economics, statistics, and engineering papers. \d{3,4} matches any 3-4 digit number.
  2. Rapid obsolescence — New benchmarks/models appear monthly; these patterns are frozen to early 2025.
  3. Separation of concerns — A PDF parser shouldn't embed domain knowledge about AI benchmarks.
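
The false-positive risk can be illustrated with a small sketch. The PR's actual regexes are not quoted in this thread, so the two patterns below are hypothetical stand-ins built only from the fragments mentioned above (a standalone "model" and `\d{3,4}`), and `looksLikeBenchmarkArtifact` is an invented name.

```java
import java.util.regex.Pattern;

class FalsePositiveDemo {
    // Hypothetical approximations; the PR's real patterns are not shown here.
    static final Pattern MODELISH =
            Pattern.compile("\\bmodel\\b", Pattern.CASE_INSENSITIVE);
    static final Pattern THREE_OR_FOUR_DIGITS = Pattern.compile("\\d{3,4}");

    // Flags a line if either pattern occurs anywhere in it.
    static boolean looksLikeBenchmarkArtifact(String line) {
        return MODELISH.matcher(line).find()
                || THREE_OR_FOUR_DIGITS.matcher(line).find();
    }
}
```

With these stand-ins, ordinary sentences such as "The Solow growth model predicts convergence." and "Revenue rose to 1250 million in 2023." are both flagged, which is the kind of collateral damage a general-purpose parser would need to avoid.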

Also:

  • off mode: code removes captions but test testOffTableOutputKeepsCaptionAndOmitsTableBody asserts captions are kept — contradictory.
  • looksNarrativeText: periods in URLs and decimal numbers trigger false matches.
  • Inline regexes in looksTableArtifactText should be precompiled patterns.
  • Fully qualified java.util.regex.Matcher and java.util.ArrayList should use imports.
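
For the precompiled-pattern point, a sketch of the usual idiom follows. The class name and the regex itself are assumptions for illustration; the thread names a `NUMERIC_ONLY_PATTERN` but does not show its definition.

```java
import java.util.regex.Pattern;

class PrecompiledPatternDemo {
    // Compiled once at class initialization. An inline text.matches("...")
    // call would recompile the regex on every invocation.
    private static final Pattern NUMERIC_ONLY =
            Pattern.compile("[\\d\\s.,%+-]+");

    // True when the whole string consists of digits, whitespace,
    // and common numeric punctuation.
    static boolean isNumericOnly(String text) {
        return NUMERIC_ONLY.matcher(text).matches();
    }
}
```

Importing `java.util.regex.Pattern` at the top of the file, as here, also addresses the fully qualified name comment.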

Suggested path forward:
Keep the clean structural parts (Config, CLI, shouldWriteTableBody() guard, PIPE_ROW_PATTERN, NUMERIC_ONLY_PATTERN). Remove the domain-specific heuristics. If artifact cleanup is needed for specific use cases, consider making it a configurable regex parameter rather than baking it into the library.
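
As a rough illustration of the configurable alternative, caller-supplied regexes could drive the cleanup instead of built-in domain patterns. Everything in this sketch is hypothetical, including the class `ConfigurableArtifactFilter` and its `filter` method; no such option exists in the PR.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ConfigurableArtifactFilter {
    private final List<Pattern> dropPatterns;

    // Callers supply their own regexes; the library itself ships with none.
    ConfigurableArtifactFilter(List<String> regexes) {
        this.dropPatterns = regexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    // Keep only the lines that match none of the configured patterns.
    List<String> filter(List<String> lines) {
        return lines.stream()
                .filter(line -> dropPatterns.stream()
                        .noneMatch(p -> p.matcher(line).find()))
                .collect(Collectors.toList());
    }
}
```

A consumer working with benchmark papers could pass patterns such as `\bAIME\b`, while the default (an empty list) would leave the output untouched.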

Would you be open to this scoped-down approach?

@bundolee
Contributor

Closing this due to inactivity — it's been 3 weeks since the review with no response, and the branch now has merge conflicts.

The Config/CLI plumbing and three-mode architecture were well done. If you'd like to revisit this, feel free to reopen or submit a new PR with the scoped-down approach (removing the domain-specific heuristics). Happy to re-review.

Thanks for the contribution!

@bundolee closed this on Apr 15, 2026

3 participants