Add markdown table output modes by StevenVincentOne · Pull Request #317 · opendataloader-project/opendataloader-pdf

StevenVincentOne · 2026-03-19T21:59:12Z

Summary

Add native markdown table output modes:

full (current behavior)
caption_only
off

This lets downstream consumers preserve table captions while omitting table bodies from markdown output when they care more about reading order and prose continuity than inline table fidelity.

Motivation

Complex table regions can degrade nearby markdown output. In voice-first or lightweight reading workflows, caption-only output is often preferable to full table reconstruction.

This PR keeps full unchanged and adds safer alternatives for consumers that want to suppress table content during markdown generation rather than stripping it later.

What changed

Added markdownTableOutput config with full, caption_only, and off
Added CLI support for --markdown-table-output
Updated markdown generation so:
- full keeps current output
- caption_only keeps captions and omits table bodies
- off omits both table bodies and captions
Added tests for all three modes
Added caption-only cleanup for table-adjacent flattened artifacts that can spill across page boundaries in markdown output
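The three modes listed above can be pictured as a small enum plus a render guard. This is only a sketch of the described behavior: `MarkdownTableOutput`, `fromCliValue`, and `TableRenderSketch` are hypothetical names chosen for illustration and are not taken from the PR's source.

```java
import java.util.Locale;

// Hypothetical sketch of the three modes; names are illustrative, not the PR's actual code.
enum MarkdownTableOutput {
    FULL(true, true),           // current behavior: body and caption
    CAPTION_ONLY(false, true),  // caption kept, body omitted
    OFF(false, false);          // both omitted, per the PR description

    private final boolean writesBody;
    private final boolean writesCaption;

    MarkdownTableOutput(boolean writesBody, boolean writesCaption) {
        this.writesBody = writesBody;
        this.writesCaption = writesCaption;
    }

    boolean writesBody() { return writesBody; }

    boolean writesCaption() { return writesCaption; }

    // Maps a CLI value such as "caption_only" to the enum constant.
    // Throws IllegalArgumentException for unknown values.
    static MarkdownTableOutput fromCliValue(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}

final class TableRenderSketch {
    // Emits only the parts of a table that the selected mode allows.
    static String render(MarkdownTableOutput mode, String caption, String body) {
        StringBuilder out = new StringBuilder();
        if (mode.writesBody()) {
            out.append(body).append("\n\n");
        }
        if (mode.writesCaption()) {
            out.append(caption).append("\n");
        }
        return out.toString();
    }
}
```

Encoding the body/caption decision in the enum keeps each generator's guard to a single boolean check, which is one way to avoid the modes drifting apart between the markdown and HTML paths.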

Reproduction

Source PDF:

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI et al.

Current raw markdown shape in `full`

On DeepSeek_R1.pdf, the default markdown output around Table 4 includes flattened benchmark/table spillover between real subsection headings:

##### 3.1. DeepSeek-R1 Evaluation

- 1https://example.com
- 2https://example.com
- 3https://example.com

Claude-3.5- GPT-4o DeepSeek OpenAI OpenAI DeepSeek
...
Table 4 | Comparison between DeepSeek-R1 and other representative models.
...
performance of DeepSeek-R1 will improve in the next version...

##### 3.2. Distilled Model Evaluation

Expected shape in `caption_only`

##### 3.1. DeepSeek-R1 Evaluation

Table 4 | Comparison between DeepSeek-R1 and other representative models.

##### 3.2. Distilled Model Evaluation

Result with this PR

Using --markdown-table-output caption_only on the same PDF yields:

##### 3.1. DeepSeek-R1 Evaluation

Table 4 | Comparison between DeepSeek-R1 and other representative models.

##### 3.2. Distilled Model Evaluation

It also preserves later captions like Table 5 | ... while omitting the table body.

Notes

full is intentionally unchanged.
This PR focuses on native output control for markdown consumers.
It does not try to fully solve default full-fidelity table reconstruction quality.

Validation

Tested with:

mvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am -DskipTests -DfailIfNoTests=false package
mvn -f java/pom.xml -pl opendataloader-pdf-core,opendataloader-pdf-cli -am test -DfailIfNoTests=false
real-doc smoke on DeepSeek_R1.pdf with:
- --markdown-table-output full
- --markdown-table-output caption_only

CLAassistant · 2026-03-19T21:59:20Z

All committers have signed the CLA.

StevenVincentOne · 2026-03-19T22:21:37Z

I’ve rewritten the branch commits to use my GitHub-linked email (stevenvincentone@icloud.com) instead of the wrong address that was on the earlier commits, so the CLA check should be able to re-evaluate correctly.

bundolee · reviewed 2026-03-20

Code Review

Critical (must fix before merge)

1. off mode semantics are contradictory

The PR description says off should "omit both table bodies and captions," but the test testOffTableOutputKeepsCaptionAndOmitsTableBody asserts that captions ARE kept. Meanwhile, collectTableArtifactIndices DOES skip captions when isTableOutputOff() is true. This is a logic conflict — either the test or the implementation is wrong.

2. Verify markdownTableOutput is initialized from Config in the constructor

The constructor must include this.markdownTableOutput = config.getMarkdownTableOutput();. If not wired, the field stays "full" forever and the entire feature is dead code in production.
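A minimal check for this wiring might look like the following. `ConfigSketch` and `GeneratorSketch` are hypothetical stand-ins for the project's `Config` and `MarkdownGenerator`; only `getMarkdownTableOutput()` and `shouldWriteTableBody()` are names mentioned in this thread, and their real signatures are assumed.

```java
// Hypothetical stand-in for the project's Config class.
final class ConfigSketch {
    private String markdownTableOutput = "full";

    String getMarkdownTableOutput() { return markdownTableOutput; }

    void setMarkdownTableOutput(String value) { this.markdownTableOutput = value; }
}

// Hypothetical stand-in for MarkdownGenerator.
final class GeneratorSketch {
    private final String markdownTableOutput;

    GeneratorSketch(ConfigSketch config) {
        // The line the review asks about: copy the setting from Config.
        // If the field were instead initialized to a constant "full",
        // the CLI option would have no effect on generated output.
        this.markdownTableOutput = config.getMarkdownTableOutput();
    }

    boolean shouldWriteTableBody() {
        return "full".equals(markdownTableOutput);
    }
}
```

A test that sets a non-default mode on the config and then asserts on the generator's behavior would catch the dead-code case described above.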

3. Verify MarkdownHTMLGenerator respects the new modes end-to-end

The writeTable() guard is added, but the cross-page artifact cleanup in writeToMarkdown is inherited. Confirm the HTML generator path works correctly with all three modes.

Important (should fix)

4. Hardcoded domain-specific patterns are a maintainability risk

BENCHMARK_PATTERN, MODEL_NAME_PATTERN, TABLE_HEADER_TEXT_PATTERN, and MODELISH_LINE_PATTERN are all tuned for AI benchmark papers (DeepSeek R1). These will:

False-positive on PDFs that discuss AI models in prose
False-negative on any other table type (financial, medical, scientific)

Consider generalizing to structural patterns only, or at minimum document the narrow scope prominently.

5. looksTableArtifactText heuristics are aggressive

"4+ numbers that aren't narrative" would match legitimate financial/statistical content
shouldSkipDanglingNarrativeFragment skipping lowercase-starting text is risky (e.g., "iPhone", "eBay", continuation paragraphs after formulas)

6. Thread safety

markdownTableOutputOptions is a mutable HashSet in a static field. Consider Set.of() or Collections.unmodifiableSet(). (Inherited pattern from existing code, but worth noting.)
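As a sketch of the suggested change, the option set could be declared with `Set.of()`, which returns an unmodifiable set (Java 9 or later; `Collections.unmodifiableSet` is the Java 8 alternative). The class and constant names here are hypothetical.

```java
import java.util.Set;

final class TableOutputOptionsSketch {
    // Unmodifiable: any add/remove attempt throws UnsupportedOperationException,
    // so the static field is safe to share across threads.
    static final Set<String> MARKDOWN_TABLE_OUTPUT_OPTIONS =
            Set.of("full", "caption_only", "off");

    static boolean isValid(String value) {
        // Set.of(...) rejects null lookups with a NullPointerException, so guard first.
        return value != null && MARKDOWN_TABLE_OUTPUT_OPTIONS.contains(value);
    }
}
```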

Suggestions

Verify npm run sync was run per CLAUDE.md to ensure generated TS/Python bindings are correct
Consider separating detection from mutation in walkTableArtifactRange for testability
Tests mirror implementation logic rather than testing through the public API — consider exercising MarkdownGenerator through its public methods instead

Overall

The three-mode architecture is clean and follows project patterns well. full mode being unchanged is the right conservative default. The critical issues (especially the off mode contradiction) must be resolved before merge, and the hardcoded AI-benchmark patterns should be generalized or documented.

🤖 Generated with Claude Code

bundolee · 2026-03-24T01:43:41Z

Thanks for this well-structured PR! The Config/CLI plumbing is excellent — follows project conventions perfectly, and the test coverage for Config and CLI layers is thorough.

However, I have concerns about the artifact detection heuristics in MarkdownGenerator that I'd like to discuss before merging.

What we'd like to accept:

Config/CLI option (--markdown-table-output full|caption_only|off) ✅
shouldWriteTableBody() guard in MarkdownGenerator and MarkdownHTMLGenerator ✅
All tests for Config and CLIOptions ✅
Documentation and cross-language binding updates ✅

What needs rework — domain-specific heuristics (~240 lines):

The artifact detection patterns are hardcoded for AI/ML benchmark papers:

BENCHMARK_PATTERN: "AIME 2024", "MATH-500", "GPQA", "LiveCodeBench", etc.
MODEL_NAME_PATTERN: "GPT-4o", "Claude", "QwQ-32B", etc.
TABLE_HEADER_TEXT_PATTERN: "model", "english", "math", "chinese"

As a general-purpose PDF parser, these will cause problems:

False positives — "Model" as a standalone word matches in economics, statistics, and engineering papers. \d{3,4} matches any 3-4 digit number.
Rapid obsolescence — New benchmarks/models appear monthly; these patterns are frozen to early 2025.
Separation of concerns — A PDF parser shouldn't embed domain knowledge about AI benchmarks.
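To make the false-positive concern concrete, here is a small sketch. The two regexes are assumed approximations of the patterns described above, since the PR's actual expressions are not quoted in this thread, and `looksLikeBenchmarkArtifact` is a hypothetical helper.

```java
import java.util.regex.Pattern;

final class HeuristicFalsePositiveSketch {
    // Assumed approximation of a header-word pattern ("model", "english", ...).
    static final Pattern HEADER_WORD =
            Pattern.compile("\\b(model|english|math|chinese)\\b", Pattern.CASE_INSENSITIVE);

    // Assumed approximation of the 3-4 digit number pattern.
    static final Pattern SHORT_NUMBER = Pattern.compile("\\d{3,4}");

    // Flags a line when both patterns are found anywhere in it.
    static boolean looksLikeBenchmarkArtifact(String line) {
        return HEADER_WORD.matcher(line).find() && SHORT_NUMBER.matcher(line).find();
    }
}
```

With these assumed patterns, an ordinary sentence such as "The Solow growth model was calibrated on 1995 data." is flagged, because "model" and the year both match.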

Also:

off mode: code removes captions but test testOffTableOutputKeepsCaptionAndOmitsTableBody asserts captions are kept — contradictory.
looksNarrativeText: period in URLs/decimals triggers false matches.
Inline regexes in looksTableArtifactText should be precompiled patterns.
Fully qualified java.util.regex.Matcher and java.util.ArrayList should use imports.
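The last two points could look like the sketch below: the regex is compiled once into a `static final` field and the types are imported rather than fully qualified. The pattern and the method `numberTokens` are illustrative only and are not taken from `looksTableArtifactText`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberTokenSketch {
    // Compiled once at class load instead of on every call.
    private static final Pattern NUMBER_TOKEN = Pattern.compile("-?\\d+(?:\\.\\d+)?");

    // Returns each integer or decimal token found in the text, in order.
    static List<String> numberTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = NUMBER_TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

Treating a decimal such as `79.8` as one token also avoids reading its period as a sentence boundary, which is the kind of mismatch noted for `looksNarrativeText`.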

Suggested path forward:
Keep the clean structural parts (Config, CLI, shouldWriteTableBody() guard, PIPE_ROW_PATTERN, NUMERIC_ONLY_PATTERN). Remove the domain-specific heuristics. If artifact cleanup is needed for specific use cases, consider making it a configurable regex parameter rather than baking it into the library.

Would you be open to this scoped-down approach?
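One possible shape for the configurable regex parameter suggested above is sketched here. The class name, the constructor argument, and the idea of dropping whole lines are all assumptions made for illustration; no such option exists in the project as described in this thread.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical: artifact cleanup driven by a user-supplied regex
// rather than patterns built into the library.
final class ConfigurableArtifactFilterSketch {
    private final Pattern dropPattern; // null means cleanup is disabled

    ConfigurableArtifactFilterSketch(String userRegex) {
        this.dropPattern = (userRegex == null || userRegex.isEmpty())
                ? null
                : Pattern.compile(userRegex);
    }

    // Removes every line in which the user's pattern is found.
    List<String> filter(List<String> lines) {
        if (dropPattern == null) {
            return lines;
        }
        return lines.stream()
                .filter(line -> !dropPattern.matcher(line).find())
                .collect(Collectors.toList());
    }
}
```

A consumer working with AI benchmark papers could then pass something like `^(GPT|Claude)` themselves, keeping that domain knowledge out of the parser.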

bundolee · 2026-04-15T10:27:00Z

Closing this due to inactivity — it's been 3 weeks since the review with no response, and the branch now has merge conflicts.

The Config/CLI plumbing and three-mode architecture were well done. If you'd like to revisit this, feel free to reopen or submit a new PR with the scoped-down approach (removing the domain-specific heuristics). Happy to re-review.

Thanks for the contribution!

StevenVincentOne requested review from LonelyMidoriya, MaximPlusov, bundolee and hyunhee-jo as code owners · March 19, 2026 21:59

StevenVincentOne mentioned this pull request · Mar 19, 2026
Improve full markdown output around complex table regions #318 (Open)

StevenVincentOne added 2 commits · March 19, 2026 15:21
e3ed261 feat: add markdown table output modes
cf8829e fix: suppress table caption artifacts across page boundaries

StevenVincentOne force-pushed the codex/table-output-modes branch from 3a3431b to cf8829e · March 19, 2026 22:21

StevenVincentOne mentioned this pull request · Mar 19, 2026
fix: recover flattened benchmark tables in markdown output #262 (Closed)

bundolee reviewed · Mar 20, 2026

bundolee closed this · Apr 15, 2026

StevenVincentOne wants to merge 2 commits into opendataloader-project:main from StevenVincentOne:codex/table-output-modes
