Hi @ethnhll, great question! Thanks for diving into the paper and comparing it with our implementation. You're right that our XY-Cut++ implementation doesn't include the paper's cross-modal matching step.

Why no cross-modal matching?

The paper's cross-modal matching is designed to align visual (image-based) and textual modalities — essentially reconciling what an OCR/vision model "sees" with what text extraction produces. In opendataloader-pdf, we work directly with structured PDF content (tagged PDF structure trees, text extracted from the PDF stream, etc.), so we already have reliable geometric and structural information without needing to bridge two separate modalities. The cross-modal alignment step becomes redundant when your source of truth is the PDF structure itself rather than a combination of vision + OCR pipelines.

What we do instead (Phase 4 — geometric remerging)

Our implementation handles cross-layout elements (headers, footers, spanning titles) through a purely geometric approach.
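As a rough idea of what such a purely geometric remerge can look like, here is a minimal sketch. It is illustrative only, not opendataloader-pdf's actual code: the band sizes, the spanning-width rule, and the function names are assumptions, and boxes are plain (x0, y0, x1, y1) tuples with y growing downward.

```python
# Illustrative sketch of a purely geometric cross-layout remerge.
# Not opendataloader-pdf code: band sizes and the spanning-width rule
# are assumptions chosen only to show the shape of the idea.

def split_cross_layout(boxes, page_width, page_height,
                       span_ratio=0.85, band_ratio=0.08):
    """Separate header/footer/spanning boxes from regular content.
    Boxes are (x0, y0, x1, y1) tuples with y growing downward."""
    cross, regular = [], []
    for b in boxes:
        x0, y0, x1, y1 = b
        wide = (x1 - x0) >= span_ratio * page_width
        in_header = y1 <= band_ratio * page_height
        in_footer = y0 >= (1 - band_ratio) * page_height
        (cross if (wide or in_header or in_footer) else regular).append(b)
    return cross, regular

def remerge(ordered_regular, cross):
    """Reinsert cross-layout boxes into an already ordered list,
    purely by vertical position (no semantic labels needed)."""
    result = list(ordered_regular)
    for c in sorted(cross, key=lambda b: b[1]):
        # place the element before the first box that starts below it
        idx = next((i for i, r in enumerate(result) if r[1] >= c[1]), len(result))
        result.insert(idx, c)
    return result
```

Because the coordinates come straight from the PDF content, thresholds like these behave predictably, which is the point made below about not having vision-model noise.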
This works well for our use case because the spatial coordinates from PDF extraction are precise — we don't have the noise and uncertainty that would come from vision-based detection.

Regarding semantic-specific tuned weights: this is indeed one of the more opaque parts of the paper. Since we skipped cross-modal matching entirely, we didn't need to implement those weights. Our density threshold (…)

For a more accurate and detailed answer, I'd like to invite @bundolee, who led this implementation, to share their insights. If you have more specific questions about particular parts of the algorithm, feel free to ask!
Hi @ethnhll, great question. As @hnc-jglee mentioned, I led this implementation. Let me share the context behind our design decisions.

Background

We started with the goal of faithfully implementing the full 4-stage pipeline from the XY-Cut++ paper (arXiv:2504.10258). opendataloader-pdf can infer semantic information from PDFs (titles, body text, tables, figures, etc.), so the paper's semantic label-based approach was technically feasible for us. However, when we applied the full pipeline and ran benchmarks, enabling Stage 1 (Layout Detection) and Stage 4 (Cross-Modal Matching) actually decreased our scores. We went through multiple rounds of parameter tuning and architectural adjustments but couldn't achieve stable improvements, and eventually converged on the current form — essentially Phase 3c (recursive segmentation) only.
The paper's 4-stage pipeline

- Stage 1: Layout Detection
- Stage 2: Pre-Mask (Cross-Layout Detection)
- Stage 3: Multi-Granularity Segmentation
- Stage 4: Cross-Modal Matching
What happened at each stage

Stage 1: Layout Detection — Tried, then removed

opendataloader-pdf infers semantic types (title, body, table, figure, etc.) from PDF structure trees. The approach differs from the paper's PP-DocLayout (structure-based vs. vision-based), but both can provide semantic labels. When we used these labels to drive Stages 2–4, benchmark scores did not improve and in some cases declined.
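To illustrate what structure-based labeling means here, a toy sketch of the idea follows. The tag names are standard tagged-PDF structure types, but the mapping itself is an assumption for illustration, not the project's actual inference logic.

```python
# Illustrative only: map tagged-PDF structure-tree roles to coarse
# layout labels. Real structure trees are richer (role maps, nesting,
# artifacts), and opendataloader-pdf's actual inference is not shown here.

ROLE_TO_LABEL = {
    "H1": "title", "H2": "title", "H3": "title",
    "P": "body",
    "Table": "table",
    "Figure": "figure",
    "L": "list",
    "Caption": "caption",
}

def label_from_role(role, default="body"):
    """Coarse semantic label for a structure-tree element role."""
    return ROLE_TO_LABEL.get(role, default)
```

The point is simply that the labels come from the document's own structure tree rather than from a vision model, so they are available without any image processing.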
Stage 2: Pre-Mask (Cross-Layout Detection) — Disabled

With lower beta values, elements were misclassified as cross-layout, pulled out of recursive segmentation, and reinserted at incorrect positions. Stage 4 is supposed to compensate for this, but since Stage 4 itself wasn't working reliably in our environment, we disabled both together.
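To make the role of beta concrete, here is a minimal sketch of a pre-mask style check. It assumes beta acts as a width-span threshold relative to the page width; the paper's exact criterion may differ, and the code is illustrative only, not anything that exists in the repository.

```python
# Illustrative sketch only, not opendataloader-pdf code.
# Assumption: beta is a fraction of the page width, and boxes wider than
# beta * page_width are flagged as cross-layout and pulled out before
# recursive segmentation. Lowering beta flags more boxes.

def pre_mask(boxes, page_width, beta=0.9):
    """Split (x0, y0, x1, y1) boxes into (cross_layout, regular) by width."""
    cross_layout, regular = [], []
    for x0, y0, x1, y1 in boxes:
        if (x1 - x0) >= beta * page_width:
            cross_layout.append((x0, y0, x1, y1))  # spanning titles, headers, footers
        else:
            regular.append((x0, y0, x1, y1))
    return cross_layout, regular
```

With beta at 0.5 instead of 0.9, half-width blocks on a two-column page would be flagged as cross-layout, skip recursive segmentation, and then depend entirely on Stage 4 to put them back in the right place, which matches the failure mode described above.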
Stage 3: Multi-Granularity Segmentation — Core of our implementation

Phase 3c was the only part that showed stable benchmark improvements. Projection-based gap detection finds the best cut in both directions, then splits along the larger gap. We later hit an infinite recursion bug (#179) caused by a mismatch between gap detection (using object edges) and split assignment (using object centers). Fixed by adding a progress guarantee — fallback to Y-then-X sort when a split produces only one group — and a minimum gap threshold (5pt).
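For concreteness, here is a compact sketch of that recursion (illustrative Python, not the actual implementation). Boxes are (x0, y0, x1, y1) tuples with y growing downward, and the sketch includes the two safeguards described above: the minimum gap threshold and the fallback when a split makes no progress.

```python
# Sketch of adaptive recursive XY-cut, following the description above.
# Not opendataloader-pdf code; boxes are plain (x0, y0, x1, y1) tuples
# with y growing downward (top of page = smaller y).

MIN_GAP = 5.0  # points; gaps narrower than this are not used as cuts

def largest_gap(boxes, lo_idx, hi_idx):
    """Largest empty interval between projected object edges on one axis.
    Returns (gap_size, cut_position); gap_size is 0.0 if nothing is found."""
    spans = sorted((b[lo_idx], b[hi_idx]) for b in boxes)
    best, cut, covered_to = 0.0, None, spans[0][1]
    for lo, hi in spans[1:]:
        gap = lo - covered_to
        if gap > best:
            best, cut = gap, covered_to + gap / 2.0
        covered_to = max(covered_to, hi)
    return best, cut

def xy_cut(boxes):
    """Return boxes in reading order via recursive X/Y splitting."""
    if len(boxes) <= 1:
        return list(boxes)

    x_gap, x_cut = largest_gap(boxes, 0, 2)   # vertical cut candidate
    y_gap, y_cut = largest_gap(boxes, 1, 3)   # horizontal cut candidate

    if max(x_gap, y_gap) >= MIN_GAP:
        # Split along the axis with the larger gap; assign by box center.
        if y_gap >= x_gap:
            first  = [b for b in boxes if (b[1] + b[3]) / 2.0 <  y_cut]
            second = [b for b in boxes if (b[1] + b[3]) / 2.0 >= y_cut]
        else:
            first  = [b for b in boxes if (b[0] + b[2]) / 2.0 <  x_cut]
            second = [b for b in boxes if (b[0] + b[2]) / 2.0 >= x_cut]
        # Progress guarantee: only recurse if the split actually
        # produced two non-empty groups.
        if first and second:
            return xy_cut(first) + xy_cut(second)

    # Fallback: plain Y-then-X sort (top-to-bottom, left-to-right).
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

The description above notes that #179 came from finding gaps using object edges while assigning boxes by their centers; the one-group check and the fallback sort are the progress guarantee that keeps such a mismatch from recursing forever.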
Stage 4: Cross-Modal Matching — Tried, then removed

This is the paper's most complex contribution. We implemented reinsertion logic using our semantic information and benchmarked it. Scores dropped. We tried parameter adjustments and several variations but couldn't produce stable improvements. From what I recall, the pattern was that inaccuracies in cross-layout detection combined with the reinsertion logic to amplify errors rather than correct them. The experimental code was ultimately removed — the current codebase has no traces of it — and the experiment records have been lost. This is something we plan to document more carefully when we revisit it.

Current state

Our implementation started from the XY-Cut++ paper but converged through experimentation to Phase 3c (adaptive recursive segmentation) as the only reliably effective component. The code comments reflect this: it is honestly closer to an enhanced XY-Cut inspired by the paper's structure than a full XY-Cut++ implementation.

If you're implementing this yourself
One specific thing I'd highlight: Stages 2 and 4 are tightly coupled. Cross-layout detection without reliable reinsertion makes things worse, not better. If you're going to implement one, plan for both. Happy to discuss further.
Hi, I was wondering if you could provide some background on the differences between the XY-Cut++ paper and your implementation of it. Chiefly, I am looking at the remerging of elements at the end into a sorted order. The paper goes through a cross-modal matching approach that I see is not part of opendataloader's implementation.
The reason I'm asking is, I'd like to understand your thought process, or how you came to choose the implementation that you did. I have been struggling quite a bit with trying to interpret a specific part of the cross-modal matching section in the paper (semantic-specific tuned weights). Any insights you might have to share about that portion of the paper would also be a great help :)