Hi @ethnhll, great question! Thanks for diving into the paper and comparing it with our implementation. You're right that our XY-Cut++ implementation doesn't include the paper's cross-modal matching step.

Why no cross-modal matching?

The paper's cross-modal matching is designed to align visual (image-based) and textual modalities — essentially reconciling what an OCR/vision model "sees" with what text extraction produces. In opendataloader-pdf, we work directly with structured PDF content (tagged PDF structure trees, text extracted from the PDF stream, etc.), so we already have reliable geometric and structural information without needing to bridge two separate modalities. The cross-modal alignment step becomes redundant when your source of truth is the PDF structure itself rather than a combination of vision + OCR pipelines.

What we do instead (Phase 4 — geometric remerging)

Our implementation handles cross-layout elements (headers, footers, spanning titles) through a purely geometric approach.
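As a rough idea of what such a purely geometric remerge can look like, here is a minimal sketch. It is illustrative only, not opendataloader-pdf's actual code: the band sizes, the spanning-width rule, and the function names are assumptions, and boxes are plain (x0, y0, x1, y1) tuples with y growing downward.

```python
# Illustrative sketch of a purely geometric cross-layout remerge.
# Not opendataloader-pdf code: band sizes and the spanning-width rule
# are assumptions chosen only to show the shape of the idea.

def split_cross_layout(boxes, page_width, page_height,
                       span_ratio=0.85, band_ratio=0.08):
    """Separate header/footer/spanning boxes from regular content.
    Boxes are (x0, y0, x1, y1) tuples with y growing downward."""
    cross, regular = [], []
    for b in boxes:
        x0, y0, x1, y1 = b
        wide = (x1 - x0) >= span_ratio * page_width
        in_header = y1 <= band_ratio * page_height
        in_footer = y0 >= (1 - band_ratio) * page_height
        (cross if (wide or in_header or in_footer) else regular).append(b)
    return cross, regular

def remerge(ordered_regular, cross):
    """Reinsert cross-layout boxes into an already ordered list,
    purely by vertical position (no semantic labels needed)."""
    result = list(ordered_regular)
    for c in sorted(cross, key=lambda b: b[1]):
        # place the element before the first box that starts below it
        idx = next((i for i, r in enumerate(result) if r[1] >= c[1]), len(result))
        result.insert(idx, c)
    return result
```

Because the coordinates come straight from the PDF content, thresholds like these behave predictably, which is the point made below about not having vision-model noise.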
This works well for our use case because the spatial coordinates from PDF extraction are precise — we don't have the noise and uncertainty that would come from vision-based detection.

Regarding semantic-specific tuned weights: this is indeed one of the more opaque parts of the paper. Since we skipped cross-modal matching entirely, we didn't need to implement those weights. Our density threshold (…)

For a more accurate and detailed answer, I'd like to invite @bundolee, who led this implementation, to share their insights. If you have more specific questions about particular parts of the algorithm, feel free to ask!
Hi @ethnhll, great question. As @hnc-jglee mentioned, I led this implementation. Let me share the context behind our design decisions.

Background

We started with the goal of faithfully implementing the full 4-stage pipeline from the XY-Cut++ paper (arXiv:2504.10258). opendataloader-pdf can infer semantic information from PDFs (titles, body text, tables, figures, etc.), so the paper's semantic label-based approach was technically feasible for us. However, when we applied the full pipeline and ran benchmarks, enabling Stage 1 (Layout Detection) and Stage 4 (Cross-Modal Matching) actually decreased our scores. We went through multiple rounds of parameter tuning and architectural adjustments but couldn't achieve stable improvements, and eventually converged on the current form — essentially Phase 3c (recursive segmentation) only.
The paper's 4-stage pipeline

- Stage 1: Layout Detection
- Stage 2: Pre-Mask (Cross-Layout Detection)
- Stage 3: Multi-Granularity Segmentation
- Stage 4: Cross-Modal Matching
What happened at each stage

Stage 1: Layout Detection — Tried, then removed

opendataloader-pdf infers semantic types (title, body, table, figure, etc.) from PDF structure trees. The approach differs from the paper's PP-DocLayout (structure-based vs. vision-based), but both can provide semantic labels. When we used these labels to drive Stages 2–4, benchmark scores did not improve and in some cases declined.
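To illustrate what structure-based labeling means here, a toy sketch of the idea follows. The tag names are standard tagged-PDF structure types, but the mapping itself is an assumption for illustration, not the project's actual inference logic.

```python
# Illustrative only: map tagged-PDF structure-tree roles to coarse
# layout labels. Real structure trees are richer (role maps, nesting,
# artifacts), and opendataloader-pdf's actual inference is not shown here.

ROLE_TO_LABEL = {
    "H1": "title", "H2": "title", "H3": "title",
    "P": "body",
    "Table": "table",
    "Figure": "figure",
    "L": "list",
    "Caption": "caption",
}

def label_from_role(role, default="body"):
    """Coarse semantic label for a structure-tree element role."""
    return ROLE_TO_LABEL.get(role, default)
```

The point is simply that the labels come from the document's own structure tree rather than from a vision model, so they are available without any image processing.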
Stage 2: Pre-Mask (Cross-Layout Detection) — Disabled

With lower beta values, elements were misclassified as cross-layout, pulled out of recursive segmentation, and reinserted at incorrect positions. Stage 4 is supposed to compensate for this, but since Stage 4 itself wasn't working reliably in our environment, we disabled both together.
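To make the role of beta concrete, here is a minimal sketch of a pre-mask style check. It assumes beta acts as a width-span threshold relative to the page width; the paper's exact criterion may differ, and the code is illustrative only, not anything that exists in the repository.

```python
# Illustrative sketch only, not opendataloader-pdf code.
# Assumption: beta is a fraction of the page width, and boxes wider than
# beta * page_width are flagged as cross-layout and pulled out before
# recursive segmentation. Lowering beta flags more boxes.

def pre_mask(boxes, page_width, beta=0.9):
    """Split (x0, y0, x1, y1) boxes into (cross_layout, regular) by width."""
    cross_layout, regular = [], []
    for x0, y0, x1, y1 in boxes:
        if (x1 - x0) >= beta * page_width:
            cross_layout.append((x0, y0, x1, y1))  # spanning titles, headers, footers
        else:
            regular.append((x0, y0, x1, y1))
    return cross_layout, regular
```

With beta at 0.5 instead of 0.9, half-width blocks on a two-column page would be flagged as cross-layout, skip recursive segmentation, and then depend entirely on Stage 4 to put them back in the right place, which matches the failure mode described above.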
Stage 3: Multi-Granularity Segmentation — Core of our implementation

Phase 3c was the only part that showed stable benchmark improvements. Projection-based gap detection finds the best cut in both directions, then splits along the larger gap. We later hit an infinite recursion bug (#179) caused by a mismatch between gap detection (using object edges) and split assignment (using object centers). Fixed by adding a progress guarantee — fallback to Y-then-X sort when a split produces only one group — and a minimum gap threshold (5pt).
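For concreteness, here is a compact sketch of that recursion (illustrative Python, not the actual implementation). Boxes are (x0, y0, x1, y1) tuples with y growing downward, and the sketch includes the two safeguards described above: the minimum gap threshold and the fallback when a split makes no progress.

```python
# Sketch of adaptive recursive XY-cut, following the description above.
# Not opendataloader-pdf code; boxes are plain (x0, y0, x1, y1) tuples
# with y growing downward (top of page = smaller y).

MIN_GAP = 5.0  # points; gaps narrower than this are not used as cuts

def largest_gap(boxes, lo_idx, hi_idx):
    """Largest empty interval between projected object edges on one axis.
    Returns (gap_size, cut_position); gap_size is 0.0 if nothing is found."""
    spans = sorted((b[lo_idx], b[hi_idx]) for b in boxes)
    best, cut, covered_to = 0.0, None, spans[0][1]
    for lo, hi in spans[1:]:
        gap = lo - covered_to
        if gap > best:
            best, cut = gap, covered_to + gap / 2.0
        covered_to = max(covered_to, hi)
    return best, cut

def xy_cut(boxes):
    """Return boxes in reading order via recursive X/Y splitting."""
    if len(boxes) <= 1:
        return list(boxes)

    x_gap, x_cut = largest_gap(boxes, 0, 2)   # vertical cut candidate
    y_gap, y_cut = largest_gap(boxes, 1, 3)   # horizontal cut candidate

    if max(x_gap, y_gap) >= MIN_GAP:
        # Split along the axis with the larger gap; assign by box center.
        if y_gap >= x_gap:
            first  = [b for b in boxes if (b[1] + b[3]) / 2.0 <  y_cut]
            second = [b for b in boxes if (b[1] + b[3]) / 2.0 >= y_cut]
        else:
            first  = [b for b in boxes if (b[0] + b[2]) / 2.0 <  x_cut]
            second = [b for b in boxes if (b[0] + b[2]) / 2.0 >= x_cut]
        # Progress guarantee: only recurse if the split actually
        # produced two non-empty groups.
        if first and second:
            return xy_cut(first) + xy_cut(second)

    # Fallback: plain Y-then-X sort (top-to-bottom, left-to-right).
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

The description above notes that #179 came from finding gaps using object edges while assigning boxes by their centers; the one-group check and the fallback sort are the progress guarantee that keeps such a mismatch from recursing forever.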
Stage 4: Cross-Modal Matching — Tried, then removed

This is the paper's most complex contribution. We implemented reinsertion logic using our semantic information and benchmarked it. Scores dropped. We tried parameter adjustments and several variations but couldn't produce stable improvements. From what I recall, the pattern was that inaccuracies in cross-layout detection combined with the reinsertion logic to amplify errors rather than correct them. The experimental code was ultimately removed — the current codebase has no traces of it — and the experiment records have been lost. This is something we plan to document more carefully when we revisit it.

Current state

Our implementation started from the XY-Cut++ paper but converged through experimentation to Phase 3c (adaptive recursive segmentation) as the only reliably effective component. The code comments reflect this: it is honestly closer to an enhanced XY-Cut inspired by the paper's structure than a full XY-Cut++ implementation.

If you're implementing this yourself
One specific thing I'd highlight: Stages 2 and 4 are tightly coupled. Cross-layout detection without reliable reinsertion makes things worse, not better. If you're going to implement one, plan for both. Happy to discuss further.
Hi, I was wondering if you could provide some background on the differences between the XY-Cut++ paper and your implementation of it. Chiefly, I am looking at the remerging of elements at the end into a sorted order. The paper goes through a cross-modal matching approach that I see is not part of opendataloader's implementation.
The reason I'm asking is, I'd like to understand your thought process, or how you came to choose the implementation that you did. I have been struggling quite a bit with trying to interpret a specific part of the cross-modal matching section in the paper (semantic-specific tuned weights). Any insights you might have to share about that portion of the paper would also be a great help :)