
fix(pptx): process slide shapes in visual reading order#3393

Open
Adityav20 wants to merge 4 commits into
docling-project:mainfrom
Adityav20:pptx/fix-pptx-parsing

@Adityav20

Description:

This PR improves PPTX parsing by processing slide shapes in visual reading order instead of PowerPoint’s internal creation/z-order.

Previously, when a slide had multiple subheadings with separate bullet text boxes, bullet points could be grouped under the wrong subheading if the PPTX shape order did not match the visible slide layout.

The PPTX backend now sorts shapes top-to-bottom and left-to-right, with a small row tolerance for near-aligned objects. This keeps visually related subheadings and bullet lists together during extraction.
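A minimal sketch of the ordering idea, not the code in this PR: the `Box` class, the `reading_order` helper, and the 0.1 inch tolerance are all hypothetical names and values chosen for illustration. Shapes are sorted by their top edge, grouped into rows when their tops fall within the tolerance of the row's first shape, and then ordered left-to-right inside each row.

```python
from dataclasses import dataclass
from typing import List

EMU_PER_INCH = 914400  # PowerPoint positions are expressed in English Metric Units


@dataclass
class Box:
    """Hypothetical stand-in for a slide shape; only position matters here."""
    name: str
    top: int   # distance from the top edge of the slide, in EMU
    left: int  # distance from the left edge of the slide, in EMU


def reading_order(shapes: List[Box], row_tolerance: int = EMU_PER_INCH // 10) -> List[Box]:
    """Return shapes top-to-bottom, then left-to-right within a row.

    The default tolerance (0.1 inch) is an assumed value, not the one used by the PR.
    """
    rows: List[List[Box]] = []
    for shape in sorted(shapes, key=lambda s: (s.top, s.left)):
        # A shape joins the current row if its top is close to the row's first shape.
        if rows and abs(shape.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(shape)
        else:
            rows.append([shape])
    ordered: List[Box] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda s: s.left))
    return ordered


# Stored order differs from the visual order: the second bullet box comes first.
slide = [
    Box("bullets_b", top=4_000_000, left=500_000),
    Box("heading_a", top=1_000_000, left=500_000),
    Box("heading_b", top=3_000_000, left=500_000),
    Box("bullets_a", top=2_000_000, left=500_000),
]
names = [s.name for s in reading_order(slide)]

# Two near-aligned shapes: the slightly lower one sits further left and should come first.
near_aligned = [
    Box("right", top=1_010_000, left=5_000_000),
    Box("left", top=1_020_000, left=100_000),
]
row_names = [s.name for s in reading_order(near_aligned)]
```

Anchoring each row on its first shape keeps the grouping deterministic; a different row-detection rule (for example, one based on vertical overlap) would be an equally valid choice under other assumptions.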

A regression test was added for a slide where the second bullet textbox is stored before its subheading internally, but appears below it visually.

Issue resolved by this Pull Request:
Resolves #1324

Checklist:

  • [x] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

@github-actions
Contributor

github-actions Bot commented May 1, 2026

DCO Check Passed

Thanks @Adityav20, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented May 1, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
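As a rough local check, the title rule above can be tried with Python's `re` module. This is only a sketch: it assumes the `~=` condition behaves like a regular-expression search, and the `TITLE_RULE` and `matches_rule` names are hypothetical.

```python
import re

# The pattern is copied from the merge protection rule; everything else is illustrative.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def matches_rule(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


pr_title_ok = matches_rule("fix(pptx): process slide shapes in visual reading order")
no_prefix_ok = matches_rule("process slide shapes in visual reading order")
```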

I, rifi9351 <aditya.vikram@uni-weimar.de>, hereby add my Signed-off-by to this commit: bc09fca

Signed-off-by: rifi9351 <aditya.vikram@uni-weimar.de>
@codecov

codecov Bot commented May 3, 2026

Codecov Report

❌ Patch coverage is 5.71429% with 33 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/mspowerpoint_backend.py | 5.71% | 33 Missing ⚠️ |


Adityav20 added 2 commits May 3, 2026 23:55
Signed-off-by: rifi9351 <aditya.vikram@uni-weimar.de>
Signed-off-by: rifi9351 <aditya.vikram@uni-weimar.de>
Development

Successfully merging this pull request may close these issues.

PPTX parsing: bullet points not grouped correctly under subheadings
