
feat(extractor): impl HwpxExtractor for application/haansofthwpx #875

Open
kh3rld wants to merge 2 commits into main from feat/849-hwpx-extractor

kh3rld (Contributor) commented May 3, 2026

Summary

  • Adds HwpxExtractor for HWPX (Open HWPML / Hangul Word Processor XML) documents
  • Uses the unhwp crate (MIT, default-features = false, features = ["hwpx"]), which avoids writing custom OWPML XML parsing
  • Extracts headings, paragraphs, tables, and embedded images
  • Adds ZIP magic-byte detection for application/haansofthwpx in the MIME pipeline
  • Gates behind a new hwpx Cargo feature, included in formats and full aggregates
  • Adds HwpxExtractor to alef.toml exclude_types, since it is an internal implementation type, not a public API surface

Why unhwp instead of custom XML parsing?

HWPX is a ZIP archive of OWPML documents with a non-trivial schema (sections, paragraphs, inline runs, image resources). unhwp already models this schema, so the extractor needs essentially no custom parsing code. The crate is MIT-licensed and has no runtime deps when used with default-features = false.

Test plan

  • cargo test -p kreuzberg --features hwpx — unit tests in hwpx.rs
  • fixtures/format_specific/format_hwpx_standalone.json fixture added; run alef e2e generate && task test:e2e to verify across all language bindings
  • MIME detection: ZIP with Contents/content.hpf marker routes to application/haansofthwpx
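
For reviewers, the marker-based routing in the last bullet can be pictured roughly as follows. This is a minimal sketch and not the code in this PR: the function name `sniff_hwpx` and the constants are hypothetical, and the raw byte search stands in for whatever ZIP inspection the MIME pipeline actually performs. It relies only on the fact that ZIP stores entry names uncompressed.

```rust
/// Local-file-header signature that starts a non-empty ZIP archive.
const ZIP_MAGIC: &[u8] = b"PK\x03\x04";
/// Marker entry name taken from the test plan above.
const HWPX_MARKER: &[u8] = b"Contents/content.hpf";
/// MIME type named in this PR.
const HWPX_MIME: &str = "application/haansofthwpx";

/// Hypothetical helper (an assumption, not the PR's API): returns the HWPX
/// MIME type when `bytes` looks like a ZIP archive that mentions the marker entry.
fn sniff_hwpx(bytes: &[u8]) -> Option<&'static str> {
    // Anything that does not start with the ZIP signature is not HWPX.
    if !bytes.starts_with(ZIP_MAGIC) {
        return None;
    }
    // Entry names are stored uncompressed in ZIP headers, so a plain byte
    // search finds the marker. A real detector would read the central
    // directory instead of scanning the whole buffer.
    let has_marker = bytes
        .windows(HWPX_MARKER.len())
        .any(|window| window == HWPX_MARKER);
    has_marker.then_some(HWPX_MIME)
}
```

A plain ZIP without the marker (for example a DOCX or a generic archive) yields `None` here, so it would fall through to the pipeline's other ZIP-based checks.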

Closes #849

kh3rld added 2 commits May 3, 2026 06:13
HwpExtractor uses a CFB parser and cannot handle HWPX (ZIP-based XML).
Routing application/haansofthwpx to this extractor causes parsing
failures. Remove the incorrect MIME type claim; HWPX support will be
added via a dedicated extractor in #849.

Fixes #851

Introduces a dedicated extractor for HWPX (Open HWPML) documents using
the `unhwp` crate (MIT, default-features = false, features = ["hwpx"]).
Handles headings, paragraphs, tables, and embedded images. Adds ZIP
marker detection in the MIME pipeline and gates everything behind a new
`hwpx` Cargo feature (included in `formats` and `full`).

Closes #849
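
The first commit message turns on the difference between the two containers: a CFB parser cannot read a ZIP-based file. As a rough illustration only, the two can be told apart by their leading bytes. The enum and function names below are hypothetical and do not appear in this PR; the magic numbers are the standard CFB and ZIP signatures.

```rust
/// Container families relevant to the two formats discussed above.
#[derive(Debug, PartialEq, Eq)]
enum Container {
    /// Compound File Binary (OLE2), the container a CFB parser expects.
    Cfb,
    /// ZIP archive, the container HWPX uses.
    Zip,
    /// Neither signature matched.
    Unknown,
}

/// Standard CFB header signature.
const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
/// ZIP local-file-header signature ("PK\x03\x04").
const ZIP_LOCAL_HEADER_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

/// Hypothetical helper: classifies a buffer by its leading magic bytes.
fn classify_container(bytes: &[u8]) -> Container {
    if bytes.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else if bytes.starts_with(&ZIP_LOCAL_HEADER_MAGIC) {
        Container::Zip
    } else {
        Container::Unknown
    }
}
```

Under this sketch, a buffer classified as `Zip` should never reach a CFB-based extractor, which is the misrouting the first commit removes.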
@kh3rld kh3rld requested a review from Goldziher as a code owner May 3, 2026 13:33
@kh3rld kh3rld changed the title from "feat(extractor): add HwpxExtractor for application/haansofthwpx" to "feat(extractor): impl HwpxExtractor for application/haansofthwpx" May 3, 2026
@kh3rld kh3rld added the enhancement New feature or request label May 3, 2026
@kh3rld kh3rld moved this from Todo to In Review in Open-Source Kanban May 3, 2026
Development

Successfully merging this pull request may close these issues.

feat: implement dedicated HWPX extractor