feat(extractor): impl HwpxExtractor for application/haansofthwpx #875
Open
Introduces a dedicated extractor for HWPX (Open HWPML) documents using the `unhwp` crate (MIT, default-features = false, features = ["hwpx"]). Handles headings, paragraphs, tables, and embedded images. Adds ZIP marker detection in the MIME pipeline and gates everything behind a new `hwpx` Cargo feature (included in `formats` and `full`). Closes #849
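The feature gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `supported_mime_types` is a hypothetical function, and `application/pdf` is only a placeholder for formats that are always compiled in. The one detail taken from the description is that HWPX support exists only when the `hwpx` Cargo feature is enabled.

```rust
/// Hypothetical registry of MIME types this build can extract.
/// The name and shape are assumptions made for illustration.
pub fn supported_mime_types() -> Vec<&'static str> {
    // "application/pdf" is a placeholder for always-available formats.
    #[allow(unused_mut)] // `types` is never mutated when `hwpx` is disabled
    let mut types = vec!["application/pdf"];

    // Compiled in only when the crate is built with `--features hwpx`
    // (or with an aggregate feature that enables it).
    #[cfg(feature = "hwpx")]
    types.push("application/haansofthwpx");

    types
}
```

Built without the feature, the HWPX entry is absent and no HWPX-specific code is compiled, which is the point of keeping the `unhwp` dependency behind the gate.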
Summary
- `HwpxExtractor` for HWPX (Open HWPML / Hangul Word Processor XML) documents
- `unhwp` crate (MIT, `default-features = false, features = ["hwpx"]`): avoids writing custom OWPML XML parsing
- `application/haansofthwpx` in the MIME pipeline
- `hwpx` Cargo feature, included in `formats` and `full` aggregates
- `HwpxExtractor` added to `alef.toml` `exclude_types`: it is an internal implementation type, not a public API surface

Why `unhwp` instead of custom XML parsing?
HWPX is a ZIP-of-OWPML format with a non-trivial schema (sections, paragraphs, inline runs, image resources). `unhwp` models this correctly in ~zero lines of custom parsing and is MIT-licensed with no runtime deps when used with `default-features = false`.

Test plan
- `cargo test -p kreuzberg --features hwpx`: unit tests in `hwpx.rs`
- `fixtures/format_specific/format_hwpx_standalone.json` fixture added; run `alef e2e generate && task test:e2e` to verify across all language bindings
- `Contents/content.hpf` marker routes to `application/haansofthwpx`

Closes #849
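For reviewers, the marker routing in the test plan can be pictured with a small sketch. It assumes the list of ZIP entry names is already available and does not reproduce the actual detection code in this PR: `classify_zip` and the `application/zip` fallback are hypothetical, while the marker path and the HWPX MIME type come from the description above.

```rust
/// MIME type reported for HWPX documents (from the PR title).
const HWPX_MIME: &str = "application/haansofthwpx";
/// Archive entry named in the test plan as the HWPX marker.
const HWPX_MARKER: &str = "Contents/content.hpf";
/// Assumed fallback for ZIP archives with no recognised marker.
const ZIP_MIME: &str = "application/zip";

/// Hypothetical helper: pick a MIME type from the entry names of a ZIP archive.
pub fn classify_zip<'a, I>(entry_names: I) -> &'static str
where
    I: IntoIterator<Item = &'a str>,
{
    // An exact match on the marker path routes the archive to HWPX.
    if entry_names.into_iter().any(|name| name == HWPX_MARKER) {
        HWPX_MIME
    } else {
        ZIP_MIME
    }
}
```

Called with `vec!["Contents/content.hpf"]` this returns `application/haansofthwpx`; any archive without that entry falls through to the generic ZIP type.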