⚡️ Speed up method `ElementHtml._get_children_html` by 234% #4087
📄 234% (2.34x) speedup for `ElementHtml._get_children_html` in `unstructured/partition/html/convert.py`
⏱️ Runtime: 12.3 milliseconds → 3.69 milliseconds (best of 101 runs)
📝 Explanation and details
Here is a faster rewrite of your program, based on your line profiling results, the imported code constraints, and the code logic.
Key optimizations.
- The bottleneck is the recursive calls to `child.get_html_element(**kwargs)`, each of which re-creates a new `BeautifulSoup` object. Solution: pass down and reuse a single `BeautifulSoup` instance when building child HTML elements.
- Create the `soup` once at the topmost call and reuse it for all children and sub-children.
- Use `None` instead of `or []`, add fast-path checks on empty children, etc.
Below is the optimized version.
Explanation of improvements
- The `get_html_element` method now optionally receives a `_soup` kwarg. At the top of the tree it is `None`, so a new one is created. For all descendants, the same `soup` instance is passed via `_soup`, avoiding repeated parsing and allocation.
- `self.children` is checked once, and the attribute itself is kept as a list (not or-ed with an empty list at every call).
- `get_text_as_html()` doesn't need a soup argument, since it only returns a `Tag` (from the parent module).

This avoids recursively creating thousands of `BeautifulSoup` objects, which was the primary bottleneck found in the profiler. The result is much better performance, especially for large or complex trees.
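The pattern described above can be illustrated with a minimal, standard-library-only sketch. This is not the code from this PR: `FakeSoup`, `Node`, and `build_tree` are hypothetical names invented for the example, and `FakeSoup` is only a stand-in for an expensive-to-create object such as `BeautifulSoup`. It counts its own instantiations so the effect of threading one instance through the recursion via an optional `_soup` argument is visible.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


class FakeSoup:
    """Hypothetical stand-in for an expensive object such as BeautifulSoup."""

    instances_created = 0  # class-level counter, for demonstration only

    def __init__(self) -> None:
        FakeSoup.instances_created += 1

    def new_tag(self, name: str) -> dict:
        # A plain dict plays the role of a Tag in this sketch.
        return {"name": name, "children": []}


@dataclass
class Node:
    """Hypothetical tree node; not the real ElementHtml class."""

    tag: str
    children: list[Node] = field(default_factory=list)

    def get_html_element(self, _soup: Optional[FakeSoup] = None) -> dict:
        # Only the top-level call creates a FakeSoup; descendants reuse it.
        soup = _soup if _soup is not None else FakeSoup()
        element = soup.new_tag(self.tag)
        if not self.children:  # fast path for leaf nodes
            return element
        for child in self.children:
            element["children"].append(child.get_html_element(_soup=soup))
        return element


def build_tree(depth: int, width: int) -> Node:
    """Build a uniform tree with `width` children per node, `depth` levels deep."""
    if depth == 0:
        return Node("span")
    return Node("div", [build_tree(depth - 1, width) for _ in range(width)])
```

Calling `build_tree(3, 3).get_html_element()` on this sketch visits 40 nodes but constructs a single `FakeSoup`. Without the `_soup` argument, each node would construct its own.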
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-ElementHtml._get_children_html-mcsd67co` and push.