⚡️ Speed up function _assign_hash_ids by 34% (#4089)
📄 34% (0.34x) speedup for _assign_hash_ids in unstructured/partition/common/metadata.py

⏱️ Runtime: 88.4 microseconds → 65.8 microseconds (best of 15 runs)
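Best-of-N timings like these can be approximated with timeit.repeat; the stand-in loop below mirrors the optimized counting pass on synthetic data and is illustrative only, not the Codeflash harness or the real function:

```python
import timeit

def count_page_sequences(page_numbers):
    # Stand-in for the optimized single-pass, dict-based counting loop.
    page_seq_counts = {}
    seq_numbers = []
    for page_number in page_numbers:
        seq = page_seq_counts.get(page_number, 0) + 1
        page_seq_counts[page_number] = seq
        seq_numbers.append(seq)
    return seq_numbers

# Synthetic workload: 20 pages with 25 elements each.
pages = [page for page in range(1, 21) for _ in range(25)]
best = min(timeit.repeat(lambda: count_page_sequences(pages), number=1_000, repeat=15))
print(f"best of 15 runs: {best / 1_000 * 1e6:.1f} microseconds per call")
```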
📝 Explanation and details

The optimization replaces itertools.groupby with a simple dictionary-based counting approach in the _assign_hash_ids function.

Key change: instead of creating the intermediate lists page_numbers and page_seq_numbers and then running itertools.groupby over them, the optimized version uses a dictionary, page_seq_counts, to track the sequence number for each page in a single pass, as sketched below.
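A minimal sketch of the shape of that change, written against a bare list of page numbers rather than the library's Element objects (function names here are illustrative, not the actual diff). One caveat: groupby only groups consecutive runs, so the two variants agree when equal page numbers are adjacent, which the original groupby-based code already assumed:

```python
from itertools import groupby

# Before (illustrative): group consecutive equal page numbers, then emit
# 1..n sequence numbers within each group.
def seq_numbers_with_groupby(page_numbers):
    page_seq_numbers = []
    for _page, group in groupby(page_numbers):
        page_seq_numbers.extend(range(1, len(list(group)) + 1))
    return page_seq_numbers

# After (illustrative): a single pass with one O(1) dict lookup per element.
def seq_numbers_with_dict(page_numbers):
    page_seq_counts = {}
    seq_numbers = []
    for page_number in page_numbers:
        seq = page_seq_counts.get(page_number, 0) + 1
        page_seq_counts[page_number] = seq
        seq_numbers.append(seq)
    return seq_numbers

assert seq_numbers_with_groupby([1, 1, 2, 2, 2]) == [1, 2, 1, 2, 3]
assert seq_numbers_with_dict([1, 1, 2, 2, 2]) == [1, 2, 1, 2, 3]
```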
Why it's faster:

- No intermediate collections: the original builds the full page_numbers list upfront and then feeds it to groupby; the optimized version processes elements directly, with no intermediate lists to allocate.
- No itertools.groupby overhead: groupby constructs a (key, group-iterator) pair for every run of equal page numbers, and each group must be materialized to be counted. The per-element dictionary lookup page_seq_counts.get(page_number, 0) is a cheap O(1) operation inside a single O(n) loop.

Performance characteristics: the optimization is particularly effective for documents with many pages or elements, and the test results show that even trivial inputs benefit (empty lists see 300%+ speedups). The 34% overall speedup comes from eliminating the itertools.groupby bottleneck, which consumed 19.5% + 6.3% of the original runtime according to the line profiler; a minimal profiling sketch follows.
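Per-line percentages like those can be reproduced with the line_profiler package (pip install line_profiler); the snippet below profiles the illustrative groupby helper from the sketch above rather than the real _assign_hash_ids:

```python
from itertools import groupby
from line_profiler import LineProfiler

def seq_numbers_with_groupby(page_numbers):
    page_seq_numbers = []
    for _page, group in groupby(page_numbers):
        page_seq_numbers.extend(range(1, len(list(group)) + 1))
    return page_seq_numbers

# Synthetic workload: 100 pages with 50 elements each.
data = [page for page in range(1, 101) for _ in range(50)]

profiler = LineProfiler()
profiled = profiler(seq_numbers_with_groupby)  # decorator-style wrapping
profiled(data)
profiler.print_stats()  # prints % time per line, like the 19.5% + 6.3% cited above
```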
✅ Correctness verification report:

⚙️ Existing Unit Tests and Runtime
partition/common/test_metadata.py::test_assign_hash_ids_produces_unique_and_deterministic_SHA1_ids_even_for_duplicate_elements
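The standalone sketch below only illustrates the property that test checks; the hash_id helper and its inputs are hypothetical and not the library's actual hashing scheme:

```python
import hashlib

def hash_id(text, page_number, seq_on_page):
    # Hypothetical: SHA-1 over content plus positional context, so that
    # elements with identical text still get distinct ids.
    return hashlib.sha1(f"{text}-{page_number}-{seq_on_page}".encode()).hexdigest()

def test_duplicate_elements_get_unique_deterministic_ids():
    elements = [("Lorem ipsum", 1), ("Lorem ipsum", 1), ("Lorem ipsum", 2)]
    page_seq_counts = {}
    ids = []
    for text, page in elements:
        seq = page_seq_counts.get(page, 0) + 1
        page_seq_counts[page] = seq
        ids.append(hash_id(text, page, seq))
    assert len(set(ids)) == len(ids)  # unique despite duplicate text
    assert ids == [hash_id("Lorem ipsum", 1, 1),
                   hash_id("Lorem ipsum", 1, 2),
                   hash_id("Lorem ipsum", 2, 1)]  # deterministic across runs
```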
🌀 Generated Regression Tests and Runtime
🔎 Concolic Coverage Tests and Runtime
codeflash_concolic_ktxbqhta/tmpb11w96m9/test_concolic_coverage.py::test__assign_hash_ids
To edit these changes, run git checkout codeflash/optimize-_assign_hash_ids-memtfran and push.