
Conversation

aseembits93
Contributor

📄 34% (0.34x) speedup for `_assign_hash_ids` in `unstructured/partition/common/metadata.py`

⏱️ Runtime: 88.4 microseconds → 65.8 microseconds (best of 15 runs)

📝 Explanation and details

The optimization replaces `itertools.groupby` with a simple dictionary-based counting approach in the `_assign_hash_ids` function.

**Key change:** Instead of creating intermediate lists (`page_numbers` and `page_seq_numbers`) and using `itertools.groupby`, the optimized version uses a dictionary `page_seq_counts` to track sequence numbers for each page in a single pass.
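A minimal sketch of the pattern being described (paraphrased from the explanation above, not the exact library code; the hash-assignment step itself is elided):

```python
from itertools import groupby

# Before (as described): build page_numbers up front, then derive
# per-page sequence numbers with itertools.groupby.
def page_seq_numbers_original(elements):
    page_numbers = [e.metadata.page_number for e in elements]
    return [seq for _, group in groupby(page_numbers) for seq, _ in enumerate(group)]

# After (as described): one pass, with a dict tracking the next
# sequence number for each page.
def page_seq_numbers_optimized(elements):
    page_seq_counts = {}
    seq_numbers = []
    for e in elements:
        page = e.metadata.page_number
        seq = page_seq_counts.get(page, 0)
        seq_numbers.append(seq)
        page_seq_counts[page] = seq + 1
    return seq_numbers

# Note: the two agree when elements arrive in page order (the normal
# partitioning output); groupby would restart the count if a page
# number reappeared non-consecutively, while the dict keeps counting.
```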

**Why it's faster:**

- **Eliminates intermediate lists:** The original code builds a full `page_numbers` list upfront and then processes it with `groupby`. The optimized version processes elements directly, without intermediate collections.
- **Removes `itertools.groupby` overhead:** `groupby` constructs a grouper iterator per page and runs extra iterator machinery for every element; the dictionary lookup `page_seq_counts.get(page_number, 0)` is a single cheap O(1) operation per element (see the micro-benchmark sketch after this list).
- **Single-pass processing:** Instead of two passes (one to collect page numbers, another to generate sequence numbers), the optimization does everything in one loop over the elements.
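A rough, hypothetical micro-benchmark of the two idioms on page-ordered input (not the harness Codeflash uses; absolute timings will vary by machine):

```python
import timeit
from itertools import groupby

page_numbers = [n for n in range(1, 51) for _ in range(20)]  # 50 pages x 20 elements

def with_groupby():
    return [s for _, g in groupby(page_numbers) for s, _ in enumerate(g)]

def with_dict():
    counts, seqs = {}, []
    for p in page_numbers:
        s = counts.get(p, 0)
        seqs.append(s)
        counts[p] = s + 1
    return seqs

assert with_groupby() == with_dict()  # identical results on page-ordered input
print("groupby:", timeit.timeit(with_groupby, number=10_000))
print("dict:   ", timeit.timeit(with_dict, number=10_000))
```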

**Performance characteristics:** The optimization helps most on documents with many pages or elements, and the reduced per-call setup shows up even on empty inputs, where the tests report 300%+ speedups. The 34% overall speedup comes from eliminating the `itertools.groupby` bottleneck, which consumed 19.5% + 6.3% of the original runtime according to the line profiler.
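The per-line percentages above come from Codeflash's own profiling; to reproduce that kind of breakdown yourself, the standard `line_profiler` package can be used along these lines (a generic sketch assuming `pip install line_profiler`, not the exact tooling Codeflash ran):

```python
from line_profiler import LineProfiler

from unstructured.partition.common.metadata import _assign_hash_ids

def profile_assign_hash_ids(elements):
    lp = LineProfiler()
    profiled = lp(_assign_hash_ids)  # LineProfiler instances act as decorators
    profiled(elements)
    lp.print_stats()  # prints the % of time spent on each source line
```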

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 26 Passed |
| 🌀 Generated Regression Tests | 2 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `partition/common/test_metadata.py::test_assign_hash_ids_produces_unique_and_deterministic_SHA1_ids_even_for_duplicate_elements` | 39.5μs | 31.3μs | 26.2% ✅ |
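The test name above says the ids are deterministic SHA-1 digests that remain unique even for duplicate elements. A hypothetical illustration of why the per-page sequence number matters for that property (the actual fields `unstructured` hashes are not shown in this PR, so the inputs below are placeholders):

```python
import hashlib

def hypothetical_hash_id(text: str, page_number: int, seq_on_page: int) -> str:
    # Duplicate elements share text and page number; the per-page
    # sequence number is what keeps their digests distinct.
    payload = f"{text}|{page_number}|{seq_on_page}".encode()
    return hashlib.sha1(payload).hexdigest()

a = hypothetical_hash_id("same text", page_number=1, seq_on_page=0)
b = hypothetical_hash_id("same text", page_number=1, seq_on_page=1)
assert a != b                                        # unique despite identical text
assert a == hypothetical_hash_id("same text", 1, 0)  # deterministic across runs
```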
🌀 Generated Regression Tests and Runtime
```python
from __future__ import annotations

from typing import Optional

import pytest  # used for our unit tests

from unstructured.documents.elements import Element
from unstructured.partition.common.metadata import _assign_hash_ids


class CoordinatesMetadata:
    def __init__(self, points=None, system=None):
        self.points = points
        self.system = system

class ElementMetadata:
    def __init__(
        self,
        filename: Optional[str] = None,
        page_number: Optional[int] = None,
        coordinates: Optional[CoordinatesMetadata] = None,
        detection_origin: Optional[str] = None,
    ):
        self.filename = filename
        self.page_number = page_number
        self.coordinates = coordinates
        self.detection_origin = detection_origin

# ----------------- UNIT TESTS -----------------

# Helper to create an Element with required metadata
def make_element(
    text: str = "",
    filename: str = "file.pdf",
    page_number: int = 1,
    element_id: Optional[str] = None,
    detection_origin: Optional[str] = None,
):
    metadata = ElementMetadata(filename=filename, page_number=page_number)
    element = Element(
        element_id=element_id,
        metadata=metadata,
        detection_origin=detection_origin,
    )
    element.text = text  # set after construction, as the second test file below does
    return element

# ----------------- BASIC TEST CASES -----------------

def test_empty_input_list():
    # Should handle empty input gracefully
    out = _assign_hash_ids([])  # 5.62μs -> 1.30μs (333% faster)
    assert out == []  # an empty input should come back as an empty list

#------------------------------------------------
# (second generated test file)
from unstructured.documents.elements import Element
from unstructured.partition.common.metadata import _assign_hash_ids


class ElementMetadata:
    def __init__(self, filename=None, page_number=None):
        self.filename = filename
        self.page_number = page_number
        self.coordinates = None
        self.detection_origin = None

# --- Unit Tests ---

# Helper to create elements for testing
def make_element(text, filename="file.txt", page_number=1):
    metadata = ElementMetadata(filename=filename, page_number=page_number)
    e = Element(metadata=metadata)
    e.text = text
    return e

# 1. BASIC TEST CASES

def test_empty_elements_list():
    # Should handle empty list gracefully
    result = _assign_hash_ids([])  # 5.04μs -> 1.14μs (342% faster)
    assert result == []  # nothing to hash, nothing returned

#------------------------------------------------
from unstructured.documents.elements import Element
from unstructured.partition.common.metadata import _assign_hash_ids

def test__assign_hash_ids():
    _assign_hash_ids([Element(element_id=None, coordinates=None, coordinate_system=None, metadata=None, detection_origin=None)])
```
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_ktxbqhta/tmpb11w96m9/test_concolic_coverage.py::test__assign_hash_ids` | 38.2μs | 32.1μs | 19.1% ✅ |

To edit these changes, `git checkout codeflash/optimize-_assign_hash_ids-memtfran` and push.

Codeflash

codeflash-ai bot and others added 3 commits August 22, 2025 12:37
@aseembits93
Contributor Author

@qued hope you could review it :) Best,
