
Fix MultiIndex.difference not working with PyArrow timestamps (#61382), and some ruff formatting fixes #61391


Status: Open. NEREUScode wants to merge 3 commits into main.

NEREUScode

Problem

The MultiIndex.difference method fails to remove entries when the index contains PyArrow-backed timestamps (timestamp[ns][pyarrow]). This occurs because direct tuple comparisons with PyArrow scalar types are unreliable during membership checks, causing entries to remain unexpectedly.

Example:

from pandas import DataFrame

# PyArrow timestamp index (requires pyarrow to be installed)
df = DataFrame(...).astype({"date": "timestamp[ns][pyarrow]"}).set_index(["id", "date"])
idx_val = df.index[0]
new_index = df.index.difference([idx_val])  # Fails to remove idx_val

Solution
Code Conversion: Map other values to integer codes compatible with the original index's levels.

Engine Validation: Use the MultiIndex's internal engine for membership checks, ensuring accurate handling of PyArrow types.

Mask-Based Exclusion: Create a boolean mask to filter out matched entries, then reconstruct the index.
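
The three steps above can be illustrated with a minimal sketch built only on public pandas API. This is not the code in this PR: the helper name `difference_via_mask` is hypothetical, and `MultiIndex.get_indexer` stands in for the internal engine lookup. The sketch assumes the index is unique (`get_indexer` raises on duplicates), and unlike `MultiIndex.difference` it does not sort the result. The example uses plain `datetime64[ns]` values so it runs without pyarrow.

```python
import numpy as np
import pandas as pd


def difference_via_mask(index: pd.MultiIndex, other) -> pd.MultiIndex:
    """Return the entries of `index` that do not appear in `other`.

    Hypothetical helper for illustration; assumes `index` is unique.
    """
    # Step 1: turn `other` into a MultiIndex so its values can be matched
    # against the levels of `index`.
    other = pd.MultiIndex.from_tuples(list(other), names=index.names)

    # Step 2: look up each entry of `other` in `index`.
    # get_indexer returns the position of each match, or -1 if absent.
    positions = index.get_indexer(other)

    # Step 3: build a boolean mask that drops the matched positions,
    # then rebuild the index from the surviving entries.
    mask = np.ones(len(index), dtype=bool)
    mask[positions[positions >= 0]] = False
    return index[mask]


idx = pd.MultiIndex.from_arrays(
    [[1, 2, 3], pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])],
    names=["id", "date"],
)
remaining = difference_via_mask(idx, [idx[0]])
```

Here `remaining` keeps the two entries other than `idx[0]`, in their original order.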

Testing
Added a test in pandas/tests/indexes/multi/test_setops.py that:

Creates a MultiIndex with PyArrow timestamps.

Validates difference correctly removes entries.

Skips the test if PyArrow is not installed.

Use Case Impact
Fixes scenarios where users filter hierarchical datasets with PyArrow timestamps, such as:

# Remove specific timestamps from a time-series index
clean_index = raw_index.difference(unwanted_timestamps)

Closes #61382.


@NEREUScode NEREUScode left a comment

I can't understand the error in my test code. Can someone review it and tell me what's wrong?

Successfully merging this pull request may close these issues.

BUG: Multindex difference not working on columns with type Timestamp[ns][pyarrow]