BUG: Raise MergeError when suffixes result in duplicate column names … #61422
Thanks for your contribution! Just a quick note: you don't need to write
For reference, you can check the contributing guidelines here: https://pandas.pydata.org/docs/development/contributing_codebase.html#documenting-your-code
Thanks for the pointers. I'll get those added here soon. I'm trying to track down why Unit Tests / Linux-32-bit (pull_request) is failing. I didn't change anything that should have affected Series, so it's kind of weird. I also can't get pytest to run normally in my dev environment yet, so I haven't been able to replicate the failure locally. So there's still a little more work to do here.
@Farsidetfs I believe the CI failure is not related to your changes. It appears to be caused by the Cython version: pandas unit tests fail with
Added a few comments from a bird's eye view. Thanks for the change.
pandas/core/reshape/merge.py
Outdated
if len(left_collisions) > 0:
    raise MergeError(
        "Passing 'suffixes' which cause duplicate columns "
        f"{set(left_collisions)} is not allowed"
    )
if len(right_collisions) > 0:
    raise MergeError(
        "Passing 'suffixes' which cause duplicate columns "
        f"{set(right_collisions)} is not allowed"
    )
I would recommend combining these into a single error to reduce repetition (bonus points for combining it with the pre-existing error just a few lines below).
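One way to act on this suggestion is sketched below. It is only an illustration, not the code that ended up in the PR: `MergeError` is redefined locally as a stand-in for `pandas.errors.MergeError` so the snippet runs without pandas, and `raise_on_suffix_collisions` is a hypothetical helper name.

```python
class MergeError(ValueError):
    """Local stand-in for pandas.errors.MergeError (assumption: keeps the sketch pandas-free)."""


def raise_on_suffix_collisions(left_collisions, right_collisions):
    """Raise one MergeError covering collisions from both sides (hypothetical helper)."""
    # Union the two collections so each offending label is reported once.
    collisions = set(left_collisions) | set(right_collisions)
    if collisions:
        raise MergeError(
            "Passing 'suffixes' which cause duplicate columns "
            f"{collisions} is not allowed"
        )


# No collisions: the call returns silently.
raise_on_suffix_collisions([], [])

# A collision on either side triggers the single shared error.
try:
    raise_on_suffix_collisions(["a_x"], [])
except MergeError as err:
    message = str(err)
```

Merging the two collections before raising also means a user with collisions on both sides sees all of them in one message instead of fixing them one at a time.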
pandas/core/reshape/merge.py
Outdated
# Check for duplicates created by suffixes
left_collisions = llabels.intersection(right.difference(to_rename))
right_collisions = rlabels.intersection(left.difference(to_rename))
if len(left_collisions) > 0:
Doesn't just "if not left_collisions.empty:" work? The same goes for the similar check below.
pandas/core/reshape/merge.py
Outdated
@@ -3058,6 +3058,20 @@ def renamer(x, suffix: str | None):
    llabels = left._transform_index(lrenamer)
    rlabels = right._transform_index(rrenamer)

    # Check for duplicates created by suffixes
For new readers of this code, the comment might not be descriptive enough. Your code finds duplicate column names that the suffixes would create across the two DataFrames, but there is a separate section just below your new code that also checks for duplicates (in that case, would-be columns that end up duplicated within the same DataFrame).
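The distinction between the two checks can be illustrated with a small pandas-free model. This is a simplified sketch, not the pandas implementation: plain lists stand in for the non-key column labels, the input labels are assumed to be unique on each side, and all three function names are hypothetical.

```python
def apply_suffixes(left, right, suffixes):
    """Append a suffix to every label present in both lists (simplified model)."""
    to_rename = set(left) & set(right)
    lsuffix, rsuffix = suffixes
    llabels = [f"{c}{lsuffix}" if c in to_rename else c for c in left]
    rlabels = [f"{c}{rsuffix}" if c in to_rename else c for c in right]
    return llabels, rlabels, to_rename


def cross_frame_collisions(left, right, suffixes):
    """Renamed labels on one side that match an untouched label on the other side."""
    llabels, rlabels, to_rename = apply_suffixes(left, right, suffixes)
    left_hits = set(llabels) & (set(right) - to_rename)
    right_hits = set(rlabels) & (set(left) - to_rename)
    return left_hits | right_hits


def within_frame_duplicates(left, right, suffixes):
    """Labels that appear more than once on the same side after renaming."""
    llabels, rlabels, _ = apply_suffixes(left, right, suffixes)
    dups = set()
    for labels in (llabels, rlabels):
        dups |= {c for c in labels if labels.count(c) > 1}
    return dups


suffixes = ("_x", "_y")

# Left "a" becomes "a_x", which clashes with the right side's existing "a_x".
cross_case = (["a"], ["a", "a_x"])
# Left "a" becomes "a_x", which clashes with the left side's own "a_x".
within_case = (["a", "a_x"], ["a"])

cross_result = cross_frame_collisions(*cross_case, suffixes)
within_result = within_frame_duplicates(*within_case, suffixes)
```

In this model each check misses the other's case, which is why a comment that says only "duplicates created by suffixes" does not tell a reader which of the two situations the block handles.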
@nikaltipar I think this should be ready now. Please let me know if I've missed anything. I took your advice and combined the two checks, with slight modifications to improve efficiency by using sets throughout rather than just converting at the end.
Thanks for taking care of that, @Farsidetfs! It looks good to me, no other comments from my side. Thanks for adding the unit tests, too!
@nikaltipar Could you rebase on the main branch to trigger CI again?
I am not able to; I'll have to wait for @Farsidetfs.
This PR addresses GH#61402 by ensuring that merge() raises a MergeError if the specified suffixes fail to eliminate column name collisions.
The suffix logic now explicitly checks for overlaps after applying suffixes and raises a clear error if duplicates remain.
Includes a test in test_merge.py to confirm that suffixes like ('_dup', '_dup') raise the expected error when merging conflicting column names.
Closes GH#61402.
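To see why identical suffixes such as ('_dup', '_dup') cannot resolve a collision, consider the pandas-free model below. It is a sketch under stated assumptions rather than the test added in this PR: `suffixed_columns` and `duplicated_labels` are hypothetical helpers, the column name `value` is made up, and key columns are assumed to have been excluded already.

```python
from collections import Counter


def suffixed_columns(left, right, suffixes):
    """Return the combined output labels after suffixing overlapping names (simplified model)."""
    overlap = set(left) & set(right)
    lsuffix, rsuffix = suffixes
    renamed_left = [f"{c}{lsuffix}" if c in overlap else c for c in left]
    renamed_right = [f"{c}{rsuffix}" if c in overlap else c for c in right]
    return renamed_left + renamed_right


def duplicated_labels(labels):
    """Return the set of labels that occur more than once."""
    return {label for label, count in Counter(labels).items() if count > 1}


# Identical suffixes map both "value" columns to the same name.
same_suffix = duplicated_labels(
    suffixed_columns(["value"], ["value"], ("_dup", "_dup"))
)

# Distinct suffixes keep the two columns apart.
distinct_suffix = duplicated_labels(
    suffixed_columns(["value"], ["value"], ("_x", "_y"))
)
```

A non-empty result from `duplicated_labels` corresponds to the situation in which, per the description above, merge() is expected to raise a MergeError.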