-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
Conversation
…y in .array/EA.to_numpy()
@@ -269,6 +270,8 @@ class ExtensionArray: | |||
# strictly less than 2000 to be below Index.__pandas_priority__. | |||
__pandas_priority__ = 1000 | |||
|
|||
_readonly = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use arr.flags.writeable to be consistent with numpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this was easier for a quick POC ;)
It would indeed keep it more consistent in usage, so that might be a reason to add a flags
attribute, so code that needs to work with both ndarray or EA can use one code path. But I don't think we would ever add any of the other flags that numpy has, so not sure it would then be worth to add a nested attribute for this.
elif self._readonly and astype_is_view(self.dtype, result.dtype): | ||
# If the ExtensionArray is readonly, make the numpy array readonly too | ||
result = result.view() | ||
result.flags.writeable = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be done below the setting of na_value on L616?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, because in that case the result
array is already a copy, so no need to take a read-only view in that case
pd.date_range("2000", periods=4).array, | ||
pd.timedelta_range("2000", periods=4).array, | ||
pd.date_range("2000", periods=4).array.copy(), | ||
pd.timedelta_range("2000", periods=4).array.copy(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
._values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems that my test updates are a bit of a mix of both .array/values.copy()
or _values
. Will more consistently use _values
@@ -969,6 +975,8 @@ def __getitem__( | |||
# _NestedSequence[Union[bool, int]]], ...]]" | |||
data_slice = self.to_dense()[key] # type: ignore[index] | |||
elif isinstance(key, slice): | |||
if key == slice(None): | |||
return type(self)._simple_new(self.sp_values, self.sp_index, self.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this special case needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid that arr[:]
makes a copy, and I got there because the default EA.view()
implementation uses that.
But can add a comment to clarify. There is a comment just below about "# Avoid densifying when handling contiguous slices", but that does not actually avoid making a copy in its current implementation because it translates the slice in integer indices. While for the special case of a full slice, that should not even be needed.
i get why .values and .array are made read-only, but why are we bothering with to_numpy? |
That's a good question, I didn't really think about it deeply .. But so for the non-extension dtypes, we also did it for I do think there is value in being consistent in those different ways to get a numpy array from the pandas object. So could also ask, why not for |
I don't feel strongly about this, but asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes. Without that, most of this is just boilerplate edits to The main reason i can think of to treat to_numpy different from .array and .values is that it has an explicit |
Addresses one of the remaining TODO items from #48998
Similar as #51082 and some follow-up PRs, ensuring we also mark EAs as read-only like we do for numpy arrays, when the user gets the underlying EA from a pandas object.
For that purpose, added a
_readonly
attribute to the EA class that is False by default.Still need to add more tests and fix a bunch of tests