Skip to content

CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

Addresses one of the remaining TODO items from #48998

Similar as #51082 and some follow-up PRs, ensuring we also mark EAs as read-only like we do for numpy arrays, when the user gets the underlying EA from a pandas object.
For that purpose, added a _readonly attribute to the EA class that is False by default.

Still need to add more tests and fix a bunch of tests

@@ -269,6 +270,8 @@ class ExtensionArray:
# strictly less than 2000 to be below Index.__pandas_priority__.
__pandas_priority__ = 1000

_readonly = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use arr.flags.writeable to be consistent with numpy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this was easier for a quick POC ;)
It would indeed keep it more consistent in usage, so that might be a reason to add a flags attribute, so code that needs to work with both ndarray or EA can use one code path. But I don't think we would ever add any of the other flags that numpy has, so not sure it would then be worth to add a nested attribute for this.

elif self._readonly and astype_is_view(self.dtype, result.dtype):
# If the ExtensionArray is readonly, make the numpy array readonly too
result = result.view()
result.flags.writeable = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be done below the setting of na_value on L616?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because in that case the result array is already a copy, so no need to take a read-only view in that case

pd.date_range("2000", periods=4).array,
pd.timedelta_range("2000", periods=4).array,
pd.date_range("2000", periods=4).array.copy(),
pd.timedelta_range("2000", periods=4).array.copy(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

._values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems that my test updates are a bit of a mix of both .array/values.copy() or _values. Will more consistently use _values

@@ -969,6 +975,8 @@ def __getitem__(
# _NestedSequence[Union[bool, int]]], ...]]"
data_slice = self.to_dense()[key] # type: ignore[index]
elif isinstance(key, slice):
if key == slice(None):
return type(self)._simple_new(self.sp_values, self.sp_index, self.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this special case needed?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid that arr[:] makes a copy, and I got there because the default EA.view() implementation uses that.

But can add a comment to clarify. There is a comment just below about "# Avoid densifying when handling contiguous slices", but that does not actually avoid making a copy in its current implementation because it translates the slice in integer indices. While for the special case of a full slice, that should not even be needed.

@jbrockmendel
Copy link
Member

i get why .values and .array are made read-only, but why are we bothering with to_numpy?

@jorisvandenbossche
Copy link
Member Author

That's a good question, I didn't really think about it deeply .. But so for the non-extension dtypes, we also did it for .values / __array__ and to_numpy() (#51082), and so followed along here.

I do think there is value in being consistent in those different ways to get a numpy array from the pandas object. So could also ask, why not for to_numpy()? And then compared to .values, to_numpy() actually gives you more control with the ability to ask for a copy.
(in practice the implementation of __array__ and to_numpy() are also quite overlapping for the EAs.

@jbrockmendel
Copy link
Member

So could also ask, why not for to_numpy()?

I don't feel strongly about this, but asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes. Without that, most of this is just boilerplate edits to __getitem__ methods.

The main reason i can think of to treat to_numpy different from .array and .values is that it has an explicit copy keyword. With copy=False, the user ideally understands that they are getting a view on existing data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants