Skip to content

Avoid memory copy in obstore write #2972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

tomwhite
Copy link
Contributor

@tomwhite tomwhite commented Apr 8, 2025

This removes an unnecessary memory copy when writing using obstore, similar to #2944

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 8, 2025
Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tom! 🙏

Had a question below

@@ -145,15 +147,15 @@ async def set(self, key: str, value: Buffer) -> None:

self._check_writable()

buf = value.to_bytes()
buf = value.as_numpy_array().view(np.uint8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to make this a standard method? This would make it easier to use throughout the codebase

Some possible names:

  • to_bytelike
  • to_binary
  • to_uint8s
  • ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i'm also curious to know why we have a to_bytes() method that makes a copy, when .as_numpy_array().view(np.uint8) achieves the same effective result (an iterable of bytes) without a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was suggested in #2925, so I've added an as_bytes_like method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes objects always own their memory. So creating a new bytes object means making a copy

If we prefer to move away from this approach, we could add a deprecation cycle to move from to_bytes to as_bytes_like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think FsspecStore works with BytesLike since it didn't accept a memoryview object when I tried it. So we probably need to keep to_bytes().

Copy link
Member

@jakirkham jakirkham Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this would be a good issue to raise with fsspec if we can reproduce with it directly

If not, a Zarr issue would be welcome

Ideally these copies should be avoidable in the FsspecStore case as well

Edit: Ofc this is non-blocking, just would like to improve the performance in all cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some possible names:

I'd also suggest as_buffer_like, because in my head "bytes" in Python is a type that always owns its own memory, while "buffer" doesn't necessarily own its memory, as in the buffer protocol or collections.abc.Buffer.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants