feat: implement virtual arrays #3364

Merged
merged 141 commits into from
Mar 19, 2025

Conversation

ikrommyd
Collaborator

@ikrommyd ikrommyd commented Jan 9, 2025

Implement Virtual Buffers in Awkward Array

This PR introduces the VirtualArray class, which enables virtual buffers at the lowest level of an Awkward Array's structure. Awkward Arrays are fundamentally tree-like, with data stored in 1D array buffers (e.g., NumPy, CuPy, or JAX). The VirtualArray class acts as a virtual buffer, allowing deferred materialization of data when needed.

Key Features

  • Lazy Data Generation: A VirtualArray requires a known dtype and shape upon instantiation, along with a generator function responsible for producing the actual buffer data. This generator function is designed for use cases where materialization is expensive, such as disk reads.
  • NPLike Consistency: Each VirtualArray is tied to a specific nplike (NumPy or CuPy). Switching between different nplike backends is not allowed unless the array is first materialized.
  • Optimized Materialization: Some operations, such as trivial slicing, can be performed without triggering materialization, preserving the virtual nature of the array. Other operations that inherently require data access will trigger materialization on demand.
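The lazy-materialization behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the VirtualArray class from this PR: the class name `LazyBuffer`, its attributes, and the slicing rule are all assumptions made for the example.

```python
import numpy as np


class LazyBuffer:
    """Hypothetical stand-in for a virtual 1D buffer (not the PR's VirtualArray)."""

    def __init__(self, shape, dtype, generator):
        self.shape = shape              # known up front, without touching data
        self.dtype = np.dtype(dtype)    # known up front, without touching data
        self._generator = generator     # called at most once, on first access
        self._data = None

    @property
    def is_materialized(self):
        return self._data is not None

    def materialize(self):
        if self._data is None:
            data = np.asarray(self._generator())
            # Guard against a generator that disagrees with the declared metadata.
            if data.shape != self.shape or data.dtype != self.dtype:
                raise TypeError("generator output does not match declared shape/dtype")
            self._data = data
        return self._data

    def __getitem__(self, where):
        # A plain slice only needs the length, so the result can stay virtual.
        if isinstance(where, slice) and not self.is_materialized:
            start, stop, step = where.indices(self.shape[0])
            length = len(range(start, stop, step))
            return LazyBuffer((length,), self.dtype, lambda: self.materialize()[where])
        # Anything else (integer lookup, fancy indexing) needs the real data.
        return self.materialize()[where]


calls = []


def expensive_read():
    calls.append("read")               # stands in for a disk read
    return np.arange(10, dtype=np.int64)


buf = LazyBuffer((10,), np.int64, expensive_read)
sliced = buf[2:8]                      # still virtual: nothing has been read
first = sliced[0]                      # forces the read
```

In this sketch the slice only computes its new length from the declared shape, so the expensive read is deferred until an element is actually requested.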

Adjustments to the Codebase

  • Array Operations: Updates to array_module ensure that certain operations remain virtual when possible while others materialize data when necessary.
  • Kernel Execution: Awkward Array kernels now enforce materialization before processing virtual arrays.
  • Conversion Functions: Functions of the form ak.to_* are updated to ensure that virtual arrays are materialized before conversion.
  • Layout Enhancements: Additional helper methods are introduced in each layout to efficiently check whether an array contains any virtual buffers.
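As a rough illustration of the last three points, the sketch below walks a toy layout tree to check for virtual buffers and materializes inputs before a kernel-like function runs. The node classes and `is_any_virtual` are hypothetical; `materialize_if_virtual` borrows its name from the discussion in this PR, but the signature shown here is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Virtual:
    """Hypothetical placeholder for a buffer that has not been loaded yet."""
    generator: Callable[[], list]


@dataclass
class Leaf:
    """Hypothetical leaf node holding one flat data buffer."""
    data: Any


@dataclass
class ListNode:
    """Hypothetical list node: an offsets buffer plus nested content."""
    offsets: Any
    content: Any


def materialize_if_virtual(*buffers):
    # Name borrowed from the PR discussion; this signature is an assumption.
    return tuple(b.generator() if isinstance(b, Virtual) else b for b in buffers)


def is_any_virtual(node):
    # Walk the tree and report whether any buffer is still virtual.
    if isinstance(node, Leaf):
        return isinstance(node.data, Virtual)
    return isinstance(node.offsets, Virtual) or is_any_virtual(node.content)


def list_lengths(node):
    # A "kernel" needs real numbers, so it materializes its inputs first.
    (offsets,) = materialize_if_virtual(node.offsets)
    return [stop - start for start, stop in zip(offsets, offsets[1:])]


layout = ListNode(
    Virtual(lambda: [0, 2, 2, 5]),
    Leaf(Virtual(lambda: [1.1, 2.2, 3.3, 4.4, 5.5])),
)
```

Note that `list_lengths` only loads the offsets it needs; the content buffer of `layout` is never touched.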

This should be reviewed together with ikrommyd#1, which runs the whole awkward CI by creating a virtual array inside every NumpyArray and every Index and materializing it on the spot. It should also be reviewed together with this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize where most of the tests pass without even needing to materialize inside the NumpyArray. You can compare this branch with the branch from ikrommyd#1 to see the changes.

@ikrommyd ikrommyd marked this pull request as draft January 9, 2025 21:48
@ikrommyd ikrommyd force-pushed the virtual-arrays branch 3 times, most recently from e76182b to 3b9ad2f Compare January 21, 2025 13:41
pfackeldey and others added 18 commits January 23, 2025 13:26

Verified

This commit was signed with the committer’s verified signature.
ikrommyd Iason Krommydas
…level

…uffers working in a notebook

@ikrommyd
Collaborator Author

ikrommyd commented Mar 7, 2025

@agoose77 I think I've addressed your latest comments as well. Let me know if you're happy or if you have anything else to suggest.

@ikrommyd
Collaborator Author

ikrommyd commented Mar 7, 2025

@ianna @pfackeldey please let me press the button on this one when it's ready. I'd like to run my offline tests one final time before merging once everything is resolved 😄

…to_cudf, to_arrow, to_backendarray
@ianna
Collaborator

ianna commented Mar 7, 2025

@ianna @pfackeldey please let me press the button on this one when it's ready. I'd like to run my offline tests one final time before merging once everything is resolved 😄

Let's wait for @agoose77 to approve it first :-)

@ikrommyd
Collaborator Author

ikrommyd commented Mar 7, 2025

I just realised that we cannot have a touching .raw() because that doesn't even let us cast layouts to typetracer without materializing. It also materializes when calling ak.to_buffers instead of returning virtual buffers. Therefore I'm attempting to make to_arrow, to_cudf and to_backend_array work without a materializing raw here: ae7f42b. I need to re-check in all the layouts that materialize_if_virtual is in the right places.

This should deal with #3414 as well.

@ikrommyd
Collaborator Author

ikrommyd commented Mar 9, 2025

@pfackeldey @agoose77 @ianna please review the changes of my last two commits. I basically realized that a materializing .raw is a bad idea because it breaks things like _to_backend and _to_buffers and will cause materialization all the time there. Therefore I reset the .raw() implementations to main and added a materialize_if_virtual in the places where the output of .raw() should be materialized. This is in things like _to_arrow, _to_list, _to_cudf etc. Basically wherever we wanna cast the awkward.Array to something else.
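The design choice described here can be sketched as follows: keep the raw accessor non-materializing and load data only at the boundary where the array is converted to something else. All classes and method bodies below are hypothetical simplifications, and the `materialize_if_virtual` signature is an assumption.

```python
class VirtualBuffer:
    """Hypothetical lazy buffer; counts how often it is loaded."""

    def __init__(self, generator):
        self._generator = generator
        self.loads = 0

    def materialize(self):
        self.loads += 1
        return self._generator()


def materialize_if_virtual(buffer):
    # Assumed signature, for illustration only.
    return buffer.materialize() if isinstance(buffer, VirtualBuffer) else buffer


class Node:
    """Hypothetical layout node wrapping a single buffer."""

    def __init__(self, data):
        self._data = data

    def raw(self):
        # Non-materializing: hands back whatever is stored, virtual or not.
        return self._data

    def to_buffers(self):
        # Structural export: virtual buffers pass through untouched.
        return {"node0-data": self.raw()}

    def to_list(self):
        # Conversion out of the library: this is where data must be real.
        return list(materialize_if_virtual(self.raw()))


buffer = VirtualBuffer(lambda: [1, 2, 3])
node = Node(buffer)
exported = node.to_buffers()
loads_after_export = buffer.loads
as_list = node.to_list()
```

With a materializing `raw()`, the export step in this sketch would already have triggered a load; keeping it passive means only `to_list` does.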

@ikrommyd
Collaborator Author

@agoose77 A gentle ping on this :)

@ianna ianna requested a review from agoose77 March 13, 2025 14:26
@lgray
Contributor

lgray commented Mar 17, 2025

@agoose77 We'd like to get started with porting a bunch of users from old awkward1 to awkward2 with virtual arrays (everyone using coffea 0.7, basically). It would be great if you could review this sooner rather than later. :-)

@agoose77
Collaborator

Hi all - I will be sure to look at this, likely tomorrow. I am off today dealing with a personal emergency, and appreciate everyone's patience.

Collaborator

@agoose77 agoose77 left a comment

Everyone here has clearly put a lot of time into reviewing this, and I appreciate that has taken more work with me being ad-hoc.

Personally, this feels like a big change, and normally we'd want to slowly move towards it. However, we realistically don't have the kind of capacity that we used to. My only major concern is whether this feature will make Awkward harder to maintain in future (such as the ability to reason about internal code). Especially if we can't remove this as a misstep later on down the road. I have also shared my concerns that virtual arrays definitely have their uses, but we should be careful that they don't act as a band-aid over "doing this properly" with typetracer and dask.

Having raised those concerns, I feel uncomfortable gate-keeping this feature; it clearly has well-motivated reasons, and I'm not a full-time Awkward maintainer any more.

The extent of my review has mainly been line-diffs, with some additional conversation around wider contexts. I have not done extensive testing to explore the interactions of this with other parts of Awkward; I don't have the kind of time for that at the moment.

Given all of the changes and the fact that @ikrommyd is very much testing this "in the wild" with coffea, I'm comfortable approving this given these caveats.

Big congratulations to @ikrommyd, @pfackeldey, and @ianna for getting this over the line 🥳

@ikrommyd
Collaborator Author

Thank you very much for your feedback and your approval, @agoose77! It has been very valuable and I do understand your concerns.

I'd like to mention that we're not trying to band-aid over dask-awkward. We're trying to offer another solution as a stepping stone, as collaborations don't see dask-awkward as mature enough yet to serve their full analysis needs and 99% of them are still using awkward1 (through coffea 0.7) for that reason. We want to offer people the opportunity to transition seamlessly to awkward2 (and coffea 2025) and not use outdated and unmaintained packages, while at the same time trying to improve dask-awkward and dask-histogram. Also, virtual arrays can help us debug problems regarding tracing and also prevent us from hitting placeholder arrays at runtime when tracing goes wrong.

Regarding your concerns about how this interacts with other parts of awkward, I'd like to point out that there is a lot of testing in that direction. Apart from me running coffea analyses (like the AGC) with this, there is also this PR: ikrommyd#1 which runs the full awkward test suite by creating a virtual array inside every Index and every NumpyArray and materializing it on the spot. The whole CI passes. There is also this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize which does the same without materializing on the spot and 99% of the tests pass apart from some repr tests.

All in all, I think the interaction with other parts of awkward is pretty well tested, and of course we didn't break any existing tests either.

@pfackeldey
Collaborator

Thanks @agoose77 for your amazing review, it really shows your excellent understanding of the awkward codebase - you caught many things that I, for one, would not have thought about! That really increased the quality of this contribution, really big thanks for your time and help 🙏

I share your concern a bit about the maintainability. It feels like the nplikes / backends play a huge role in awkward now. VirtualArrays also break the standard array interface a bit, e.g. applying a ufunc to a VirtualArray returns a different type than VirtualArray.
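The type change mentioned above is a general effect that can be reproduced with plain NumPy. The sketch below is not how VirtualArray is implemented; `LazyBuffer` is a hypothetical class that only exposes NumPy's `__array__` protocol, which is enough for a ufunc to coerce it into a regular ndarray.

```python
import numpy as np


class LazyBuffer:
    """Hypothetical lazy buffer exposing only NumPy's __array__ protocol."""

    def __init__(self, generator):
        self._generator = generator

    def __array__(self, dtype=None, copy=None):
        data = np.asarray(self._generator())
        return data if dtype is None else data.astype(dtype)


lazy = LazyBuffer(lambda: np.array([1, 2, 3]))
result = np.add(lazy, 1)   # NumPy coerces `lazy` to an ndarray first
```

Here `result` is an ordinary `numpy.ndarray`, so the laziness (and the original type) is lost as soon as the ufunc runs.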

I think it would be useful to brainstorm a bit about the nplike/backend interface and start by re-defining all interfaces with e.g. Protocols and potentially open a way to externally provide new nplikes/backends to awkward. I could see benefit in this as there are more potentially interesting backends to be explored, e.g. python-blosc2 / zarr. These things could then live outside of awkward as contrib/experimental features, while keeping awkward itself more maintainable and more clean.

What do you think about this @agoose77 ?

@ianna ianna merged commit ff30a26 into scikit-hep:main Mar 19, 2025
43 checks passed
@lgray
Contributor

lgray commented Mar 19, 2025

If there could be an RC of awkward including this I'd like to test it with the coffea virtual arrays branch.

@ikrommyd
Collaborator Author

ikrommyd commented Mar 19, 2025

If there could be an RC of awkward including this I'd like to test it with the coffea virtual arrays branch.

I think @ianna wanted to do 2.8.0 immediately right? As long as we're sure we didn't break anything.


5 participants