-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement virtual arrays #3364
Conversation
e76182b
to
3b9ad2f
Compare
…uffers working in a notebook
1eb5c3f
to
166fcb0
Compare
@agoose77 I think I've addressed your latest comments as well. Let me know if you're happy or if you have anything else to suggest. |
@ianna @pfackeldey please let me press the button on this one when it's ready. I'd like to run my offline tests one final time before merging once everything is resolved 😄 |
…to_cudf, to_arrow, to_backendarray
Let's wait for @agoose77 to approve it first :-) |
I just realised that we cannot have a touching This should deal with #3414 as well. |
@pfackeldey @agoose77 @ianna please review the changes of my last two commits. I basically realized that a materializing |
@agoose77 A gentle ping on this :) |
@agoose77 We'd like to get started with porting a bunch of users from old awkward1 to awkward2 with virtual arrays (everyone using coffea 0.7, basically). It would be great if you could review this sooner rather than later. :-) |
Hi all - I will be sure to look at this, likely tomorrow. I am off today dealing with a personal emergency, and appreciate everyone's patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everyone here has clearly put a lot of time into reviewing this, and I appreciate that has taken more work with me being ad-hoc.
Personally, this feels like a big change, and normally we'd want to slowly move towards it. However, we realistically don't have the kind of capacity that we used to. My only major concern is whether this feature will make Awkward harder to maintain in future (such as the ability to reason about internal code). Especially if we can't remove this as a misstep later on down the road. I have also shared my concerns that virtual arrays definitely have their uses, but we should be careful that they don't act as a band-aid over "doing this properly" with typetracer and dask.
Having raised those concerns, I feel uncomfortable gate-keeping this feature; it clearly has well-motivated reasons, and I'm not a full-time Awkward maintainer any more.
The extent of my review has mainly been line-diffs, with some additional conversation around wider contexts. I have not done extensive testing to explore the interactions of this with other parts of Awkward; I don't have the kind of time for that at the moment.
Given all of the changes and the fact that @ikrommyd is very much testing this "in the wild" with coffea, I'm comfortable approving this given these caveats.
Big congratulations to @ikrommyd, @pfackeldey, and @ianna for getting this over the line 🥳
Thank you very much for your feedback and your approval @agoose77!. It has been very valuable and I do understand your concerns. I'd like to mention that we're not trying to band-aid over dask-awkward. We're trying to offer another solution as a stepping stone as collaborations don't see dask-awkward as mature enough yet to serve their full analyses needs and 99% of them are still using awkward1 (through coffea 0.7) for that reason. We want to offer people the opportunity to transition seamlessly to awkward2 (and coffea 2025) and not use outdated and unmaintained packages while at the same time trying to improve dask-awkward and dask-histogram. Also, virtual arrays can help us debug problems regarding tracing and also prevent us from hitting placeholder arrays at runtime when tracing goes wrong. Regarding your concerns about how this interacts with other parts of awkward, I'd like to point out that there is a lot of testing in that direction. Apart from me running coffea analyses (like the AGC) with this, there is also this PR: ikrommyd#1 which runs the full awkward test suite by creating a virtual array inside every All in all I think the interaction with other parts of awkward is pretty well tested and of course we didn't break any existing tests as well. |
Thanks @agoose77 for your amazing review, it really shows your excellent understanding of the awkward codebase - you caught many things where e.g. I would not have thought about! That really increased the quality of this contribution, really big thanks for your time and help 🙏 I share your concern a bit about the maintainability. It feels like the nplikes / backends become a huge role now in awkward. I think it would be useful to brainstorm a bit about the nplike/backend interface and start by re-defining all interfaces with e.g. Protocols and potentially open a way to externally provide new nplikes/backends to awkward. I could see benefit in this as there are more potentially interesting backends to be explored, e.g. python-blosc2 / zarr. These things could then live outside of awkward as contrib/experimental features, while keeping awkward itself more maintainable and more clean. What do you think about this @agoose77 ? |
If there could be an RC of awkward including this I'd like to test it with the coffea virtual arrays branch. |
I think @ianna wanted to do 2.8.0 immediately right? As long as we're sure we didn't break anything. |
Implement Virtual Buffers in Awkward Array
This PR introduces the
VirtualArray
class, which enables virtual buffers at the lowest level of an Awkward Array's structure. Awkward Arrays are fundamentally tree-like, with data stored in 1D array buffers (e.g., NumPy, CuPy, or JAX). TheVirtualArray
class acts as a virtual buffer, allowing deferred materialization of data when needed.Key Features
VirtualArray
requires a knowndtype
andshape
upon instantiation, along with a generator function responsible for producing the actual buffer data. This generator function is designed for use cases where materialization is expensive, such as disk reads.VirtualArray
is tied to a specificnplike
(NumPy or CuPy). Switching between differentnplike
backends is not allowed unless the array is first materialized.Adjustments to the Codebase
array_module
ensure that certain operations remain virtual when possible while others materialize data when necessary.ak.to_*
are updated to ensure that virtual arrays are materialized before conversion.This should be reviewed in combination with ikrommyd#1 which runs the whole awkward ci by creating a virtual array inside every
NumpyArray
and everyIndex
and materializing them on the spot and also in combination with this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize where most of the tests pass without even needing to materialize inside theNumpyArray
. You can compare this branch with the branch from ikrommyd#1 for the changes.