Improve #449: Improve StridedMemoryView creation time #838
Pull Request Overview
This PR improves the performance of StridedMemoryView creation by optimizing the tuple creation process and refactoring conditional logic. The changes achieve a ~17% performance improvement by using Python/C API calls instead of Python comprehensions for creating shape and strides tuples.
- Refactored versioned/non-versioned DLPack path handling to reduce branching complexity
- Replaced Python tuple comprehensions with direct Python/C API calls for creating shape and strides tuples
- Optimized stride conversion logic in the CAI (CUDA Array Interface) path
Two changes:
1. Refactor the versioned/non-versioned paths to reduce the number of branches.
2. Create shape and strides tuples using the Python/C API.
The changes here LGTM! To trigger CI you need to comment "/ok to test"; see here for more info: https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/faqs/
/ok to test
Yes, I have been thinking these attributes should be populated lazily, which is possible because we still hold a reference to the dlpack capsule. @mdboom would you like to address this in a follow-up PR?
cdef StridedMemoryView buf = StridedMemoryView() if view is None else view
buf.ptr = <intptr_t>(dl_tensor.data)
buf.shape = tuple(int(dl_tensor.shape[i]) for i in range(dl_tensor.ndim))
# Construct shape and strides tuples using the Python/C API for speed
Q: Does this mean Cython is not generating efficient code for us? Could it be that the generator form I used to construct the tuple is syntax Cython handles poorly?
Yeah, this is pretty inefficient. The generator causes Cython to make a separate C function which is then called through the Python iterator machinery, which is a lot less efficient than a C for loop (mainly because it makes a bunch of C function calls through C function pointers). Secondly, since the length of the tuple isn't known in advance, it causes it to be realloc'ed at least 3 times:
- initial size of 0
- overallocate to 10
- truncate to the exact length (here, 2)
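The cost difference can be illustrated loosely from pure Python (the real code converts a C int64_t array inside Cython; the 2-D shape below is a hypothetical example). When the length is not known up front, CPython grows the tuple's storage as items arrive; when it is known, the tuple is allocated once:

```python
import timeit

# Hypothetical 2-D shape, standing in for dl_tensor.shape.
shape = [1024, 768]

# Generator form: length unknown up front, so the tuple storage
# is grown/resized while items are consumed from the iterator.
gen_time = timeit.timeit(lambda: tuple(int(s) for s in shape), number=50_000)

# Sequence form: length known, so the tuple is allocated exactly once.
seq_time = timeit.timeit(lambda: tuple(shape), number=50_000)

# Both forms produce the same result.
assert tuple(int(s) for s in shape) == tuple(shape) == (1024, 768)
print(f"generator: {gen_time:.4f}s  sequence: {seq_time:.4f}s")
```

This only approximates the Cython situation (the generator-closure call overhead described above is specific to Cython's generated C code), but the resizing behaviour is the same CPython machinery.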
It might be possible to write a cdef function to convert a C array pointer + known length into a tuple in this faster way, which would hide this use of the CPython API here. I'll look at that and measure it.
Abstracting this out to a cdef inline function seems to have no measurable overhead and definitely makes this more readable, so I've updated this PR to do that.
# Construct shape and strides tuples using the Python/C API for speed
buf.shape = cpython.PyTuple_New(dl_tensor.ndim)
for i in range(dl_tensor.ndim):
    cpython.PyTuple_SET_ITEM(buf.shape, i, cpython.PyLong_FromLong(dl_tensor.shape[i]))
I don't think we can use PyLong_FromLong, because both shape and strides are int64_t (https://github.com/dmlc/dlpack/blob/7f393bbb86a0ddd71fde3e700fc2affa5cdce72d/include/dlpack/dlpack.h#L257-L263); we probably need to replace it with PyLong_FromLongLong.
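The concern is platforms where C long is only 32 bits (notably 64-bit Windows): converting an int64_t through a 32-bit path would silently keep only the low 32 bits. A small ctypes sketch (using fixed-width types to stand in for a 32-bit long, since ctypes does no overflow checking) shows the truncation:

```python
import ctypes

# A dimension value too large for 32 bits.
dim = 2**40 + 7

# Narrowing to 32 bits silently keeps only the low 32 bits of the value.
assert ctypes.c_int32(dim).value == 7

# A 64-bit integer type (what PyLong_FromLongLong consumes) preserves it.
assert ctypes.c_int64(dim).value == 2**40 + 7
```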
Another question for my understanding: both PyTuple_SET_ITEM and PyTuple_SetItem steal the reference to the object o (the Python int that we convert from the DLPack shape/stride members). Doesn't that mean we should increment its refcount before calling setitem?
Good point about using PyLong_FromLongLong. I will update.
As for reference counting, we don't need to increment here because we are "giving" the reference to the int to the tuple, and we don't need to hold on to the int ourselves to do anything else with it.
I see, so either constructor would return the obj with refcount 1 (for some reason I was thinking it's 0), and the ownership of this ref is transferred to the tuple.
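The same ownership model can be observed from pure Python with sys.getrefcount (a sketch of the concept, not the C-level code in this PR): the tuple holds exactly one reference to each of its items, which at the C level is the reference PyTuple_SET_ITEM "stole" from the caller.

```python
import sys

obj = object()
base = sys.getrefcount(obj)  # includes the temporary ref created by the call itself

t = (obj,)                   # the tuple now owns one reference to obj
assert sys.getrefcount(obj) == base + 1

del t                        # dropping the tuple releases that reference
assert sys.getrefcount(obj) == base
```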
Yeah, that makes sense. I'll do that in a separate PR and keep this just to the tuple-construction improvements.
/ok to test
/ok to test
The second change has a significant impact:
Measured using this pyperf benchmark, based on the one in #449
Improves #449