Improve #449: Improve StridedMemoryView creation time #838


Merged
mdboom merged 3 commits into NVIDIA:main from issue449 on Aug 15, 2025

Conversation

Contributor

@mdboom mdboom commented Aug 14, 2025

Two changes:

  1. Refactor the versioned/non-versioned paths to reduce the number of branches.
  2. Create shape and strides tuples using the Python/C API.

The second change has a significant impact:

  • Before: 1.23 us +- 0.06 us
  • After: 1.02 us +- 0.06 us
Measured using this pyperf benchmark, based on the one in #449:
import pyperf
import time


import cupy as cp
from cuda.core.experimental.utils import StridedMemoryView


inner_loops = 100_000


def bench_strided_memory_view(loops, x):
    range_it = range(loops * inner_loops)
    t0 = time.perf_counter()

    for _ in range_it:
        s = StridedMemoryView(x, -1)

    return time.perf_counter() - t0


runner = pyperf.Runner()
x = cp.empty((23, 4))
runner.bench_time_func('StridedMemoryView', bench_strided_memory_view, x, inner_loops=inner_loops)

Description

Improves #449

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor

copy-pr-bot bot commented Aug 14, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@mdboom mdboom requested review from leofang and Copilot August 14, 2025 19:58

@Copilot Copilot AI left a comment


Pull Request Overview

This PR improves the performance of StridedMemoryView creation by optimizing the tuple creation process and refactoring conditional logic. The changes achieve a ~17% performance improvement by using Python/C API calls instead of Python comprehensions for creating shape and strides tuples.

  • Refactored versioned/non-versioned DLPack path handling to reduce branching complexity
  • Replaced Python tuple comprehensions with direct Python/C API calls for creating shape and strides tuples
  • Optimized stride conversion logic in the CAI (CUDA Array Interface) path


kkraus14
kkraus14 previously approved these changes Aug 14, 2025
@github-project-automation github-project-automation bot moved this from Todo to In Review in CCCL Aug 14, 2025
@kkraus14
Collaborator

The changes here LGTM! To trigger CI you need to comment "/ok to test", see here for more info: https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/faqs/

Most people are using the StridedMemoryView class as a way to generically handle input arrays that support DLPack and/or __cuda_array_interface__ right before kernel launches, which typically means interfacing with some underlying C++ code. Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.
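[Editor's note] A lazy-materialization approach like the one suggested here could look roughly as follows in pure Python. This is a sketch only: the class name `LazyView` and the plain-list stand-in for the C-level shape array are hypothetical, not part of cuda.core.

```python
# Sketch of lazy attribute materialization: cheap C-level state is stored
# eagerly, and the Python-level objects are built only on first access.
class LazyView:
    def __init__(self, c_shape):
        # In Cython this would be an int64_t* borrowed from the DLTensor;
        # a plain list stands in for it here.
        self._c_shape = c_shape
        self._shape = None  # Python tuple, built on demand

    @property
    def shape(self):
        # First access pays the tuple-construction cost; later accesses
        # return the cached tuple.
        if self._shape is None:
            self._shape = tuple(int(s) for s in self._c_shape)
        return self._shape


view = LazyView([23, 4])
assert view.shape == (23, 4)
assert view.shape is view.shape  # cached after first access
```

In Cython, the same idea could store the raw pointer in a cdef attribute and expose a property, so no Python objects are created until an attribute is actually read.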

@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@leofang
Member

leofang commented Aug 15, 2025

Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.

Yes, I have been thinking these attributes should be populated lazily, which is possible because we still hold a reference to the dlpack capsule. @mdboom would you like to address this in a follow-up PR?

@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels Aug 15, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Aug 15, 2025



cdef StridedMemoryView buf = StridedMemoryView() if view is None else view
buf.ptr = <intptr_t>(dl_tensor.data)
buf.shape = tuple(int(dl_tensor.shape[i]) for i in range(dl_tensor.ndim))

# Construct shape and strides tuples using the Python/C API for speed
Member

Q: Does this mean Cython is not generating efficient code for us? Could it be that the generator expression I was using to construct the tuple is a form that Cython handles poorly?

Contributor Author

Yeah, this is pretty inefficient. The generator causes Cython to emit a separate C function that is then called through the Python iterator machinery, which is much less efficient than a plain C for loop (mainly because it makes a bunch of C function calls through function pointers). Second, since the length of the tuple isn't known in advance, the tuple is realloc'ed at least three times:

  • initial size of 0
  • overallocated to 10
  • truncated to the exact length of 2

It might be possible to write a cdef function that converts a C array pointer plus a known length into a tuple in this faster way, which would hide this use of the CPython API here. I'll look at that and measure it.
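[Editor's note] The sizing issue can be seen from pure Python as well: `tuple(<generator>)` cannot know the result length up front, whereas building from a list (or, in C, `PyTuple_New` with a known length) sizes the result exactly. A small illustration (timings are machine-dependent, so only the results are checked):

```python
import timeit

shape = [23, 4]  # stand-in for dl_tensor.shape

def from_generator():
    # Length unknown up front: CPython grows and then shrinks the result.
    return tuple(int(s) for s in shape)

def from_list():
    # The intermediate list has a known length, so the tuple is
    # allocated at its final size in one step.
    return tuple([int(s) for s in shape])

assert from_generator() == from_list() == (23, 4)

print("generator:", timeit.timeit(from_generator, number=100_000))
print("list:     ", timeit.timeit(from_list, number=100_000))
```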

Contributor Author

Abstracting this out to a cdef inline function seems to have no measurable overhead and definitely makes this more readable, so I've updated this PR to do that.

# Construct shape and strides tuples using the Python/C API for speed
buf.shape = cpython.PyTuple_New(dl_tensor.ndim)
for i in range(dl_tensor.ndim):
    cpython.PyTuple_SET_ITEM(buf.shape, i, cpython.PyLong_FromLong(dl_tensor.shape[i]))
Member

@leofang leofang Aug 15, 2025


I don't think we can use PyLong_FromLong, because both shape and strides are int64_t:
https://github.com/dmlc/dlpack/blob/7f393bbb86a0ddd71fde3e700fc2affa5cdce72d/include/dlpack/dlpack.h#L257-L263
We probably need to replace it with PyLong_FromLongLong.

Member


Another question for my understanding: both PyTuple_SET_ITEM and PyTuple_SetItem steal the reference to the object o (the Python int that we convert from the DLPack shape/stride members). Doesn't that mean we should increment its refcount before calling the setter?

Contributor Author


Good point about using LongLong. I will update.

As for reference counting, we don't need to increment here because we are "giving" the reference to the int to the tuple, and we don't need to hold on to the int ourselves to do anything else with it.

Member

@leofang leofang Aug 15, 2025


I see, so either constructor would return the object with a refcount of 1 (for some reason I was thinking it was 0), and ownership of that reference is transferred to the tuple.

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Aug 15, 2025
@mdboom
Contributor Author

mdboom commented Aug 15, 2025

Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.

Yes, I have been thinking these attributes should be populated lazily, which is possible because we still hold a reference to the dlpack capsule. @mdboom would you like to address this in a follow-up PR?

Yeah, that makes sense. I'll do that in a separate PR and keep this just to the tuple-construction improvements.

@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@mdboom mdboom requested a review from leofang August 15, 2025 15:32
@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Aug 15, 2025
@mdboom mdboom merged commit 8f1dd40 into NVIDIA:main Aug 15, 2025
48 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Aug 15, 2025
@mdboom mdboom deleted the issue449 branch August 15, 2025 16:56

Doc Preview CI
Preview removed because the pull request was closed or merged.
