Improve #449: Improve StridedMemoryView creation time #838


Merged
mdboom merged 3 commits into NVIDIA:main from issue449 on Aug 15, 2025

Conversation

Contributor

@mdboom mdboom commented Aug 14, 2025

Two changes:

  1. Refactor the versioned/non-versioned paths to reduce the number of branches.
  2. Create shape and strides tuples using the Python/C API.

The second change has a significant impact:

  • Before: 1.23 us +- 0.06 us
  • After: 1.02 us +- 0.06 us
Measured using this pyperf benchmark, based on the one in #449:
import pyperf
import time


import cupy as cp
from cuda.core.experimental.utils import StridedMemoryView


inner_loops = 100_000


def bench_strided_memory_view(loops, x):
    range_it = range(loops * inner_loops)
    t0 = time.perf_counter()

    for _ in range_it:
        s = StridedMemoryView(x, -1)

    return time.perf_counter() - t0


runner = pyperf.Runner()
x = cp.empty((23, 4))
runner.bench_time_func('StridedMemoryView', bench_strided_memory_view, x, inner_loops=inner_loops)

Description

Improves #449

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor

copy-pr-bot bot commented Aug 14, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@mdboom mdboom requested review from leofang and Copilot August 14, 2025 19:58

@Copilot Copilot AI left a comment


Pull Request Overview

This PR improves the performance of StridedMemoryView creation by optimizing the tuple creation process and refactoring conditional logic. The changes achieve a ~17% performance improvement by using Python/C API calls instead of Python comprehensions for creating shape and strides tuples.

  • Refactored versioned/non-versioned DLPack path handling to reduce branching complexity
  • Replaced Python tuple comprehensions with direct Python/C API calls for creating shape and strides tuples
  • Optimized stride conversion logic in the CAI (CUDA Array Interface) path


kkraus14
kkraus14 previously approved these changes Aug 14, 2025
@github-project-automation github-project-automation bot moved this from Todo to In Review in CCCL Aug 14, 2025
@kkraus14
Collaborator

The changes here LGTM! To trigger CI you need to comment "/ok to test", see here for more info: https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/faqs/

Most people are using the StridedMemoryView class as a way to generically handle input arrays that support DLPack and/or __cuda_array_interface__ right before kernel launches, which typically means interfacing with some underlying C++ code. Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.
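[Editor's note] A lazy-materialization approach like the one suggested here could look roughly as follows in pure Python. This is a sketch only: the class name `LazyView` and the plain-list stand-in for the C-level shape array are hypothetical, not part of cuda.core.

```python
# Sketch of lazy attribute materialization: cheap C-level state is stored
# eagerly, and the Python-level objects are built only on first access.
class LazyView:
    def __init__(self, c_shape):
        # In Cython this would be an int64_t* borrowed from the DLTensor;
        # a plain list stands in for it here.
        self._c_shape = c_shape
        self._shape = None  # Python tuple, built on demand

    @property
    def shape(self):
        # First access pays the tuple-construction cost; later accesses
        # return the cached tuple.
        if self._shape is None:
            self._shape = tuple(int(s) for s in self._c_shape)
        return self._shape


view = LazyView([23, 4])
assert view.shape == (23, 4)
assert view.shape is view.shape  # cached after first access
```

In Cython, the same idea could store the raw pointer in a cdef attribute and expose a property, so no Python objects are created until an attribute is actually read.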

@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@leofang
Member

leofang commented Aug 15, 2025

Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.

Yes, I have been thinking these attributes should be populated lazily, which is possible because we still hold a reference to the dlpack capsule. @mdboom would you like to address this in a follow-up PR?

@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels Aug 15, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Aug 15, 2025



cdef StridedMemoryView buf = StridedMemoryView() if view is None else view
buf.ptr = <intptr_t>(dl_tensor.data)
buf.shape = tuple(int(dl_tensor.shape[i]) for i in range(dl_tensor.ndim))

# Construct shape and strides tuples using the Python/C API for speed
Member

Q: Does this mean Cython is not generating efficient code for us? Could it be that the generator expression I was using to construct the tuple is a form that Cython handles poorly?

Contributor Author

Yeah, this is pretty inefficient. The generator causes Cython to emit a separate C function that is then called through the Python iterator machinery, which is much less efficient than a plain C for loop (mainly because it makes a bunch of C function calls through function pointers). Second, since the length of the tuple isn't known in advance, the tuple is realloc'ed at least three times:

  • initial size of 0
  • overallocated to 10
  • truncated to the exact length of 2

It might be possible to write a cdef function that converts a C array pointer plus a known length into a tuple in this faster way, which would hide this use of the CPython API here. I'll look at that and measure it.
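[Editor's note] The sizing issue can be seen from pure Python as well: `tuple(<generator>)` cannot know the result length up front, whereas building from a list (or, in C, `PyTuple_New` with a known length) sizes the result exactly. A small illustration (timings are machine-dependent, so only the results are checked):

```python
import timeit

shape = [23, 4]  # stand-in for dl_tensor.shape

def from_generator():
    # Length unknown up front: CPython grows and then shrinks the result.
    return tuple(int(s) for s in shape)

def from_list():
    # The intermediate list has a known length, so the tuple is
    # allocated at its final size in one step.
    return tuple([int(s) for s in shape])

assert from_generator() == from_list() == (23, 4)

print("generator:", timeit.timeit(from_generator, number=100_000))
print("list:     ", timeit.timeit(from_list, number=100_000))
```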

Contributor Author

Abstracting this out to a cdef inline function seems to have no measurable overhead and definitely makes this more readable, so I've updated this PR to do that.

# Construct shape and strides tuples using the Python/C API for speed
buf.shape = cpython.PyTuple_New(dl_tensor.ndim)
for i in range(dl_tensor.ndim):
    cpython.PyTuple_SET_ITEM(buf.shape, i, cpython.PyLong_FromLong(dl_tensor.shape[i]))
Member

@leofang leofang Aug 15, 2025


I don't think we can use PyLong_FromLong, because both shape and strides are int64_t:
https://github.com/dmlc/dlpack/blob/7f393bbb86a0ddd71fde3e700fc2affa5cdce72d/include/dlpack/dlpack.h#L257-L263
We probably need to replace it with PyLong_FromLongLong.

Member


Another question for my understanding: both PyTuple_SET_ITEM and PyTuple_SetItem steal the reference to the object o (the Python int that we convert from the DLPack shape/stride members). Doesn't that mean we should increment its refcount before calling the setter?

Contributor Author


Good point about using LongLong. I will update.

As for reference counting, we don't need to increment here because we are "giving" the reference to the int to the tuple, and we don't need to hold on to the int ourselves to do anything else with it.

Member

@leofang leofang Aug 15, 2025


I see, so either constructor would return the object with a refcount of 1 (for some reason I was thinking it was 0), and ownership of that reference is transferred to the tuple.

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Aug 15, 2025
@mdboom
Contributor Author

mdboom commented Aug 15, 2025

Currently, when we initialize the class, we eagerly populate Python attributes for things like the ptr, shape, strides, dtype, etc. There's a non-trivial cost to this, so I wonder if we should instead have C types and then lazily initialize the Python types as requested.

Yes, I have been thinking these attributes should be populated lazily, which is possible because we still hold a reference to the dlpack capsule. @mdboom would you like to address this in a follow-up PR?

Yeah, that makes sense. I'll do that in a separate PR and keep this just to the tuple-construction improvements.

@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@mdboom mdboom requested a review from leofang August 15, 2025 15:32
@mdboom
Contributor Author

mdboom commented Aug 15, 2025

/ok to test

@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Aug 15, 2025
@mdboom mdboom merged commit 8f1dd40 into NVIDIA:main Aug 15, 2025
48 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Aug 15, 2025
@mdboom mdboom deleted the issue449 branch August 15, 2025 16:56

Doc Preview CI
Preview removed because the pull request was closed or merged.
