CI: Run some tests with compute-sanitizer #566

carterbox · 2025-04-22T18:06:39Z

Description

Runs the Python 3.12 pytests in the context of compute-sanitizer to check for memory issues and errors from the CUDA API.

closes #565
closes #562

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-04-22T18:06:43Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

leofang · 2025-04-22T19:24:56Z

FYI right now we use the mini-CTK approach in the CI:

local CTK: we only grab what we need with a custom GHA

cuda-python/.github/actions/fetch_ctk/action.yml

Lines 85 to 92 in d425a88

    
    populate_cuda_path cuda_nvcc
    populate_cuda_path cuda_cudart
    populate_cuda_path cuda_nvrtc
    populate_cuda_path cuda_profiler_api
    populate_cuda_path cuda_cccl
    if [[ "$(cut -d '.' -f 1 <<< ${{ inputs.cuda-version }})" -ge 12 ]]; then
      populate_cuda_path libnvjitlink
    fi

wheel: same is done via the optional dependencies

cuda-python/.github/workflows/test-wheel-linux.yml

Lines 218 to 222 in d425a88

    
    if [[ "${{ inputs.local-ctk }}" == 1 ]]; then
      pip install *.whl
    else
      pip install $(ls *.whl)[all]
    fi

conda: not yet implemented (CI: Test conda-based workflows #280) but should be straightforward as you know

So compute-sanitizer is currently not available in the CI. But I assume it can be grabbed easily.

@cryos is refactoring our CI (#555). I suggest we perhaps add another standalone pipeline for running compute sanitizer?

cryos · 2025-04-22T19:58:28Z

I was going to look at tools like this next; that is a great point and something I can factor in. Looking at the proposal here, picking out a single test run would be reasonable. I know there are other tools we would like to run too.

carterbox · 2025-04-22T22:16:15Z

/ok to test f27e1f4

leofang · 2025-04-22T22:25:46Z

I doubt commit f27e1f4 would work -- we'll see: #571.

github-actions · 2025-04-22T22:34:54Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-566/
https://nvidia.github.io/cuda-python/pr-preview/pr-566/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-566/cuda-bindings/
Preview will be ready when the GitHub Pages deployment is complete.

carterbox · 2025-04-23T17:34:24Z

/ok to test 05a7068

carterbox · 2025-04-23T21:15:16Z

/ok to test dde857b

carterbox · 2025-04-23T21:57:10Z

Yay! The linux-64 tests are failing for the correct reason! (the compute sanitizer returns non-zero because it has detected issues).

https://github.com/NVIDIA/cuda-python/actions/runs/14628267586/job/41045781037?pr=566

Windows tests are failing because I have disabled them partially.

carterbox · 2025-04-24T18:26:05Z

/ok to test a1ea51e

There is no compute-sanitizer wheel, so we can only run it when the CTK is installed system-wide.

Because the sanitizer command depends on the sanitizer version, we need to be able to run the sanitizer in order to set the sanitizer command. Thus, we need to set up the sanitizer after it is installed.
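The ordering constraint above (install first, then query the version, then assemble the command) could be sketched roughly as follows. This is an illustration only, not the workflow's actual code: `parse_sanitizer_version` and `build_sanitizer_cmd` are hypothetical names, and reducing the version to an integer by dropping the dots is an assumption. The only flags used are `--error-exitcode` and `--padding`.

```python
import re


def parse_sanitizer_version(version_output: str) -> int:
    """Reduce text such as 'Version 2023.2.0 (build ...)' to an integer like 202320.

    Assumption: the dotted version is made comparable by dropping the dots.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return int("".join(match.groups()))


def build_sanitizer_cmd(version: int) -> list[str]:
    """Assemble the sanitizer command prefix for a given (integer) version."""
    # A non-zero exit code makes the CI job fail when the sanitizer reports issues.
    cmd = ["compute-sanitizer", "--error-exitcode=1"]
    # Only pass --padding when the installed sanitizer is new enough.
    if version >= 202111:
        cmd.append("--padding=32")
    return cmd
```

In a CI step, the text passed to `parse_sanitizer_version` would come from running the installed sanitizer with `--version`, which is why the setup has to happen after the CTK is in place.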

carterbox · 2025-04-24T20:28:24Z

/ok to test 0430930

leofang

Leaving some quick comments

cuda_core/tests/test_event.py

.github/workflows/test-wheel-linux.yml

carterbox · 2025-04-25T19:24:44Z

Whatever changes to cuda.bindings prevented runtime.cudaGetDevice() from raising CUDA_ERROR_INVALID_CONTEXT when there is no existing device context need to be backported.

carterbox · 2025-04-25T20:27:12Z

/ok to test 9c25910

carterbox · 2025-04-25T21:47:34Z

/ok to test 7fb013e

copy-pr-bot · 2025-04-25T22:48:23Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

leofang · 2025-04-26T03:29:26Z

Let's get #562 (comment) addressed and merge!

leofang · 2025-04-26T03:25:32Z

cuda_bindings/docs/source/environment_variables.md

+
+## Test-Time Environment Variables
+
+- `CUDA_PYTHON_SANTIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.


Suggested change

-- `CUDA_PYTHON_SANTIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.
+- `CUDA_PYTHON_SANITIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.

.github/workflows/test-wheel-linux.yml

rwgk · 2025-04-26T18:00:22Z

.github/workflows/test-wheel-linux.yml

+            if [[ "$COMPUTE_SANITIZER_VERSION" -ge 202111 ]]; then
+                SANITIZER_CMD="${SANITIZER_CMD} --padding=32"
+            fi
+            echo "CUDA_PYTHON_SANITIZER_RUNNING=1" >> $GITHUB_ENV


Do we really need the CUDA_PYTHON_SANITIZER_RUNNING variable?

I see below you're doing this:

echo "COMPUTE_SANITIZER_VERSION=${COMPUTE_SANITIZER_VERSION}" >> $GITHUB_ENV

Could we just query that in the tests? E.g. for most cases:

    @pytest.mark.skipif(
        os.environ.get("COMPUTE_SANITIZER_VERSION") is not None,
        reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
    )

Maybe later we'll have cases that want to look at the version number, then we wouldn't need anything new or special.
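A version-aware check along these lines might look like the sketch below. It assumes `COMPUTE_SANITIZER_VERSION` holds a plain integer string (as the `-ge 202111` comparison in the workflow suggests); `sanitizer_version` and `sanitizer_at_least` are hypothetical helper names, not part of the PR.

```python
import os
from typing import Mapping, Optional


def sanitizer_version(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the sanitizer version as an int, or None if it is unset or malformed."""
    raw = env.get("COMPUTE_SANITIZER_VERSION")
    if raw is None or not raw.strip().isdigit():
        # Unset (or not an integer string) is treated as "sanitizer not running".
        return None
    return int(raw)


def sanitizer_at_least(minimum: int, env: Mapping[str, str] = os.environ) -> bool:
    """True only when the sanitizer is running and its version is >= minimum."""
    version = sanitizer_version(env)
    return version is not None and version >= minimum
```

A test could then pass `sanitizer_version() is not None` as the `skipif` condition for the common case, and `sanitizer_at_least(...)` where the behaviour depends on the sanitizer release.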

rwgk · 2025-04-26T18:03:08Z

cuda_bindings/tests/test_cuda.py

@@ -83,6 +84,10 @@ def test_cuda_memcpy():
    assert err == cuda.CUresult.CUDA_SUCCESS


+@pytest.mark.skipif(
+    os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1",
+    reason="The compute-sanitzer is running, and this test intentionally causes an API error.",


typo: sanitzer → sanitizer

(this typo has 10 copies)

We could reduce the copy-paste via a helper in tests/conftest.py (we don't have that yet, but good to start one):

    skipif_compute_sanitizer_is_running = pytest.mark.skipif(
        os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1",
        reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
    )

Then here it would become:

    @skipif_compute_sanitizer_is_running
    def test_cuda_array():

Maybe a refinement:

    COMPUTE_SANITIZER_IS_RUNNING = os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1"
    skipif_compute_sanitizer_is_running = pytest.mark.skipif(
        COMPUTE_SANITIZER_IS_RUNNING,
        reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
    )

Then further down (test_timing) you could use COMPUTE_SANITIZER_IS_RUNNING instead of spelling out the os.environ code again.

leofang requested review from leofang and cryos April 22, 2025 19:25

leofang assigned carterbox Apr 22, 2025

leofang added the P1 (Medium priority - Should do) and CI/CD (CI/CD infrastructure) labels Apr 22, 2025

cryos linked an issue Apr 23, 2025 that may be closed by this pull request

CI: Refactor the CI matrix #576

Open

cryos removed a link to an issue Apr 23, 2025

CI: Refactor the CI matrix #576

Open

carterbox added 10 commits April 24, 2025 15:25

CI: Add compute-sanitizer paths to linux test environment

843130f

CI: Run pytest in the context of compute-sanitizer

c8df0dc

CI: Only use compute-sanitizer for tests of one python version

4ac79d0

CI: Add non-zero exitcode to compute-sanitizer

be912c9

CI: Only run compute-sanitzer when testing against local ctk

865eeb5

There is no compute-sanitizer wheel, so we can only run it when the CTK is installed system-wide.

CI: Delay CUDA_HOME variable expansion

ddd0714

CI: Move compute-sanitzer setup into own step after CTK setup

5079605

Because the sanitizer command depends on the sanitizer version, we need to be able to run the sanitizer in order to set the sanitizer command. Thus, we need to set up the sanitizer after it is installed.

CI: Optionally skip tests that raise CUDA API errors

4a03465

CI: Add sanitizer skip environment variable to CI

bd88039

DOC: Fix spelling of CI step name

0430930

carterbox force-pushed the dching/add-compute-sanitizer-to-ci branch from a1ea51e to 0430930 Compare April 24, 2025 20:26

leofang added the cuda.bindings Everything related to the cuda.bindings module label Apr 25, 2025

leofang added the cuda.core Everything related to the cuda.core module label Apr 25, 2025

leofang reviewed Apr 25, 2025

cuda_core/tests/test_event.py (resolved)

.github/workflows/test-wheel-linux.yml (resolved)

.github/workflows/test-wheel-linux.yml (outdated, resolved)

carterbox added 4 commits April 25, 2025 11:23

CI: Skip state failure test when running sanitizer

95c3914

CI: Skip linker error log test when sanitizer is running

b091340

CI: Add note explaining test skip

10d4c7a

DOC: Document CUDA_PYTHON_SANITIZER_RUNNING

bdfb547

CI: Skip compute-sanitizer on CTK11

9c25910

BUG: Correctly spell "version"

7fb013e

carterbox marked this pull request as ready for review April 25, 2025 22:48

carterbox requested a review from leofang April 25, 2025 22:48

leofang reviewed Apr 26, 2025

rwgk reviewed Apr 26, 2025
