Skip to content

CI: Run some tests with compute-sanitizer #566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

carterbox
Copy link
Contributor

@carterbox carterbox commented Apr 22, 2025

Description

Runs python 3.12 pytests in the context of compute-sanitizer to check for memory issues and errors from the CUDA API.

closes #565
closes #562

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link
Contributor

copy-pr-bot bot commented Apr 22, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@leofang
Copy link
Member

leofang commented Apr 22, 2025

FYI right now we use the mini-CTK approach in the CI:

So compute-sanitizer is currently not available in the CI. But I assume it can be grabbed easily.

@cryos is refactoring our CI (#555). I suggest we perhaps add another standalone pipeline for running compute sanitizer?

@leofang leofang requested review from leofang and cryos April 22, 2025 19:25
@leofang leofang added P1 Medium priority - Should do CI/CD CI/CD infrastructure labels Apr 22, 2025
@cryos
Copy link
Collaborator

cryos commented Apr 22, 2025

I was going to look at tools like this next, that is a great point and something I can factor in. Looking at the proposal here picking out a test run would be reasonable, I know there are other tools we would like to run too.

@carterbox
Copy link
Contributor Author

/ok to test f27e1f4

@leofang
Copy link
Member

leofang commented Apr 22, 2025

I doubt commit f27e1f4 would work -- we'll see: #571.

Copy link

@carterbox
Copy link
Contributor Author

/ok to test 05a7068

@cryos cryos linked an issue Apr 23, 2025 that may be closed by this pull request
@cryos cryos removed a link to an issue Apr 23, 2025
@carterbox
Copy link
Contributor Author

/ok to test dde857b

@carterbox
Copy link
Contributor Author

carterbox commented Apr 23, 2025

Yay! The linux-64 tests are failing for the correct reason! (the compute sanitizer returns non-zero because it has detected issues).

https://github.com/NVIDIA/cuda-python/actions/runs/14628267586/job/41045781037?pr=566

Windows tests are failing because I have disabled them partially.

@carterbox
Copy link
Contributor Author

/ok to test a1ea51e

@carterbox carterbox force-pushed the dching/add-compute-sanitizer-to-ci branch from a1ea51e to 0430930 Compare April 24, 2025 20:26
@carterbox
Copy link
Contributor Author

/ok to test 0430930

@leofang leofang added the cuda.bindings Everything related to the cuda.bindings module label Apr 25, 2025
@leofang leofang added the cuda.core Everything related to the cuda.core module label Apr 25, 2025
Copy link
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some quick comments

@carterbox
Copy link
Contributor Author

Whatever changes to cuda.bindings that prevented runtime.cudaGetDevice() from raising CUDA_ERROR_INVALID_CONTEXT when there is no existing device context, need to be backported.

@carterbox
Copy link
Contributor Author

/ok to test 9c25910

@carterbox
Copy link
Contributor Author

/ok to test 7fb013e

@carterbox carterbox marked this pull request as ready for review April 25, 2025 22:48
Copy link
Contributor

copy-pr-bot bot commented Apr 25, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@carterbox carterbox requested a review from leofang April 25, 2025 22:48

## Test-Time Environment Variables

- `CUDA_PYTHON_SANTIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `CUDA_PYTHON_SANTIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.
- `CUDA_PYTHON_SANITIZER_RUNNING` : When set to 1, tests are skipped that would cause [compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html) to raise an error.

if [[ "$COMPUTE_SANITIZER_VERSION" -ge 202111 ]]; then
SANITIZER_CMD="${SANITIZER_CMD} --padding=32"
fi
echo "CUDA_PYTHON_SANITIZER_RUNNING=1" >> $GITHUB_ENV
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the CUDA_PYTHON_SANITIZER_RUNNING variable?

I see below you're doing this:

echo "COMPUTE_SANITIZER_VERSION=${COMPUTE_SANITIZER_VERSION}" >> $GITHUB_ENV

Could we just query that in the tests? E.g. for most cases:

@pytest.mark.skipif(
    os.environ.get("COMPUTE_SANITIZER_VERSION") is not None,
    reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
)

Maybe later we'll have cases that want to look at the version number, then we wouldn't need anything new or special.

@@ -83,6 +84,10 @@ def test_cuda_memcpy():
assert err == cuda.CUresult.CUDA_SUCCESS


@pytest.mark.skipif(
os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1",
reason="The compute-sanitzer is running, and this test intentionally causes an API error.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: sanitzer → sanitizer

(this typo has 10 copies)

We could reduce the copy-paste via a helper in tests/conftest.py (we don't have that yet, but good to start one):

skipif_compute_sanitizer_is_running = pytest.mark.skipif(
    os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1",
    reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
)

Then here it would become:

@skipif_compute_sanitizer_is_running
def test_cuda_array():

Maybe a refinement:

COMPUTE_SANITIZER_IS_RUNNING = os.environ.get("CUDA_PYTHON_SANITIZER_RUNNING", "0") == "1"

skipif_compute_sanitizer_is_running = pytest.mark.skipif(
    COMPUTE_SANITIZER_IS_RUNNING,
    reason="The compute-sanitizer is running, and this test intentionally causes an API error.",
)

Then further down (test_timing) you could use COMPUTE_SANITIZER_IS_RUNNING instead of spelling out the os.environ code again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD CI/CD infrastructure cuda.bindings Everything related to the cuda.bindings module cuda.core Everything related to the cuda.core module P1 Medium priority - Should do
Projects
None yet
4 participants