CUDA: Conv2d Tensor Core #15813
base: master
Conversation
* removed flash-attention definition
force-pushed from 57aa09e to 2cd9fb0
If you're going to make your own primitives anyway, take a look at mma.cuh. The WMMA interface NVIDIA provides for "high-level" CUDA code is quite frankly terrible, so I exposed the tensor core PTX instructions (the assembly-level equivalent). The practical upside is that you get a defined memory layout (important for mul_mat_q and FlashAttention, but I think not here) and that you can use smaller matrix tiles (the minimum is 16x8). The downside is that Volta and AMD are still lacking an implementation.
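For reference, a minimal sketch of what using the raw PTX tensor core instruction looks like (assuming Ampere or newer and compilation for sm_80+; the kernel and variable names are illustrative and not taken from mma.cuh): one warp multiplies a 16x16 fp16 tile A (row-major) by a 16x8 fp16 tile B (column-major) into a 16x8 fp32 tile D, with the per-thread element mapping taken from the layout documented in the PTX ISA manual.

```cuda
#include <cuda_fp16.h>

// Launch with a single warp, e.g. mma_16x8x16_tile<<<1, 32>>>(dA, dB, dD);
__global__ void mma_16x8x16_tile(const half * A, const half * B, float * D) {
    const int lane  = threadIdx.x; // 0..31
    const int group = lane / 4;    // "groupID" in the PTX docs
    const int tig   = lane % 4;    // thread index within the group

    // A fragment: 8 halves per thread, packed as 4 .f16x2 registers.
    unsigned a[4];
    const int arow[4] = {group, group + 8, group,     group + 8};
    const int acol[4] = {tig*2, tig*2,     tig*2 + 8, tig*2 + 8};
    for (int i = 0; i < 4; ++i) {
        half2 v = __halves2half2(A[arow[i]*16 + acol[i]], A[arow[i]*16 + acol[i] + 1]);
        a[i] = *reinterpret_cast<unsigned *>(&v);
    }

    // B fragment (16x8, column-major): 4 halves per thread as 2 .f16x2 registers.
    unsigned b[2];
    const int brow[2] = {tig*2, tig*2 + 8};
    for (int i = 0; i < 2; ++i) {
        half2 v = __halves2half2(B[group*16 + brow[i]], B[group*16 + brow[i] + 1]);
        b[i] = *reinterpret_cast<unsigned *>(&v);
    }

    // Accumulator/output fragment: 4 floats per thread, initialized to 0.
    float d[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    // D = A*B + D, one instruction per warp.
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));

    // Store the 16x8 result row-major using the documented accumulator layout.
    D[ group      * 8 + tig*2 + 0] = d[0];
    D[ group      * 8 + tig*2 + 1] = d[1];
    D[(group + 8) * 8 + tig*2 + 0] = d[2];
    D[(group + 8) * 8 + tig*2 + 1] = d[3];
}
```

The difference from WMMA is that the (row, col) pair held in each register is fixed by the instruction, so loads and stores can be addressed directly instead of going through wmma::load_matrix_sync/store_matrix_sync.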
It is becoming increasingly hard to test these kinds of changes with
I only monitor the llama.cpp and ggml repositories, but when it comes to convolution kernels such as this one, it would also be fine with me if you open PRs in sd.cpp and tag me.
@mnehete32 please run the tests, the output seems to be broken.
@Green-Sky Checking.
I think the kernel was not able to launch all of its threads, since the launcher launches one warp per WMMA_M, WMMA_N tile. I will work on launching fewer threads per block, also with the
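For illustration, a hedged sketch of the launch math being discussed (WMMA_M and WMMA_N come from the comment above; WARPS_PER_BLOCK, the tile sizes, and the helper name are assumptions): one warp is assigned per WMMA_M x WMMA_N output tile, with ceiling division so partially filled edge tiles still get a warp, which those warps then have to bounds-check.

```cuda
// Hypothetical launch-configuration helper: one warp per output tile.
constexpr int WMMA_M          = 16; // tile height (assumed)
constexpr int WMMA_N          = 16; // tile width  (assumed)
constexpr int WARP_SIZE       = 32;
constexpr int WARPS_PER_BLOCK = 4;  // assumed block shape: 4 warps = 128 threads

static dim3 conv2d_grid(const int out_rows, const int out_cols) {
    const int tiles_m = (out_rows + WMMA_M - 1) / WMMA_M; // ceil-div: edge tiles included
    const int tiles_n = (out_cols + WMMA_N - 1) / WMMA_N;
    const int warps   = tiles_m * tiles_n;                // one warp per tile
    return dim3((warps + WARPS_PER_BLOCK - 1) / WARPS_PER_BLOCK);
}
// The block dimension would then be dim3(WARP_SIZE * WARPS_PER_BLOCK), and the
// kernel must skip warps whose tile index is >= tiles_m * tiles_n.
```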
FYI, for tensor cores you theoretically don't need shared memory at all. Each thread in a warp holds fractions of the input and output tiles in its registers. You only need shared memory to organize the data in such a way that the global memory accesses are coalesced (see
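A minimal sketch of that staging pattern (the tile size and names are illustrative, and a single warp per block is assumed): the warp first copies a 16x16 fp16 tile from global memory with contiguous, coalesced accesses into shared memory, and only then gathers the scattered per-thread fragment elements from there.

```cuda
#include <cuda_fp16.h>

__device__ void load_a_fragment_staged(const half * A, const int lda, unsigned a[4]) {
    __shared__ half tile[16][16]; // staging buffer (one warp per block assumed)

    const int lane = threadIdx.x % 32;

    // Coalesced copy: consecutive lanes read consecutive halves of a row.
    for (int i = lane; i < 16*16; i += 32) {
        const int r = i / 16;
        const int c = i % 16;
        tile[r][c] = A[r*lda + c];
    }
    __syncwarp();

    // Gather the per-thread fragment (mma.m16n8k16 A layout); this access
    // pattern is strided and would be uncoalesced against global memory.
    const int group = lane / 4;
    const int tig   = lane % 4;
    const int rows[4] = {group, group + 8, group,     group + 8};
    const int cols[4] = {tig*2, tig*2,     tig*2 + 8, tig*2 + 8};
    for (int i = 0; i < 4; ++i) {
        half2 v = __halves2half2(tile[rows[i]][cols[i]], tile[rows[i]][cols[i] + 1]);
        a[i] = *reinterpret_cast<unsigned *>(&v);
    }
}
```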
I thought that because the output-to-thread mapping is unknown (it changes based on architecture), I first need to load the output into shared memory before storing it.
If you read the PTX documentation you'll find that all tensor core instructions have a well-defined memory layout. It's only when you try to cover all tensor core instructions with a simple interface that you run into problems. Volta has 8x8 tensor cores. Turing, Ampere, and Ada Lovelace have 16x8 tensor cores (used by
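As a concrete illustration of the "well-defined memory layout" point (assuming the fp32 accumulator layout of the m16n8k* instructions as documented in the PTX ISA manual; the helper name is made up): each thread can compute the (row, col) of the four output values it holds and write them straight from registers to global memory, with no shared-memory round trip.

```cuda
// Hypothetical helper: coordinates of the i-th (0..3) fp32 accumulator value
// held by a given lane in a 16x8 output tile.
__device__ __forceinline__ int2 acc_coord_16x8(const int lane, const int i) {
    const int group = lane / 4;             // group of 4 threads
    const int tig   = lane % 4;             // thread index within the group
    const int row   = group + (i / 2) * 8;  // d0,d1 -> rows 0..7, d2,d3 -> rows 8..15
    const int col   = tig*2 + (i % 2);      // consecutive columns within a pair
    return make_int2(row, col);
}
```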
Follow-up of PR #15635
Convolution Performance Results (Old)
FP32 (float32) Performance
FP16 (float16) Performance
Convolution Performance Results (New)
FP32 (float32) Performance
FP16 (float16) Performance
Convolution Performance Comparison (Old vs New)
FP32 (float32)
FP16 (float16)
Summary:
* Uses ggml_cuda_cast<T> for the cast, to make sure it doesn't break the build.
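For context, a rough sketch of what a templated cast helper along these lines can look like (this is not the actual ggml_cuda_cast from the repository, just a hypothetical stand-in to illustrate why centralizing the float/half conversions keeps the kernel building for both source types):

```cuda
#include <cuda_fp16.h>
#include <type_traits>

// Hypothetical stand-in for a templated device-side cast (requires C++17).
template <typename dst_t, typename src_t>
__device__ __forceinline__ dst_t cast_value(const src_t v) {
    if constexpr (std::is_same_v<dst_t, half> && std::is_same_v<src_t, float>) {
        return __float2half(v);
    } else if constexpr (std::is_same_v<dst_t, float> && std::is_same_v<src_t, half>) {
        return __half2float(v);
    } else {
        return (dst_t) v; // same type or a plain arithmetic conversion
    }
}
```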