Draft
nathanielsimard (Member) left a comment
I would try to follow the Guide a bit more closely regarding the kernel architecture. Things like NUM_SAMPLES, items per thread, threads per block, etc. could be in a blueprint.
Also, similar to the reduce kernels, I would try to split the kernels into components, so we can create one kernel that works without plane instructions (good for the CPU backend), another that works at the plane level, and finally one where the sort is done across a cube.
Also I would adopt the CubeCL naming scheme with Cube, Plane and Unit instead of Block, Warp and Lane.
Add Radix sorting kernels
Burn is currently missing GPU sorting kernels. This adds a basi
DeviceRadixSort
The kernels are based on the DeviceRadixSort in https://github.com/b0nes164/GPUSorting, though with some tweaks they seem to slightly outperform the b0nes version.
TODO: More performance numbers.
The OneSweep version would be even faster but isn't as portable. The code should work on any plane size as well.
TODO: Test on plane size == 8
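For readers unfamiliar with the overall structure, here is a minimal CPU reference sketch of a least-significant-digit radix sort. It is not the kernel code from this PR: the function name `radix_sort_u32` and the 8-bit digit width are assumptions chosen for illustration. It only shows the three phases a device radix sort typically performs per pass (digit histogram, exclusive prefix sum over the histogram, stable scatter), which can also serve as a correctness oracle in tests.

```rust
/// Hypothetical CPU reference, not the PR's kernels.
/// Assumed digit width: 8 bits, so a u32 key takes 4 passes.
const RADIX_BITS: u32 = 8;
const RADIX: usize = 1 << RADIX_BITS;

pub fn radix_sort_u32(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / RADIX_BITS) {
        let shift = pass * RADIX_BITS;

        // 1. Histogram: count how many keys fall into each digit bucket.
        let mut offsets = [0usize; RADIX];
        for &k in keys.iter() {
            offsets[((k >> shift) as usize) & (RADIX - 1)] += 1;
        }

        // 2. Exclusive prefix sum: turn counts into bucket start offsets.
        let mut running = 0usize;
        for o in offsets.iter_mut() {
            let count = *o;
            *o = running;
            running += count;
        }

        // 3. Stable scatter: keys keep their relative order within a bucket.
        for &k in keys.iter() {
            let digit = ((k >> shift) as usize) & (RADIX - 1);
            scratch[offsets[digit]] = k;
            offsets[digit] += 1;
        }

        // Ping-pong the buffers for the next pass.
        std::mem::swap(keys, &mut scratch);
    }
}
```

On a GPU the histogram and scatter are split across cubes and the prefix sum runs over per-cube histograms, but the per-pass data flow is the same.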
Scan
The kernels essentially include a small prefix sum as well. One day cubek-scan would be great to have, but for now we can leave this as a kernel specialized to sorting.
Implicit indices
One feature CUB and co. don't have is sorting with indices. This is usually done by sorting tuples of (K, V) where V[i] == i. We can avoid instantiating these values and instead just use the element index directly, selected at compile time. In a way this amounts to fusing an arange + sort kernel; with a powerful enough fusion system this may no longer be needed in the future.
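As a rough illustration of the implicit-index idea, the CPU sketch below returns the sorting permutation without ever building the V[i] == i array. The function name `radix_argsort_u32` and the 8-bit digit width are assumptions for illustration, not the PR's API. On the first pass the value written alongside each key is simply its source position; later passes carry the permuted indices along. In a kernel, the `pass == 0` branch would presumably be a compile-time flag rather than a runtime check.

```rust
/// Hypothetical CPU sketch of an argsort with implicit indices.
/// Returns the stable permutation that sorts `keys` in ascending order.
pub fn radix_argsort_u32(keys: &[u32]) -> Vec<u32> {
    const RADIX_BITS: u32 = 8; // assumed digit width
    const RADIX: usize = 1 << RADIX_BITS;

    let n = keys.len();
    let mut keys_in = keys.to_vec();
    let mut keys_out = vec![0u32; n];
    // Ping-pong index buffers. Neither is ever filled with 0..n.
    let mut idx_in = vec![0u32; n];
    let mut idx_out = vec![0u32; n];

    for pass in 0..(32 / RADIX_BITS) {
        let shift = pass * RADIX_BITS;

        // Histogram followed by an exclusive prefix sum.
        let mut offsets = [0usize; RADIX];
        for &k in &keys_in {
            offsets[((k >> shift) as usize) & (RADIX - 1)] += 1;
        }
        let mut running = 0usize;
        for o in offsets.iter_mut() {
            let count = *o;
            *o = running;
            running += count;
        }

        // Stable scatter of keys together with their indices.
        for i in 0..n {
            let k = keys_in[i];
            let digit = ((k >> shift) as usize) & (RADIX - 1);
            let dst = offsets[digit];
            offsets[digit] += 1;
            keys_out[dst] = k;
            // First pass: the index is implicit (the source position).
            idx_out[dst] = if pass == 0 { i as u32 } else { idx_in[i] };
        }

        std::mem::swap(&mut keys_in, &mut keys_out);
        std::mem::swap(&mut idx_in, &mut idx_out);
    }
    idx_in
}
```

Compared with sorting explicit (K, V) pairs, this skips one full write and one full read of the index buffer before the first pass.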
CubeCL: tracel-ai/cubecl#1170
Burn: tracel-ai/burn#4436
Disclaimer: Initial version was done with Claude