
Sort#77

Draft
ArthurBrussee wants to merge 32 commits into tracel-ai:main from ArthurBrussee:sort


ArthurBrussee commented Feb 2, 2026

Add Radix sorting kernels

Burn is missing GPU sorting kernels atm. This adds a basi

DeviceRadixSort

The kernels are based on the DeviceRadixSort in https://github.com/b0nes164/GPUSorting, though with some tweaks they seem to slightly outperform the b0nes version.

TODO: More performance numbers.

The OneSweep version would be even faster but isn't as portable. The code should work on any plane size as well.
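For readers unfamiliar with the structure, here is a minimal CPU sketch of the passes a least-significant-digit radix sort goes through: a per-digit histogram, an exclusive prefix sum over the histogram, and a stable scatter. This is not the kernel code from this PR. The 8-bit digit width and the names `RADIX_BITS`, `RADIX` and `radix_sort_u32` are assumptions chosen for illustration only.

```rust
// Illustrative CPU reference, not the GPU kernels from this PR.
// Assumption: 8 bits per pass, so a u32 key needs 4 passes.
const RADIX_BITS: u32 = 8;
const RADIX: usize = 1 << RADIX_BITS; // 256 buckets per pass

fn radix_sort_u32(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / RADIX_BITS) {
        let shift = pass * RADIX_BITS;

        // 1. Histogram: count how many keys fall into each digit bucket.
        let mut counts = [0usize; RADIX];
        for &k in keys.iter() {
            counts[((k >> shift) as usize) & (RADIX - 1)] += 1;
        }

        // 2. Exclusive prefix sum: turn counts into bucket start offsets.
        let mut running = 0usize;
        for c in counts.iter_mut() {
            let n = *c;
            *c = running;
            running += n;
        }

        // 3. Stable scatter: keys with equal digits keep their relative order,
        //    which is what makes the multi-pass sort correct.
        for &k in keys.iter() {
            let d = ((k >> shift) as usize) & (RADIX - 1);
            scratch[counts[d]] = k;
            counts[d] += 1;
        }

        // Ping-pong the buffers for the next pass.
        std::mem::swap(keys, &mut scratch);
    }
}
```

On the GPU the same three phases are spread across many cubes, which is where the portability and plane-size concerns come in.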

TODO: Test on plane size == 8

Scan

The kernels include essentially a small prefix sum as well. One day cubek-scan would be great to have, but for now we can leave this as a kernel specialized to sorting.
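As a rough picture of what such a prefix sum does, here is a CPU sketch of a two-level exclusive scan: reduce each block, scan the block totals, then scan locally within each block starting from its offset. It is only an analogy for how a scan is typically split across groups of threads, not the kernel in this PR. The function name `blocked_exclusive_scan` and the `block` parameter are hypothetical.

```rust
// Illustrative CPU sketch of a two-level exclusive scan.
// `block` stands in for the number of elements one group of threads handles.
fn blocked_exclusive_scan(input: &[u32], block: usize) -> Vec<u32> {
    assert!(block > 0, "block size must be positive");

    // Phase 1: reduce every block to a single total.
    let totals: Vec<u32> = input
        .chunks(block)
        .map(|c| c.iter().sum::<u32>())
        .collect();

    // Phase 2: exclusive scan over the block totals gives each block's base offset.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0u32;
    for t in &totals {
        offsets.push(running);
        running += *t;
    }

    // Phase 3: exclusive scan inside each block, seeded with its base offset.
    let mut out = Vec::with_capacity(input.len());
    for (chunk, &base) in input.chunks(block).zip(&offsets) {
        let mut acc = base;
        for &x in chunk {
            out.push(acc);
            acc += x;
        }
    }
    out
}
```

In the sorting case the scanned values are digit counts, and the result is the position where each bucket starts in the output.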

Implicit indices

One feature CUB and co don't have is sorting with indices. This is usually done by sorting tuples of (K, V) where V[i] == i. We can save having to instantiate these values and instead just use the index at compile time. I guess in a way this amounts to a fusion of an arange + sort kernel; maybe in the future, with a powerful enough fusion system, that won't be needed anymore.
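To make the idea concrete, here is a CPU sketch of a key-value radix sort where the first pass takes the element's position as its value instead of reading a pre-filled `0..n` buffer. It is an illustration under assumptions, not the PR's implementation: the name `argsort_implicit`, the 8-bit digits and the runtime `pass == 0` check (standing in for whatever compile-time switch the kernels use) are all hypothetical.

```rust
// Illustrative CPU sketch: sort keys and return the permutation, without ever
// filling an index buffer with 0..n before the sort starts.
fn argsort_implicit(keys: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let n = keys.len();
    let mut k_in = keys.to_vec();
    let mut k_out = vec![0u32; n];
    // Ping-pong value buffers. `v_in` is never read during the first pass.
    let mut v_in = vec![0u32; n];
    let mut v_out = vec![0u32; n];

    for pass in 0..4u32 {
        let shift = pass * 8;

        // Histogram of the current 8-bit digit.
        let mut offsets = [0usize; 256];
        for &k in &k_in {
            offsets[((k >> shift) & 0xFF) as usize] += 1;
        }

        // Exclusive prefix sum: counts become bucket start offsets.
        let mut running = 0usize;
        for o in offsets.iter_mut() {
            let c = *o;
            *o = running;
            running += c;
        }

        // Stable scatter of keys and values.
        for (i, &k) in k_in.iter().enumerate() {
            let d = ((k >> shift) & 0xFF) as usize;
            let dst = offsets[d];
            offsets[d] += 1;
            k_out[dst] = k;
            // First pass: the value is implicitly the element's own position.
            v_out[dst] = if pass == 0 { i as u32 } else { v_in[i] };
        }

        std::mem::swap(&mut k_in, &mut k_out);
        std::mem::swap(&mut v_in, &mut v_out);
    }
    (k_in, v_in)
}
```

For example, `argsort_implicit(&[30, 10, 20, 10])` returns the sorted keys together with the original positions, and equal keys keep their input order because every pass is stable.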

CubeCL: tracel-ai/cubecl#1170
Burn: tracel-ai/burn#4436

Disclaimer: Initial version was done with Claude

Member

@nathanielsimard nathanielsimard left a comment


I would try to follow the Guide a bit more closely regarding the kernel architecture. Things like NUM_SAMPLES, items per thread, threads per block, etc. could be in a blueprint.

Also, similar to the reduce kernels, I would try to split the kernels into components, so we can create one kernel that works without using plane instructions (good for the CPU backend), another that works at the plane level, and finally one where the sort is done across a cube.

Also I would adopt the CubeCL naming scheme with Cube, Plane and Unit instead of Block, Warp and Lane.
