Sort by ArthurBrussee · Pull Request #77 · tracel-ai/cubek

ArthurBrussee · 2026-02-02T23:06:24Z

Add Radix sorting kernels

Burn is missing GPU sorting kernels atm. This adds a basi

DeviceRadixSort

The kernels are based on the DeviceRadixSort in https://github.com/b0nes164/GPUSorting, though with some tweaks they seem to slightly outperform the b0nes version.

TODO: More performance numbers.

The OneSweep version would be even faster but isn't as portable. The code should work on any plane size as well.

TODO: Test on plane size == 8
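For readers unfamiliar with the approach, here is a minimal single-threaded CPU sketch of the general shape of a least-significant-digit radix sort: histogram, exclusive scan, stable scatter, repeated once per digit. It is not the kernel code from this PR. The 8-bit digit width and the name `radix_sort_u32` are assumptions chosen for illustration.

```rust
/// Bits consumed per pass (assumption for this sketch: 8-bit digits, 256 bins).
const RADIX_BITS: u32 = 8;
const RADIX: usize = 1 << RADIX_BITS;

/// CPU reference of a least-significant-digit radix sort on u32 keys.
/// Hypothetical helper, not part of cubek.
pub fn radix_sort_u32(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];

    for pass in 0..(u32::BITS / RADIX_BITS) {
        let shift = pass * RADIX_BITS;

        // 1. Histogram: count how many keys fall into each digit bin.
        let mut offsets = [0usize; RADIX];
        for &k in keys.iter() {
            offsets[((k >> shift) as usize) & (RADIX - 1)] += 1;
        }

        // 2. Exclusive scan: turn counts into the first output slot of each bin.
        let mut running = 0;
        for o in offsets.iter_mut() {
            let count = *o;
            *o = running;
            running += count;
        }

        // 3. Stable scatter: keys keep their relative order within a bin,
        //    which is what makes the digit-by-digit passes compose correctly.
        for &k in keys.iter() {
            let bin = ((k >> shift) as usize) & (RADIX - 1);
            scratch[offsets[bin]] = k;
            offsets[bin] += 1;
        }

        // Ping-pong the buffers for the next pass.
        std::mem::swap(keys, &mut scratch);
    }
}
```

A GPU version distributes each of the three steps across many cubes, but the per-pass data flow is the same.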

Scan

The kernels include essentially a small prefix sum as well. One day cubek-scan would be great to have, but for now we can leave this as a kernel specialized to sorting.
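As a rough illustration of what the prefix sum does in a sort, the sketch below runs an exclusive scan over per-partition digit counts on the CPU. The function names and the digit-major layout (one row of `num_partitions` counts per digit) are assumptions made for this example, not the layout used by the kernels in this PR.

```rust
/// In-place exclusive prefix sum; returns the total of the original values.
/// Hypothetical helper for illustration.
pub fn exclusive_scan(values: &mut [u32]) -> u32 {
    let mut running = 0u32;
    for v in values.iter_mut() {
        let current = *v;
        *v = running;
        running += current;
    }
    running
}

/// Scans each digit's row of per-partition counts so that every partition
/// learns where its keys for that digit start. Returns the per-digit totals.
/// Assumed layout: counts[digit * num_partitions + partition].
pub fn scan_partition_counts(counts: &mut [u32], num_partitions: usize) -> Vec<u32> {
    assert!(num_partitions > 0, "need at least one partition");
    counts
        .chunks_mut(num_partitions)
        .map(exclusive_scan)
        .collect()
}
```

The per-digit totals would then be scanned once more to get each digit's global base offset.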

Implicit indices

One feature CUB and co don't have is sorting with indices. This is usually done by sorting tuples of (K, V) where V[i] == i. We can avoid instantiating these values and instead just use the index at compile time. In a way this amounts to fusing an arange + sort kernel; with a powerful enough fusion system this may no longer be needed in the future.
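The idea can be sketched on the CPU as an argsort whose first pass uses the element position as the value instead of reading a values buffer, so the `0..n` array is never materialized. This is an illustrative sketch only: `radix_argsort_u32`, the 8-bit digit width, and the runtime `implicit` branch are assumptions (the description above suggests the real kernels make this choice at compile time).

```rust
const RADIX_BITS: u32 = 8;
const RADIX: usize = 1 << RADIX_BITS;

fn digit(key: u32, shift: u32) -> usize {
    ((key >> shift) as usize) & (RADIX - 1)
}

/// Returns (sorted keys, original index of each sorted key).
/// Hypothetical helper, not part of cubek.
pub fn radix_argsort_u32(keys: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let n = keys.len();
    let mut keys_in = keys.to_vec();
    let mut keys_out = vec![0u32; n];
    // Never filled with 0..n: the first pass does not read this buffer.
    let mut vals_in = vec![0u32; n];
    let mut vals_out = vec![0u32; n];

    for pass in 0..(u32::BITS / RADIX_BITS) {
        let shift = pass * RADIX_BITS;
        let implicit = pass == 0;

        // Histogram followed by an exclusive scan over the bins.
        let mut offsets = [0usize; RADIX];
        for &k in &keys_in {
            offsets[digit(k, shift)] += 1;
        }
        let mut running = 0;
        for o in offsets.iter_mut() {
            let count = *o;
            *o = running;
            running += count;
        }

        // Stable scatter of keys and values together.
        for (i, &k) in keys_in.iter().enumerate() {
            let d = digit(k, shift);
            // First pass: the value is the position itself (implicit index).
            let v = if implicit { i as u32 } else { vals_in[i] };
            keys_out[offsets[d]] = k;
            vals_out[offsets[d]] = v;
            offsets[d] += 1;
        }

        std::mem::swap(&mut keys_in, &mut keys_out);
        std::mem::swap(&mut vals_in, &mut vals_out);
    }
    (keys_in, vals_in)
}
```

Because the scatter is stable, equal keys come out in order of their original positions.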

CubeCL: tracel-ai/cubecl#1170
Burn: tracel-ai/burn#4436

Disclaimer: Initial version was done with Claude

nathanielsimard

I would try to follow the Guide a bit more regarding the kernel architecture. Things like NUM_SAMPLES, items per thread, threads per block, etc. could be in a blueprint.

Also, similar to the reduce kernels, I would try to split kernels into components, so we can create a kernel that works without using plane instructions (good for the CPU backend), another that works at the plane level, and finally one where the sort is done across a cube.

Also I would adopt the CubeCL naming scheme with Cube, Plane and Unit instead of Block, Warp and Lane.
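To make the suggestion concrete, here is one possible shape for grouping the tuning constants, using Cube/Unit naming. It is a plain Rust sketch: `SortBlueprint`, its fields, and its helper methods are hypothetical and do not correspond to an existing cubek or CubeCL type.

```rust
/// Hypothetical bundle of tuning parameters for a radix sort kernel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortBlueprint {
    /// Bits sorted per pass.
    pub radix_bits: u32,
    /// Units launched per cube ("threads per block" in CUDA terms).
    pub units_per_cube: u32,
    /// Keys handled by each unit ("items per thread").
    pub keys_per_unit: u32,
}

impl SortBlueprint {
    /// Keys processed by one cube in a single pass.
    pub fn keys_per_cube(&self) -> u32 {
        self.units_per_cube * self.keys_per_unit
    }

    /// Passes needed to cover a key of `key_bits` bits.
    pub fn num_passes(&self, key_bits: u32) -> u32 {
        key_bits.div_ceil(self.radix_bits)
    }

    /// Cubes needed to cover `num_keys` keys.
    pub fn num_cubes(&self, num_keys: u32) -> u32 {
        num_keys.div_ceil(self.keys_per_cube())
    }
}
```

Keeping these values in one struct means launch sizes and pass counts are derived in one place rather than from scattered constants.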

ArthurBrussee added 24 commits January 29, 2026 16:29

WIP baseline (single threaded)

456d1a9

Partially parallel

646f997

WIP

9dab244

Proper paralellization and f32/i32 support

09fa433

Add benches

7f9f843

Cleanup

97d346d

Small speedup

fab22c4

Coalesced shared mem

945d6a3

Speedup

17e60f6

Cached digit

28059b6

Misc speedups

ed9b3f9

Better benchmark

610f3b9

Speedups

e900584

Moar faster

cb93627

~100GB/s

934c017

Dont cache digit after all

86deb42

Improve speed, measure batched perf

5d9be94

Use plane_id and local CubeCL for now

6869e87

Cleanup

02c4961

Cleanup, cleanup tests, support more types

9dc514a

Cleanup bench, just measure keys/s, fix CUDA crash, add test for OOB

df4c188

Fixes for CUDA (shared mem init), cleanup benches

23c3f25

Remove a test

9accdc5

Cleanup

584f9b3

ArthurBrussee mentioned this pull request Feb 3, 2026

Add some missing infra for radix sorting tracel-ai/cubecl#1170

Merged

ArthurBrussee added 5 commits February 3, 2026 16:07

Refactor cubek-sort to support implicit indices

2446861

Small cleanups in kernel

bcbf1a8

Simplify scan kernel a bit

89d4423

Skip mem init in scan kernel

3c1dd94

Remove unneeded write to g_scane

32be390

ArthurBrussee mentioned this pull request Feb 4, 2026

Use CubeK radix sort kernels tracel-ai/burn#4436

Draft

nathanielsimard reviewed Feb 4, 2026


ArthurBrussee added 3 commits February 11, 2026 17:30

Cleanuo & remove some unneeded args

c0ab7e2

Add support for different value sizes, cleanup

d003a48

Some more cleanup of the kernels

34ae8bc

ArthurBrussee wants to merge 32 commits into tracel-ai:main from ArthurBrussee:sort
