TODO: These notes (and code) are incomplete. My goal is to eventually provide a step-by-step guide and introduction to a simple radix sort implementation, and try to cover the basic issues which are sometimes glossed over in textbooks and courses. Furthermore, WHILE THIS NOTE PERSISTS, I MAY FORCE PUSH TO MASTER.
These are my notes on engineering a radix sort.
The ideas behind radix sorting are not new in any way, but they seem to have become, if not forgotten, then at least under-utilized, as over the years quicksort became the go-to "default" sorting algorithm for internal sorting.
Unless otherwise specified, this code is written foremost to be clear and easy to understand.
All code is provided under the MIT License.
- Motivation
- Building
- From the top; Counting sort
- All together now; Radix sort
- Immutability / Returning indices
- Listing 6: Rank sort
- Key derivation
- Optimizations
- C++ Implementation
- Downsides
- Esoterics
- Uniquely sorting with bitmaps
- Listing 7: Bitmap sort
- Todo
- Contributors
- Resources
This code can sort 40 million 32-bit integers in under half a second using a single core of an Intel i5-3570T, a low-TDP CPU from 2012 using DDR3-1333. std::sort requires ~3.5s for the same task (with the caveat that it's in-place).
The included Makefile will build all the code on any reasonably up-to-date Linux installation.
$ make
The default build enables a lot of compiler warning flags and optimizations, but care has been taken to make sure the examples are easy to build and run individually too:
$ cc radix_sort_u32.c -o example1
$ ./example1
The `radix` program and the benchmark expect a file named `40M_32bit_keys.dat` to exist. This file is generated by running `make bench` or `make genkeys`.
See the section on C++ Implementation for more information on how to use the test program and benchmark.
Possibly the simplest way to sort an array of integers is to count how many times each value occurs, and then use those counts to write out the result.
This is the most basic counting sort.
static void counting_sort_8(uint8_t *arr, size_t n)
{
	size_t cnt[256] = { 0 };
	size_t i;

	// Count the number of occurrences of each octet.
	for (i = 0 ; i < n ; ++i) {
		cnt[arr[i]]++;
	}

	// Write out each value a, cnt[a] times, in ascending order.
	i = 0;
	for (size_t a = 0 ; a < 256 ; ++a) {
		while (cnt[a]--)
			arr[i++] = a;
	}
}
You could easily extend this code to work with 16-bit values, but if you try to push much further the drawbacks of this counting sort become fairly obvious: you need a location to store the count for each unique integer. With 32-bit counters, 8- and 16-bit keys amount to 2^8*4 = 1KiB and 2^16*4 = 256KiB of memory. For 32-bit integers, it'd require 2^32*4 = 16GiB of memory[^1]. Multiply by two if you need 64- instead of 32-bit counters.
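For illustration, extending the listing above to 16-bit keys is mostly a matter of a larger count table, which at this size is better heap-allocated. A minimal sketch, not part of the repository (error handling elided):

```c
#include <stdint.h>
#include <stdlib.h>

static void counting_sort_16(uint16_t *arr, size_t n)
{
	// 2^16 counters; calloc zeroes the table for us.
	size_t *cnt = calloc(65536, sizeof *cnt);
	size_t i = 0;

	// Count the number of occurrences of each 16-bit value.
	for (size_t j = 0 ; j < n ; ++j)
		cnt[arr[j]]++;

	// Write out each value v, cnt[v] times, in ascending order.
	for (uint32_t v = 0 ; v < 65536 ; ++v) {
		while (cnt[v]--)
			arr[i++] = (uint16_t)v;
	}
	free(cnt);
}
```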
Again, we're only sorting an array of integers, nothing more. It's not immediately obvious how we could use this to sort records with multiple pieces of data.
As presented, this sort is in-place, but since it's not moving any elements, it doesn't really make sense to think of it as being stable or unstable.
An in-place sort is a sorting algorithm where the amount of extra memory used does not depend on the size of the input.
A stable sort is one where records with equal keys keep their relative position in the sorted output.
To get us closer to radix sorting, we now need to consider a slightly more general variant where we're, at least conceptually, rearranging input elements:
static void counting_sort_8s(uint8_t *arr, uint8_t *aux, size_t n)
{
	size_t cnt[256] = { 0 };
	size_t i;

	// Count the number of occurrences of each octet.
	for (i = 0 ; i < n ; ++i) {
		cnt[arr[i]]++;
	}

	// Calculate prefix sums.
	size_t a = 0;
	for (int j = 0 ; j < 256 ; ++j) {
		size_t b = cnt[j];
		cnt[j] = a;
		a += b;
	}

	// Sort elements
	for (i = 0 ; i < n ; ++i) {
		// Get the key for the current entry.
		uint8_t k = arr[i];
		// Find the location this entry goes into in the output array.
		size_t dst = cnt[k];
		// Copy the current entry into the right place.
		aux[dst] = arr[i];
		// Make it so that the next 'k' will be written after this one.
		// Since we process source entries in increasing order, this makes the sort stable.
		cnt[k]++;
	}
}
We have introduced a separate output array, which means we are no longer in-place. This auxiliary array is required; the algorithm would break if we tried to write directly into the input array.
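To make the calling convention concrete, a hypothetical driver; note that the sorted result lands in `aux`, not in the input array:

```c
uint8_t data[] = { 3, 1, 2, 1 };
uint8_t aux[4];

counting_sort_8s(data, aux, 4);
// aux now holds { 1, 1, 2, 3 }; data is untouched.
```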
However, the main difference between this and the first variant is that we're no longer directly writing the output from the counts. Instead the counts are re-processed into a prefix sum (or exclusive scan) in the second loop.
This gives us, for each input value, its first location in the sorted output array, i.e. the value of `cnt[j]` tells us the array index at which to write the first j to the output, because `cnt[j]` contains the total count of all elements that would precede ('sort lower than') j.

For instance, `cnt[0]` will always be zero, because any 0 will always end up in the first position in the output (we're assuming only non-negative integers for now). `cnt[1]` will contain how many zeroes precede the first 1, `cnt[2]` will contain how many zeroes and ones precede the first 2, and so on.

As a concrete example, if the input contains three 0s, one 1, no 2s and two 3s, the counts { 3, 1, 0, 2, ... } become the prefix sums { 0, 3, 4, 4, ... }: the first 0 goes to index 0, the single 1 to index 3, and the first 3 to index 4.
In the sorting loop, we look up the output location for the key of the entry we're sorting, and write the entry there. We then increase the count of the prefix sum for that key by one, which guarantees that the next same-keyed entry is written just after.
Because we are processing the input entries in order, from the lowest to the highest index, and preserving this order when we write them out, this sort is in essence stable. That said, it's a bit of a pointless distinction since without any other data associated with the keys, there's nothing that distinguishes same keys from one another.
With a few basic modifications, we arrive at:
static void counting_sort_rec_sk(struct sortrec *arr, struct sortrec *aux, size_t n)
{
	size_t cnt[256] = { 0 };
	size_t i;

	// Count the number of occurrences of each key.
	for (i = 0 ; i < n ; ++i) {
		uint8_t k = key_of(arr + i);
		cnt[k]++;
	}

	// Calculate prefix sums.
	size_t a = 0;
	for (int j = 0 ; j < 256 ; ++j) {
		size_t b = cnt[j];
		cnt[j] = a;
		a += b;
	}

	// Sort elements
	for (i = 0 ; i < n ; ++i) {
		// Get the key for the current entry.
		uint8_t k = key_of(arr + i);
		size_t dst = cnt[k];
		aux[dst] = arr[i];
		cnt[k]++;
	}
}
We are now sorting an array of `struct sortrec`, not an array of octets. The name and the type of the struct here are arbitrary; it's only used to show that we're not restricted to sorting arrays of integers.

The primary modification to the sorting function is the small addition of a function `key_of()`, which returns the key for a given record.
The main insight you should take away from this is that, to sort record types, we just need some way to extract or derive a key for each entry.
We're still restricted to integer keys. We rely on there being some sort of mapping from our records (or entries) to the integers which orders the records the way we require.
Here we still use a single octet inside the `struct sortrec` as the key. Associated with each key is a short string. This allows us to demonstrate that a) sorting keys with associated data is not a problem, and b) the sort is indeed stable.
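For reference, a minimal sketch of the record type and key accessor assumed by the listing above; the repository's definitions may differ in detail:

```c
struct sortrec {
	uint8_t key;
	const char *name;
};

static uint8_t key_of(const struct sortrec *rec)
{
	return rec->key;
}
```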
Running the full program demonstrates that like-keyed records are output in the same order they appear in the input array, i.e. the sort is stable.
$ ./counting_sort_rec_sk
00000000: 1 -> 1
00000001: 2 -> 2
00000002: 3 -> 3
00000003: 45 -> 1st 45
00000004: 45 -> 2nd 45
00000005: 45 -> 3rd 45
00000006: 255 -> 1st 255
00000007: 255 -> 2nd 255
Now we are ready to take the step from counting sorts to radix sorts.
On a high level, the radix sort we'll cover next uses the counting sort we have already discussed, but overcomes the counting sort's inherent inability to deal with large-magnitude (or wide) keys by using multiple passes, each pass processing only a part of the key.
Some texts describe this as looking at the individual digits of an integer key, which you can process digit-by-digit via a series of modulo (remainder) and division operations with the number 10, the base or radix.
In a computer we deal with bits, which means the base is inherently two, not ten.
Because working with individual bits is in some sense "slow", we group several of them together into units that are faster and/or more convenient to work with. One such grouping is into strings of eight bits, called octets or bytes.
An octet can represent 2^8 = 256 different values. In other words, just as processing a base-10 number digit-by-digit is operating in radix-10, processing a binary number in units of eight bits means operating in radix-256.
Since we're not going to do any math with the keys, it may help to conceptually consider them simply as bit-strings instead of numbers. This gets rid of some baggage which isn't useful to the task at hand.
Because we're using a computer and operate on bits, instead of division and modulo operations, we use bit-shifts and bit-masking.
Below is a table of random 32-bit keys written out in hexadecimal, or base-16, which is a convenient notation for us. In hexadecimal a group of four bits is represented with a symbol (digit) from 0-F (0-9 plus A-F), and consequently a group of eight bits is represented by two such symbols.
32-bit key | A | B | C | D |
---|---|---|---|---|
7A8F97A4 | 7A | 8F | 97 | A4 |
F728B2E2 | F7 | 28 | B2 | E2 |
517833CD | 51 | 78 | 33 | CD |
9332B72F | 93 | 32 | B7 | 2F |
A35138CD | A3 | 51 | 38 | CD |
BBAD9DAF | BB | AD | 9D | AF |
B2667C54 | B2 | 66 | 7C | 54 |
8C8E59A6 | 8C | 8E | 59 | A6 |
If you consider this table, with the four 8-bit wide columns marked A through D, there's a choice to be made; if we're going to process these keys top to bottom, one column at a time, in which order do we process the columns?
This choice gives rise to the two main classes of radix sorts: those that are Least Significant Bits (LSB, bottom-up) first, and those that are Most Significant Bits (MSB, top-down) first. Sometimes 'digit' is substituted for 'bits'; it's the same idea.
If you have prior experience you may already know that, based on the material presented so far, we're going down the LSB path, meaning we'll process the columns from right to left; D, C, B, A.
In our counting sorts, the key width and the radix (or column) width were the same: 8 bits. In a radix sort the column width will be less than or equal to the key width, and in an LSB radix sort we'll be forced to make multiple passes over the input to make up the difference. The wider our radix, the more memory we need to track counts, but the fewer passes we have to make. This is a tradeoff.
The assertion then, which we will demonstrate to be true, is that if we apply counting sort by column D, then apply counting sort to that result by column C, and so forth, then after the last column (A) is processed our data will be sorted, and the sort is stable.
This is radix sort.
static void radix_sort_u32(struct sortrec *arr, struct sortrec *aux, size_t n)
{
	size_t cnt0[256] = { 0 };
	size_t cnt1[256] = { 0 };
	size_t cnt2[256] = { 0 };
	size_t cnt3[256] = { 0 };
	size_t i;

	// Generate histograms
	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);

		uint8_t k0 = (k >> 0) & 0xFF;
		uint8_t k1 = (k >> 8) & 0xFF;
		uint8_t k2 = (k >> 16) & 0xFF;
		uint8_t k3 = (k >> 24) & 0xFF;

		++cnt0[k0];
		++cnt1[k1];
		++cnt2[k2];
		++cnt3[k3];
	}

	// Calculate prefix sums.
	size_t a0 = 0;
	size_t a1 = 0;
	size_t a2 = 0;
	size_t a3 = 0;
	for (int j = 0 ; j < 256 ; ++j) {
		size_t b0 = cnt0[j];
		size_t b1 = cnt1[j];
		size_t b2 = cnt2[j];
		size_t b3 = cnt3[j];
		cnt0[j] = a0;
		cnt1[j] = a1;
		cnt2[j] = a2;
		cnt3[j] = a3;
		a0 += b0;
		a1 += b1;
		a2 += b2;
		a3 += b3;
	}

	// Sort in four passes from LSB to MSB
	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);
		uint8_t k0 = (k >> 0) & 0xFF;
		size_t dst = cnt0[k0]++;
		aux[dst] = arr[i];
	}
	SWAP(arr, aux);

	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);
		uint8_t k1 = (k >> 8) & 0xFF;
		size_t dst = cnt1[k1]++;
		aux[dst] = arr[i];
	}
	SWAP(arr, aux);

	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);
		uint8_t k2 = (k >> 16) & 0xFF;
		size_t dst = cnt2[k2]++;
		aux[dst] = arr[i];
	}
	SWAP(arr, aux);

	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);
		uint8_t k3 = (k >> 24) & 0xFF;
		size_t dst = cnt3[k3]++;
		aux[dst] = arr[i];
	}
}
The function `radix_sort_u32()` builds on `counting_sort_rec_sk()` in a straightforward manner by introducing four counting sort passes. The outer (pass) loop is unrolled by design, to show off the pattern.
The four histograms are generated in one pass through the input. These are re-processed into prefix sums in a separate step. A speed vs memory trade-off can be had by not pre-computing all the histograms.
We then sort columns D through A, swapping (a.k.a ping-ponging) the input and output buffer between the passes.
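`SWAP` only exchanges the two buffer pointers; no data is copied. A minimal sketch of such a macro, relying on C's implicit `void *` conversions (the repository's definition may differ):

```c
#define SWAP(a, b)             \
	do {                       \
		void *swap_tmp_ = (a); \
		(a) = (b);             \
		(b) = swap_tmp_;       \
	} while (0)
```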
After the fourth (final) sorting pass, since this is an even number of passes, the final result is available in the original input buffer.
A quick performance note: having four distinct arrays for the histograms will almost certainly generate imperfect code, e.g. you're likely to end up with separate calls to `memset`.
Extending to larger keys, e.g. 64-bit keys and beyond, can be achieved by simply adding more passes. For completeness, let us instead look at how we can augment the function to sort in multiple passes via a key-shift argument, as this is a very common implementation detail:
struct sortrec {
	uint64_t key;
	const char *name;
};

static uint64_t key_of(const struct sortrec *rec) {
	return rec->key;
}

static void radix_sort_u32_multipass(struct sortrec *arr, struct sortrec *aux, size_t n, unsigned int keyshift)
{
	size_t cnt0[256] = { 0 };
	size_t cnt1[256] = { 0 };
	size_t cnt2[256] = { 0 };
	size_t cnt3[256] = { 0 };
	size_t i;

	// Generate histograms
	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i) >> keyshift;
	[...]
At each point where we retrieve the key via `key_of`, we shift out any bits we've already processed in a previous pass, as given by the `keyshift` argument.
We can then sort 64-bit keys by calling the sort function twice, as such:
radix_sort_u32_multipass(arr, aux, N, 0);
radix_sort_u32_multipass(arr, aux, N, 32);
The first call will sort on bits 0-31 of the key, the second call on bits 32-63.
This is only possible because the sort is stable.
So far we've mostly had our code rearrange the values in the input array itself, but it's sometimes desirable to be able to treat the input as immutable.
The obvious solution is to generate an array of pointers into the input and then sort those pointers. This is usually how we do things when sorting record types (or strings) in a C-style language, since rearranging these implies copying a lot of data, which can be expensive, and because we can have multiple pointer arrays representing different sort orders over the same input.
An alternative is to sort using indices instead. You can think of it as assigning a number between 0 and N-1 to each object to be sorted, and then returning a permutation of these numbers that represents the sorted order. Some know this by the name argsort.

Example: take as input the array `A = { 2, 42, 1 }`. Assuming zero-based indexing, the rank array representing an ascending sort of the input is `R = { 2, 0, 1 }`, i.e. `{ A[R[0]], A[R[1]], A[R[2]] } = { 1, 2, 42 }`.
Having the sort return indices is also useful if we need to sort something which is split over different arrays, since the result can be applied to the original input as well as any 'parallel' arrays. This is demonstrated in the example code below, where we decouple `name` from the struct whose key we're sorting on.
So the goal is to modify our sorting function to return `R` instead of permuting `A`. We can achieve this with quite minor changes:
static uint32_t *radix_sort_u32_index(const struct sortrec * const arr, uint32_t *indeces, size_t n)

	[...]

	// Note: indeces2 is assumed to point at the second half of the
	// 2*n-entry index buffer, e.g. uint32_t *indeces2 = indeces + n;

	// Sort in four passes from LSB to MSB
	for (i = 0 ; i < n ; ++i) {
		uint8_t k0 = key_of(arr + i);
		size_t dst = cnt0[k0]++;
		indeces2[dst] = i;
	}

	for (i = 0 ; i < n ; ++i) {
		uint8_t k1 = key_of(arr + indeces2[i]) >> 8;
		size_t dst = cnt1[k1]++;
		indeces[dst] = indeces2[i];
	}

	[...]

	return indeces;
First note in the prototype that we declare the input array as `const`. This means we can't write to `arr`, and hence we guarantee that the input is undisturbed.
Instead of the old `aux` array we accept an array `indeces`, which must be allocated to hold twice the number of entries to be sorted. It's twice the length of the original array because we still need space to ping-pong reads and writes while sorting, and the input array is no longer available for writing.

In the first pass we read the input in order and write indices to their correct positions in the index buffer. In subsequent passes we alternate between reading via one half of the index buffer and writing indices to the other half.

As with sorting via pointers, the extra indirection from looking up the key via the index array is likely to have quite negative implications for performance due to the added data-dependency. There are solutions to this, which we'll have to revisit here at a later date.

On the plus side, we have now decoupled the size of the auxiliary buffer(s) from the size of the objects in the input array. Yes, we need to allocate a buffer twice the length of the one used when sorting by value, but we only need room for two indices per entry, so the size of the auxiliary buffer(s) is directly related to the number of objects being sorted. So for example, if we're sorting fewer than 2^16 objects, we can use 16-bit indices to save space and improve cache behaviour.
First let's get the question of sorting strings out of the way.
LSB radix sort is inherently columnar. The keys must all be the same width, and narrower is better to keep the number of passes down. To sort variable length keys you would first have to pad all keys on the left (MSD) side until they're the length of the longest key.
This is to say, an LSB implementation is not well suited to sorting character strings of any significant length. For those, an MSB radix sort is better.
For sorting floats, doubles or other fixed-width keys, or changing the sort order, we can still use the LSB implementation as-is. The trick is to map the keys onto the unsigned integers.
I'll use the term key-derivation for this, because we're not only extracting parts of a key; manipulating the key in different ways to derive a sort key is how we're able to handle signed integers and floating point keys, and how we choose a sort order.
The natural ordering for the unsigned integers is in ascending order. If we instead want to sort in descending (reverse) order, simply have the key-derivation function return the bitwise inverse (or complement) of the key:
return ~key; // unsigned (desc)
Note that for like-keys, the first record in the forward-direction of the input will also be the first record in the output. If this is not what you want, I suggest using a standard ascending sort and reading the result backwards, which obviously will give you the last like-key record first.
To treat the key as a signed integer, we need to manipulate the sign-bit, which is the top-most (most significant) bit. By default this bit is set for negative numbers, meaning they would appear at the end of the result. Using the xor operator we flip the top bit, which neatly solves the problem:
return key ^ (1UL << 31); // signed 32-bit (asc)
These can be combined to handle signed integers in descending order:
return ~(key ^ (1UL << 31)); // signed 32-bit (desc)
To sort IEEE 754 single-precision (32-bit) floats (a.k.a binary32) in their natural order, we need to invert the key if the sign-bit is set, and otherwise flip the sign bit (via Radix Tricks):
return key ^ (-(key >> 31) | (1UL << 31)); // 32-bit float (asc)
This looks complex, but the left side of the parenthetical converts a set sign-bit into an all-set bitmask (-1 equals ~0), which causes the xor to invert the whole key. The second expression in the parenthetical (after the bitwise or) sets the sign bit, which is a no-op if it was already set, but otherwise ensures that the xor flips the sign-bit only.
As an implementation detail for C and C++, `key` is the floating point key reinterpreted as an unsigned 32-bit integer to allow the bit-manipulation. This sort of type-punning, if done naively, can be considered undefined behaviour (UB). Post C++20 you should have the option to use `bit_cast<T>()` to do this in a well-defined way, but without that option we instead use `memcpy` into a local temporary. This pattern is recognized by compilers, but as always you should inspect the generated code to make sure.
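Putting that together, a minimal sketch of a key-derivation function for floats in ascending order (the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

static uint32_t float_key_asc(float f)
{
	uint32_t key;

	// Well-defined type-punning; compilers reduce this memcpy
	// to a plain register move.
	memcpy(&key, &f, sizeof key);

	// Invert the whole key for negative floats, otherwise just
	// flip the sign bit.
	return key ^ (-(key >> 31) | (1UL << 31));
}
```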
Example output for sorting `{ 128.0f, 646464.0f, 0.0f, -0.0f, -0.5f, 0.5f, -128.0f, -INFINITY, NAN, INFINITY }`:
00000000: ff800000 -inf
00000001: c3000000 -128.000000
00000002: bf000000 -0.500000
00000003: 80000000 -0.000000
00000004: 00000000 0.000000
00000005: 3f000000 0.500000
00000006: 43000000 128.000000
00000007: 491dd400 646464.000000
00000008: 7f800000 inf
00000009: 7fc00000 nan
So yes, you can sort with NaNs. Isn't life great?
All of these of course extend naturally to 64-bit keys; just change the shifts from 31 to 63 and adjust the types involved accordingly.
In this section I'll talk about some optimizations that I have tried, or that may be worth investigating.
At this point it must be noted that none of the code in this repository even attempts to defeat side-channel leaks, and some of the following optimizations add data-dependencies that will definitely open the door for timing attacks should you apply them in a cryptographic context.
I have observed a few different radix sort implementations, and some of them have a larger fixed overhead than others. This gives rise to the idea of hybrid sorts, where you redirect small workloads (e.g. N < 100) to, say, an insertion sort. This is often a part of MSB radix sorts.
It's also possible to do one MSB pass and then LSB sort the sub-results. This saves memory and memory management from doing all MSB passes, and should improve cache locality for the LSB passes. In the one implementation I've tried, the fixed overhead was quite high. (TODO: experiment/determine exact reason)
Since we have to scan once through the input to build our histograms, it's relatively easy to add code there to detect if the input is already sorted/reverse-sorted, and special case that.
One way to implement it is to initialize a variable to the number of elements to be sorted, and decrement it every time the current element is less than or equal to the next one, taking care at the array bounds. After a pass through the input, you have a count of the number of elements that still need sorting. If this number is less than two, you're done.
If there's a user-defined key-derivation function, apply it to the values being compared.
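A sketch of this detection fused into the histogram loop of the first counting sort (names are illustrative):

```c
	size_t unsorted = n;

	for (size_t i = 0 ; i < n ; ++i) {
		cnt[arr[i]]++;
		// The last element, and any element not greater than its
		// successor, counts as already in place.
		if (i + 1 == n || arr[i] <= arr[i + 1])
			unsorted--;
	}

	if (unsorted < 2)
		return; // Already sorted; nothing to do.
```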
Pushing further, say trying to detect already sorted columns, didn't seem worth the effort in my experiments. You don't want to add too many conditionals to the histogram loop.
If every key has the same value for a key-radix aligned portion, then the sort loop for that column will simply be a copy from one buffer to another, which is a waste.
Fortunately detecting a 'trivial pass' is easy, and does not -- as is sometimes shown -- require a loop over the histogram looking for a non-zero count and checking if it's equal the number of keys being sorted.
Instead we can simply sample any key, and check the corresponding histogram entries directly.
We take the first key (any will do), extract all the subkeys using whatever key-radix we're using, and then we directly probe those entries in the histogram(s). No searching required.
Iff the corresponding count is equal to the number of keys to be sorted, all the values in the column are the same and it can be skipped.
EASTL uses this optimization.
Notice how you could sample any key, and the result would be the same.
	int cols[4];
	int ncols = 0;

	uint32_t key0 = key_of(*src); // sample first entry
	uint8_t k0 = (key0 >> 0);
	uint8_t k1 = (key0 >> 8);
	[...]
	if (cnt0[k0] != num_entries)
		cols[ncols++] = 0;
	if (cnt1[k1] != num_entries)
		cols[ncols++] = 1;
	[...]
	// .. here the cols array contains the index of the ncols columns we must process.
Unlikely to be a win in the common case, but as a potential micro-optimization you could xor any two keys and skip probing the histograms for columns in which the combined bits are not all zero; if two keys differ within a column, that column can never be trivial.
Skipping columns has the side-effect of the final sorted result ending up in either the original input buffer or the auxiliary buffer, depending on whether you sorted an even or odd number of columns. I prefer to have the sort function return the pointer to the result, rather than add a copy step.
This section is under development, speculative, and has not been implemented.
The opportunity for column-skipping is fast and easy to detect, and efficient when activated, but suffers from limited opportunity in practice due to the requirement that every input entry has the same value in the column. The higher the radix (wider column), the fewer opportunities for skipping.
If we reduced the column width to one single bit, then any column with all 0s or all 1s could be skipped, at the cost of having to do key-width number of passes worst-case.
The insight here is that we only care about columns with mixed values, so we should never have to do more passes than there are such columns. The problem is that we do not want to use radix-1 because looking at a single bit at a time is not efficient in practice.
However, this gives rise to the following idea; we should be able to re-arrange the bit-columns in the key, as long as we preserve the relative order of the underlying sort keys, such that we maximize opportunity for column skipping.
In other words, if we know which 1-bit columns are relevant, i.e. those which are not all the same bit-value, then we could partition the keys such that all relevant columns move towards the LSB and all irrelevant ones towards the MSB. As long as we preserve the relative order of the relevant columns, this should create opportunities for a wider radix to skip columns at the MSB end of the key.
Calculating a mask of which columns are relevant can be done efficiently in one pass using simple bit-operations. We already do this pass to generate histograms, so little performance is lost here.
The mask tells us the maximum amount of bit-compaction that can be achieved, and so we can decide to skip it if it wouldn't expose any whole radix-N columns.
One basic first approach would be to use the mask to shift out/ignore low bits that are the same, as long as we can shift enough that a full column becomes available for skipping.
More optimally, the mask can be used together with a parallel bit-extraction instruction, where available, to pack the keys. On AMD64 this instruction is called `PEXT` and is part of the BMI2 instruction set.
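Since this is unimplemented, the following is only a sketch of how the relevance mask and PEXT-based packing might look (assumes BMI2, compile with -mbmi2; names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h> // _pext_u32

// One pass of xors tells us which bit-columns are mixed: a bit is
// set in the mask iff some pair of keys disagrees in that column.
static uint32_t relevance_mask(const uint32_t *keys, size_t n)
{
	uint32_t mask = 0;
	for (size_t i = 1 ; i < n ; ++i)
		mask |= keys[0] ^ keys[i];
	return mask;
}

// PEXT packs the masked bits towards the LSB while preserving their
// relative order, so sorting the packed keys sorts the originals.
static uint32_t pack_key(uint32_t key, uint32_t mask)
{
	return _pext_u32(key, mask);
}
```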
This is an active area of research.
Unlike a typical implementation, which would likely repeatedly call a counting sort and therefore need as many histogramming passes as there are sorting passes, the idea of pre-calculating the histograms for multiple sorting passes at once comes naturally when writing the radix sort in the unrolled form.
Reducing the size of the histograms by using the smallest counter type possible will have a positive performance impact due to better cache utilization. If you can get away with 32-bit or even 16-bit counters, it will almost certainly pay off. In an experiment, halving the histogram size by changing the counter type from `size_t` to `uint32_t` improved performance when sorting N < 100 items by ~33%. This effect fades out as the input size grows in proportion to the histograms.
It's possible to do the histogramming for the next pass within the sort loop of the current pass, reducing the memory footprint to two histogram buffers in the unrolled form.
EASTL, while having a conventional outer sorting loop, uses this fusing approach anyway, noting that it "avoids memory traffic / cache pressure of reading keys in a separate operation".
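As a sketch of the idea, the first sorting pass of the unrolled `radix_sort_u32` above might be fused like this; the initial histogram pass would then only build cnt0, and the prefix sums for cnt1 are computed between the passes:

```c
	// Pass 0: sort on column D while histogramming column C for pass 1.
	for (i = 0 ; i < n ; ++i) {
		uint32_t k = key_of(arr + i);
		uint8_t k0 = (k >> 0) & 0xFF;
		++cnt1[(k >> 8) & 0xFF]; // histogram for the next pass
		aux[cnt0[k0]++] = arr[i];
	}
```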
Using multiples of eight bits for the radix is convenient, but not required. If we do, we limit ourselves to 8-bit or 16-bit wide radixes in practice. Using an intermediate size such as 11 bits with three passes saves us one pass for every 32 bits of key and may fit a low-level cache (L1/L2) better, but it also makes it less likely for the column skipping to kick in, and adds a few extra masking operations.
Going narrower could allow us to skip more columns in common workloads. There's definitely room for experimentation in this area, maybe even trying non-uniformly wide radixes (8-16-8) or dynamic selection of radix widths. If the data you're sorting isn't random, adapting the radix to the data may yield positive results, e.g if you know you are sorting floats in the range -1 to +1, or only care about a certain number of bits of precision, then this may be exploitable.
That said, in my limited initial testing, neither 4- nor 11-bit wide radix showed any promise at all in sorting random 32-bit integers. Quite the contrary, both were bad across the board.
The 4-bit version never had much going for it, so no big surprise there. Perhaps if you have to sort on a tiny microprocessor it may be worth going down in radix, but on a modern "big CPU" I'm confident saying it is not.
The 11-bit version on the other hand was surprisingly bad, and was never even close to beating the standard 8-bit radix implementation, even when reducing the counter width down to 32-bits to compensate for the larger number of buckets. This variant is one you sometimes see in the wild, and I can guarantee you that it's almost certainly a pessimization over just using octets.
My current interpretation as to why, is that increasing L1D cache pressure with larger histograms is actually detrimental to overall performance.
There is perhaps a world with bigger caches and faster memory where going wider is better, but for now it seems eight bits reigns supreme.
Instead of applying the key-derivation function on each access, you could parameterize the sort on two functions; One to rewrite the input buffer into the desired key-derivation, and the other to transform the final output buffer back to the original form.
The savings are supposed to come from not having to apply the key-derivation function to each key during the intermediate sorting passes. In my testing a performance increase from this micro-optimization -- rewriting in the histogramming loop and the last sorting pass -- could not be measured even at 40 million keys. It's possible that this is worth it on less capable hardware.
It's also possible to change the underlying code instead of manipulating the key. You can reverse the sort-order by reversing the prefix sum, or simply reading the final result backwards if possible.
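For example, a descending order can be had by accumulating the prefix sums from the top bucket down; a sketch mirroring the ascending loop shown earlier:

```c
	// Calculate prefix sums for a descending sort: the highest
	// key value gets the lowest output offset.
	size_t a = 0;
	for (int j = 255 ; j >= 0 ; --j) {
		size_t b = cnt[j];
		cnt[j] = a;
		a += b;
	}
```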
The 2000-era Radix Sort Revisited presents a direct way to change the code to handle floats.
Working primarily on IA-32 and AMD64/x86-64 CPUs, I've never had a good experience with adding manual prefetching (i.e. `__builtin_prefetch`).
Michael Herf reports a "25% speedup" from adding `_mm_prefetch` calls to his code that was running on a Pentium 3. I'll put this in the TBD column for now.
See "Prefix Sum with SIMD" for various approaches to using SIMD (AVX2 specifically) to calculate prefix sums.
There are optimized histogram functions out there, and with newer instruction set extensions it certainly seems more viable, but I have yet to personally explore this space.
For other architectures this may be more or less viable.
TODO: This code is very much a work in progress, used to test various techniques, and does NOT represent a final 'product'.
radix_sort.hpp radix_tests.cpp
By default we build an executable called `radix`. This is a test harness of sorts, with some options to let you test different setups.
$ ./radix
Usage: ./radix <count> [<use_mmap> <use_huge> <uint8_t|uint16_t|uint32_t|uint64_t|int32_t|int64_t|float|double> <hex-mask>]
The first argument is the number of integers to sort (from `40M_32bit_keys.dat`), with zero meaning all:
$ ./radix 0
To sort all the available data as unsigned 16-bit integers, using `mmap` for memory allocation, and forcing a column skip:
$ ./radix 0 1 0 uint16_t 0x00ff
The second argument is a flag controlling the use of `mmap` to map the input file and allocate memory. Defaults to `0`.
The third argument is a flag that controls the use of hugepages, i.e. very large memory pages. The presence of this feature is not guaranteed, and allocation may fail on some systems when used in combination with `mmap`. Defaults to `0`.
The fourth argument is the type, i.e. what type to interpret the input file as an array of. Defaults to `uint32_t`.
The fifth argument is a mask, expressed in hexadecimal, that will be bitwise AND-ed against the input before it's sorted. This can be used to demonstrate the column-skipping functionality, e.g. by passing 0x00FFFFFF the MSD column should be skipped. Defaults to no masking.
The `bench` Make target will build and run a benchmark comparing the (WIP) C++ implementation against `std::sort` and stdlib `qsort`.
This code is built on top of, and requires the prior installation of, the Google benchmark microbenchmark support library, which is available as `libbenchmark-dev` on Debian and Ubuntu.
It's not worth talking about specific numbers at this point in time, but some general notes are in order.
You will notice that there's a cut-off point, usually after a few hundred or so keys, where radix sort starts beating the other algorithms. This is due to the overhead of initializing, generating and accessing histograms. We can always implement a hybrid which switches to a sort that handles small inputs better, giving us the best of both worlds.
The cost of allocating the auxiliary sorting buffer is not included in the timings by default. If you care about that, the benchmark can easily be updated to move the aux buffer allocation into the benchmark loop. In practice, repeated allocation and deallocation of a same-sized block is likely to be immaterial, especially when there is little other pressure on the memory allocator during the benchmark.
Because we're sorting random data, the column-skipping optimization and pre-sorted detection is extremely unlikely to kick in, so while the benchmark is realistic, it is by no means a best-case scenario.
The main downside to out-of-place implementations is that the natural interface is different from that of "regular" in-place sorts. This becomes a problem when you need to conform to pre-existing interfaces in e.g. programming language standard libraries and tools. You can work around this in various ways, but each either sacrifices performance or complicates the implementation: take the hit of allocating and deallocating the auxiliary storage on every call, or keep a lazily- or pre-allocated, and possibly growing, block of memory around that can be reused.
They are also, as a general rule, not great for very small inputs.
Consider what it would mean if in a counting sort, instead of a typical 32- or 64-bit counter, we only had one bit per integer.
Instead of an array of counters, we have a bitmap. Instead of increasing a counter each time we see a value, we simply mark it off as present in the bitmap.
This would only work if there were zero or one occurrences of each possible input value. More correctly, it would only work if we're okay with removing any duplicates in the input, leaving only unique values.
static void bitmap_sort_16(uint16_t *arr, uint64_t *bitmap, size_t n)
{
	// Mark integers as present in the (pre-zeroed) bitmap. Note 1ULL; 1UL may be only 32 bits wide.
	for (size_t i = 0 ; i < n ; ++i) {
		bitmap[arr[i] >> 6] |= (1ULL << (arr[i] & 63));
	}
	// Write the marked integers back in ascending order, duplicates removed.
	// (A real implementation would probably return j, the number of unique values.)
	size_t j = 0;
	for (uint32_t v = 0 ; v < 65536 ; ++v) {
		if (bitmap[v >> 6] & (1ULL << (v & 63)))
			arr[j++] = (uint16_t)v;
	}
}
Since we're using a single bit per value, we extend the magnitude of keys we can handle per unit of memory by a factor of 32 or 64. We can now go up to 24-bit keys in only 2MiB of memory.
Jon Bentley originally presented a scenario where this approach could be useful in his book Programming Pearls, 2nd ed, section 1.2, and in his Dr. Dobb's column Algorithm Alley.
- More graphs, performance metrics.
- The other path: MSB Radix String Sorting
- Bucket sort
- American Flag Sort
- Asserting on histogram counter overflow.
- Pierre Terdiman, "Radix Sort Revisited", 2000.
- Pierre Terdiman, "Radix Redux", 2008 (PDF).
- Michael Herf, "Radix Tricks", 2001.
Footnotes

[^1]: As the wikipedia page explains, it's really the maximum difference of the keys involved that matters. Some implementations can be seen scanning the input data to determine and allocate just enough entries to fit either `max(key) + 1`, or tighter, `max(key) - min(key) + 1` keys. However, unless you do this because you allocate memory on each sort call and just want to save some, you will most likely have to consider what to do if the input range is too wide to handle, which is not a good position to be in. In practice you would never want to fail on some inputs, which makes that type of implementation not very useful.