Blog post about new hash table #195
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| --- | ||
| title: Viktor Söderqvist | ||
| extra: | ||
| photo: '/assets/media/authors/zuiderkwast.jpeg' | ||
| github: zuiderkwast | ||
| --- | ||
|
|
||
| Viktor is an open source developer from Ericsson and one of the maintainers of Valkey. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,229 @@ | ||
| +++ | ||
| title= "A new hash table" | ||
| date= 2025-03-28 00:00:00 | ||
| description= "Designing a state-of-the-art hash table" | ||
| authors= ["zuiderkwast"] | ||
| +++ | ||
|
|
||
| Many workloads are bound by how much data they can store. Being able to store | ||
| more data using less memory allows you to reduce the size of your clusters. | ||
|
|
||
| In Valkey, keys and values are stored in what's called a hash table. A hash | ||
| table works by chopping a key into a number of seemingly random bits. These bits | ||
| are shaped into a memory address, pointing to where the value is supposed to be | ||
| stored. It's a very fast way of jumping directly to the right place in memory | ||
| without scanning through all the keys. | ||
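As a minimal sketch of that idea, the code below hashes a key and masks the result down to a slot index. The FNV-1a hash and the names `fnv1a_hash` and `slot_index` are assumptions made for illustration; they are not the hash function or the identifiers Valkey uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, a simple well-known hash, used here only for illustration.
 * It is an assumption of this sketch, not Valkey's hash function. */
static uint64_t fnv1a_hash(const char *key, size_t len) {
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Map a hash to a slot in a table whose size is a power of two.
 * Masking keeps the low bits of the hash. */
static size_t slot_index(uint64_t hash, size_t table_size) {
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return (size_t)(hash & (uint64_t)(table_size - 1));
}
```

With a table of 1024 slots, `slot_index(fnv1a_hash("FOO", 3), 1024)` always yields the same slot for the same key, which is what lets a lookup jump straight to it.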
|
|
||
| For the 8.1 release, we looked into improving the performance and memory usage, | ||
| so that users can store more data using less memory. This work led us to the | ||
| design of a new hash table, but first, let's take a look at the hash table that | ||
| was used in Valkey until now. | ||
|
|
||
| The dict | ||
| -------- | ||
|
|
||
| The hash table used in Valkey until now, called "dict", has the following memory | ||
| layout: | ||
|
|
||
| <!--  --> | ||
|
|
||
| ``` | ||
| +---------+ | ||
| | dict | table | ||
| +---------+ +-----+-----+-----+-----+-----+-----+----- | ||
| | table 0 ----->| x | x | x | x | x | x | ... | ||
| | table 1 | +-----+-----+--|--+-----+-----+-----+----- | ||
| +---------+ | | ||
| v | ||
| +-----------+ +-------+ | ||
| | dictEntry | .--->| "FOO" | | ||
| +-----------+ / +-------+ | ||
| | key -----' | ||
| | | +-------------------+ | ||
| | value ----------->| serverObject | | ||
| | | +-------------------+ | ||
| | next | | type, encoding, | | ||
| +-----|-----+ | ref-counter, etc. | | ||
| | | "BAR" (embedded) | | ||
| v +-------------------+ | ||
| +-----------+ | ||
| | dictEntry | | ||
| +-----------+ | ||
| | key | | ||
| | value | | ||
| | next | | ||
| +-----|-----+ | ||
| | | ||
| v | ||
| ... | ||
| ``` | ||
|
|
||
| The dict has two tables, called "table 0" and "table 1". Usually only one | ||
| exists, but both are used when incremental rehashing is in progress. | ||
|
|
||
| It's a chained hash table, so if multiple keys are hashed to the same slot in | ||
| the table, their key-value entries form a linked list. That's what the "next" | ||
| pointer in the `dictEntry` is for. | ||
|
|
||
| To look up a key "FOO" and access the value "BAR", Valkey has to read from | ||
| memory four times. If there is a hash collision, it has to follow two more | ||
| pointers for each colliding entry and thus read twice more from memory (the key | ||
| and the next pointer). | ||
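The lookup path described above can be sketched roughly as follows. The types `entry` and `chained_table` and the function `chained_find` are simplified stand-ins invented for this example; the real `dict` and `dictEntry` definitions differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the structures in the figure. Field names and
 * types are assumptions, not the actual Valkey definitions. */
typedef struct entry {
    const char *key;      /* separate allocation holding the key */
    void *value;          /* in Valkey: pointer to a serverObject */
    struct entry *next;   /* next entry that hashed to the same slot */
} entry;

typedef struct {
    entry **table;        /* array of slots, each the head of a chain */
    size_t size;          /* number of slots, a power of two */
} chained_table;

/* Returns the value pointer for key, or NULL if the key is absent. */
static void *chained_find(const chained_table *t, uint64_t hash, const char *key) {
    assert(t != NULL && key != NULL);
    entry *e = t->table[hash & (t->size - 1)];   /* read the table slot */
    while (e != NULL) {
        /* Reading e->key loads the entry, strcmp then loads the key itself. */
        if (strcmp(e->key, key) == 0)
            return e->value;   /* the caller reads the value object next */
        e = e->next;           /* collision: one more entry and one more key */
    }
    return NULL;
}
```

Each step of the loop touches two separate allocations, the entry and its key, which is where the two extra reads per collision come from.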
|
|
||
| Minimize memory accesses | ||
| ------------------------ | ||
|
|
||
| One of the slower operations when looking up a key-value pair is reading from | ||
| main memory (RAM). A key point is therefore to make sure we have as few memory | ||
| accesses as possible. Ideally, the memory we want to access should already be | ||
| stored in the CPU cache, which is a smaller but much faster memory that belongs | ||
| to the CPU. | ||
|
|
||
| Optimizing for memory usage, we also want to minimize the number of distinct | ||
| memory allocations and the number of pointers between them, because storing a | ||
| pointer needs 8 bytes in a 64-bit system. If we can save one pointer per | ||
| key-value pair, for 100 million keys that's 800 MB, almost a gigabyte. | ||
|
|
||
| When the CPU loads some data from the main memory into the CPU cache, it does so | ||
| in fixed-size blocks called cache lines. The cache-line size is 64 bytes on | ||
| almost all modern hardware. Recent work on hash tables, such as [Swiss | ||
| tables](https://abseil.io/about/design/swisstables), is highly optimized to | ||
| store and access data within a single cache line. If the key you're looking | ||
| for isn't found where you first look for it (due to a hash collision), then it | ||
| should ideally be found within the same cache line. If it is, then it's found | ||
| very fast once this cache line has been loaded into the CPU cache. | ||
|
|
||
| Required features | ||
| ----------------- | ||
|
|
||
| Why not use an open-source state-of-the-art hash table implementation such as | ||
| Swiss tables? The answer is that we require some specific features, apart from | ||
| the basic operations like add, lookup, replace and delete: | ||
|
|
||
| * Incremental rehashing, so that when the hash table is full, we don't freeze the | ||
| server while the table is being resized. | ||
|
|
||
| * Scan, a way to iterate over the hash table even if the hash table is resized | ||
| between the iterations. This is important to keep supporting the | ||
| [SCAN](/commands/scan/) command. | ||
|
|
||
| * Random element sampling, for commands like [RANDOMKEY](/commands/randomkey/). | ||
|
|
||
| These aren't standard features, so we couldn't simply pick an off-the-shelf hash | ||
| table. We had to design one ourselves. | ||
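To make the first requirement concrete, here is a minimal sketch of incremental rehashing. It uses a plain chained table and treats the key as its own hash to keep the example short, so it is not the Valkey implementation; all names (`inc_table`, `inc_rehash_step` and so on) are hypothetical. The point is that each operation migrates only one slot of the old table, so no single call pays for the whole resize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node { uint64_t key; struct node *next; } node;

typedef struct {
    node **table[2];     /* table[0]: current table, table[1]: resize target */
    size_t size[2];      /* slot counts, powers of two; size[1] == 0 when idle */
    size_t rehash_idx;   /* next slot of table[0] to migrate */
} inc_table;

static int inc_is_rehashing(const inc_table *t) { return t->size[1] != 0; }

static void inc_init(inc_table *t, size_t size) {
    t->table[0] = calloc(size, sizeof(node *));
    assert(t->table[0] != NULL);
    t->size[0] = size;
    t->table[1] = NULL;
    t->size[1] = 0;
    t->rehash_idx = 0;
}

/* Allocate the new table; no entries are moved yet. */
static void inc_start_resize(inc_table *t, size_t new_size) {
    assert(!inc_is_rehashing(t));
    t->table[1] = calloc(new_size, sizeof(node *));
    assert(t->table[1] != NULL);
    t->size[1] = new_size;
    t->rehash_idx = 0;
}

/* Migrate the chain in one slot of the old table to the new table. */
static void inc_rehash_step(inc_table *t) {
    if (!inc_is_rehashing(t)) return;
    node *n = t->table[0][t->rehash_idx];
    t->table[0][t->rehash_idx] = NULL;
    while (n != NULL) {
        node *next = n->next;
        size_t i = (size_t)(n->key & (t->size[1] - 1));
        n->next = t->table[1][i];
        t->table[1][i] = n;
        n = next;
    }
    if (++t->rehash_idx == t->size[0]) {   /* old table is drained */
        free(t->table[0]);
        t->table[0] = t->table[1];
        t->size[0] = t->size[1];
        t->table[1] = NULL;
        t->size[1] = 0;
        t->rehash_idx = 0;
    }
}

/* New entries go to the new table while a rehash is in progress. */
static void inc_insert(inc_table *t, node *n) {
    inc_rehash_step(t);
    int which = inc_is_rehashing(t) ? 1 : 0;
    size_t i = (size_t)(n->key & (t->size[which] - 1));
    n->next = t->table[which][i];
    t->table[which][i] = n;
}

/* Lookups must check both tables while a rehash is in progress. */
static node *inc_find(inc_table *t, uint64_t key) {
    inc_rehash_step(t);
    for (int which = 0; which < 2; which++) {
        if (t->size[which] == 0) continue;
        size_t i = (size_t)(key & (t->size[which] - 1));
        for (node *n = t->table[which][i]; n != NULL; n = n->next)
            if (n->key == key) return n;
    }
    return NULL;
}
```

Scan and random sampling have to cope with the same two-table state, which is part of why an off-the-shelf table is hard to drop in.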
|
|
||
| Design | ||
| ------ | ||
|
|
||
| In the new hash table designed for Valkey 8.1, the table consists of buckets of | ||
| 64 bytes, one cache line. Each bucket can store up to seven elements. Keys that | ||
| hash to the same bucket index are all stored in that bucket. The bucket also | ||
| contains a metadata section, marked "m" in the figures. The bucket layout | ||
| including the metadata section is explained in more detail below. | ||
|
|
||
| We've eliminated the `dictEntry` and instead embed key and value in the | ||
| `serverObject`, along with other data for the key. | ||
|
|
||
| ``` | ||
| +-----------+ | ||
| | hashtable | bucket bucket bucket | ||
| +-----------+ +-----------------+-----------------+-----------------+----- | ||
| | table 0 ------>| m x x x x x x x | m x x x x x x x | m x x x x x x x | ... | ||
| | table 1 | +-----------------+-----|-----------+-----------------+----- | ||
| +-----------+ | | ||
| v | ||
| +------------------------+ | ||
| | serverObject | | ||
| +------------------------+ | ||
| | type, encoding, LRU, | | ||
| | ref-counter, etc. | | ||
| | "FOO" (embedded key) | | ||
| | "BAR" (embedded value) | | ||
| +------------------------+ | ||
| ``` | ||
|
|
||
| Assuming the `hashtable` structure is already in the CPU cache, looking up a | ||
| key-value entry now requires only two memory lookups: the bucket and the | ||
| `serverObject`. If there is a hash collision, the object we're looking for is | ||
| most likely in the same bucket, so no extra memory access is required. | ||
|
|
||
| If a bucket becomes full, the last element slot in the bucket is replaced by a | ||
| pointer to a child bucket. A child bucket has the same layout as a regular | ||
| bucket, but it's a separate allocation. The length of these bucket chains is | ||
| not bounded, but long chains are very rare as long as keys are well distributed | ||
| by the hashing function. Most of the keys are stored in top-level buckets. | ||
|
|
||
| ``` | ||
| +-----------+ | ||
| | hashtable | bucket bucket bucket | ||
| +-----------+ +-----------------+-----------------+-----------------+----- | ||
| | table 0 ---->| m x x x x x x x | m x x x x x x c | m x x x x x x x | ... | ||
| | table 1 | +-----------------+---------------|-+-----------------+----- | ||
| +-----------+ | | ||
| child bucket v | ||
| +-----------------+ | ||
| | m x x x x x x c | | ||
| +---------------|-+ | ||
| | | ||
| child bucket v | ||
| +-----------------+ | ||
| | m x x x x x x x | | ||
| +-----------------+ | ||
| ``` | ||
|
|
||
| The elements in the same bucket, or bucket chain, are stored without any | ||
| internal ordering. When inserting a new entry into the bucket, any of the free | ||
| slots can be used. | ||
|
|
||
| As mentioned earlier, each bucket also contains a metadata section. The bucket | ||
| metadata consists of eight bytes. In the first byte, one bit indicates whether | ||
| the bucket has a child bucket and the other seven bits, one for each of the | ||
| seven element slots, indicate whether that slot is filled, i.e. whether it | ||
| contains an element. The remaining seven bytes store a one-byte secondary hash | ||
| for each of the entries stored in the bucket. | ||
|
|
||
|  | ||
|
|
||
| The secondary hash is made up of hash bits that are not used when looking up the | ||
| bucket. Out of a 64-bit hash, we need no more than 56 bits for looking up the | ||
| bucket and we use the remaining 8 bits as the secondary hash. These hash bits | ||
| are used for quickly eliminating mismatching entries when looking up a key | ||
| without comparing the keys. Comparing the keys of each entry in the bucket would | ||
| require an extra memory access per entry. If the secondary hash mismatches the | ||
| key we're looking for, we can immediately skip that entry. The chance of a false | ||
| positive, meaning an entry for which the secondary hash is matching although the | ||
| entry doesn't match the key we're looking for, is one in 256, so this eliminates | ||
| 99.6% of the unnecessary key comparisons. | ||
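A rough sketch of this filtering step is shown below. It assumes the top 8 bits of the hash form the secondary hash and the low bits select the bucket; the text only says the two sets of bits don't overlap, so the exact split is an assumption, as are the names. The bucket here holds key pointers directly and omits the child bit to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 7

typedef struct {
    uint8_t presence;          /* bit i set: slot i holds an element */
    uint8_t hashes[SLOTS];     /* secondary hash of the element in slot i */
    const char *keys[SLOTS];   /* stand-in for objects with embedded keys */
} bucket;

/* Assumed split: the top 8 bits are the secondary hash. */
static uint8_t secondary_hash(uint64_t hash) { return (uint8_t)(hash >> 56); }

/* The low bits select the bucket. With at most 56 bits used here, the
 * bucket index never overlaps the secondary hash. */
static size_t bucket_index(uint64_t hash, unsigned exp) {
    assert(exp <= 56);
    return (size_t)(hash & ((UINT64_C(1) << exp) - 1));
}

/* Returns the slot holding key, or -1. key_compares counts how many full
 * key comparisons (and thus extra memory accesses) were needed. */
static int bucket_find(const bucket *b, uint64_t hash, const char *key,
                       int *key_compares) {
    uint8_t h2 = secondary_hash(hash);
    for (int i = 0; i < SLOTS; i++) {
        if (!((b->presence >> i) & 1u)) continue;   /* empty slot */
        if (b->hashes[i] != h2) continue;           /* cheap rejection */
        (*key_compares)++;
        if (strcmp(b->keys[i], key) == 0) return i;
    }
    return -1;
}
```

The presence bits and secondary hashes live in the bucket's own cache line, so rejecting a slot costs no additional memory access; only a matching secondary hash leads to a key comparison.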
|
|
||
| Results | ||
| ------- | ||
|
|
||
| By replacing the hash table with a different implementation, we've managed to | ||
| reduce the memory usage by roughly 20 bytes per key-value pair. | ||
|
|
||
| The graph below shows the memory overhead for different value sizes. The | ||
| overhead is the memory usage excluding the key and the value itself. Lower is | ||
| better. The zigzag pattern is because of aliasing between the datapoint spacing | ||
| and the memory allocator's discrete allocation sizes. | ||
|
||
|
|
||
|  | ||
|
|
||
| For keys with an [expire time](/commands/expire/) (time-to-live, TTL) the memory | ||
| usage is down even more, roughly 30 bytes per key-value pair. | ||
|
|
||
|  | ||
|
|
||
| In some workloads, such as when storing very small objects and when pipelining | ||
| is used extensively, the latency and CPU usage are also improved. In most cases | ||
| though this is negligible in practice. The key takeaway appears to be reduced | ||
| memory usage. | ||
|
|
||
| Hashes, sets and sorted sets | ||
| ---------------------------- | ||
|
|
||
| The nested data types Hashes, Sets and Sorted sets also make use of the new hash | ||
| table when they contain a sufficiently large number of elements. The memory | ||
| usage is down by roughly 10-20 bytes per element for these types of keys. | ||
|
|
||
| Special thanks to Rain Valentine for the graphs and for the help with | ||
| integrating this hash table into Valkey. | ||