[RESP] Optimize command parsing with SIMD fast path and O(1) hash table lookup #1658
Add RespCommandHashLookup: a cache-friendly O(1) hash table for RESP command name resolution. Uses hardware CRC32 (SSE4.2/ARM) for hashing, 32-byte cache-line-aligned entries, and linear probing within L1 cache.

Key changes:
- New RespCommandHashLookup.cs: static hash table (512 entries, 16KB) mapping uppercase command name bytes to RespCommand enum values
- Per-parent subcommand hash tables for CLUSTER, CONFIG, CLIENT, ACL, COMMAND, SCRIPT, LATENCY, SLOWLOG, MODULE, PUBSUB, MEMORY, BITOP
- ArrayParseCommand now uses hash lookup for primary commands instead of the ~950-line FastParseArrayCommand nested switch/if-else chains
- BITOP pseudo-subcommands (AND/OR/XOR/NOT/DIFF) handled inline via dedicated ParseBitopSubcommand method with hash-based subcommand lookup
- Subcommand dispatch (CLUSTER, CONFIG, etc.) falls through to existing SlowParseCommand for full backward compatibility
- FastParseCommand hot path (GET, SET, PING, DEL) is completely untouched

Performance: O(1) hash lookup (~10-12 cycles) replaces O(n) sequential comparisons (~30-300 cycles) for the long tail of ~170+ commands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
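The lookup scheme described in this commit (hash the command name, mask into a power-of-two table, probe linearly) can be sketched roughly as below. This is a minimal illustration and not the code in `RespCommandHashLookup.cs`: the class name `CommandTableSketch`, the `Entry` layout (a `byte[]` reference rather than packed 32-byte entries), the integer ids, and the non-hardware fallback hash are all assumptions made for the example.

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using ArmCrc32 = System.Runtime.Intrinsics.Arm.Crc32;

// Hypothetical sketch of a CRC32-hashed, linearly probed command table.
public static class CommandTableSketch
{
    private const int Size = 512;      // power of two, so (hash & Mask) replaces a modulo
    private const int Mask = Size - 1;

    // Simplified entry: the PR packs the name into fixed-size words instead.
    private struct Entry { public byte[] Name; public int Id; }

    private static readonly Entry[] table = new Entry[Size];

    private static uint Hash(ReadOnlySpan<byte> name)
    {
        uint h = (uint)name.Length;
        foreach (byte b in name)
        {
            if (Sse42.IsSupported) h = Sse42.Crc32(h, b);                      // x86 CRC32C instruction
            else if (ArmCrc32.IsSupported) h = ArmCrc32.ComputeCrc32C(h, b);   // ARM CRC32C instruction
            else h = (h ^ b) * 16777619u;                                      // assumed software fallback
        }
        return h;
    }

    // Assumes the table never fills up; a full table would make the probe loop spin forever.
    public static void Add(string name, int id)
    {
        byte[] bytes = System.Text.Encoding.ASCII.GetBytes(name);
        int slot = (int)(Hash(bytes) & Mask);
        while (table[slot].Name != null) slot = (slot + 1) & Mask;  // linear probing
        table[slot] = new Entry { Name = bytes, Id = id };
    }

    // Returns the id registered for an (already uppercased) name, or -1 if unknown.
    public static int Lookup(ReadOnlySpan<byte> name)
    {
        int slot = (int)(Hash(name) & Mask);
        while (table[slot].Name != null)
        {
            if (name.SequenceEqual(new ReadOnlySpan<byte>(table[slot].Name))) return table[slot].Id;
            slot = (slot + 1) & Mask;
        }
        return -1;
    }
}
```

Because the same `Hash` function is used for both insertion and lookup within a process, the table stays consistent regardless of which branch the hardware check selects.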
Add three optimization tiers to RESP command parsing:
Tier 1 - SIMD Vector128 FastParseCommand:
- 30 static Vector128<byte> patterns matching full RESP encoding (*N\r\n$L\r\nCMD\r\n)
- Single 16-byte load + masked comparison validates header + command in one op
- Covers top commands: GET, SET, DEL, TTL, PING, INCR, DECR, EXISTS, etc.
- Falls through to existing scalar ulong switch for variable-arg commands
Tier 2 - CRC32 hash table (RespCommandHashLookup):
- 512-entry cache-line-aligned table (16KB, L1-resident) with hardware CRC32 hash
- O(1) lookup for ~200 primary commands + 12 subcommand tables
- Replaces ~950-line FastParseArrayCommand nested switch/if-else chains
- BITOP pseudo-subcommands handled via dedicated ParseBitopSubcommand
Tier 3 - SlowParseCommand (existing):
- Subcommand dispatch for admin commands (CLUSTER, CONFIG, ACL, etc.)
Additional optimizations:
- HashLookupCommand uses GetCommand instead of GetUpperCaseCommand
(MakeUpperCase already uppercased the buffer, avoiding redundant work)
- TryParseCustomCommand moved after hash lookup (built-in commands
are far more common than custom extensions)
- FastParseCommand hot path preserved as scalar fallback for edge cases
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
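To make the Tier 1 idea concrete, here is a rough sketch of a single masked 16-byte comparison against a full RESP header plus command name. It assumes .NET 7 or later for the `Vector128` helpers used; the names `SimdPatternSketch`, `Build`, and `IsGet` are hypothetical and do not correspond to identifiers in the PR.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Text;

// Hypothetical sketch: match "*2\r\n$3\r\nGET\r\n" (13 bytes) with one load, one AND, one compare.
public static class SimdPatternSketch
{
    // Builds a zero-padded 16-byte pattern and a mask with 0xFF over the pattern bytes.
    private static (Vector128<byte> Pattern, Vector128<byte> Mask) Build(string resp)
    {
        Span<byte> p = stackalloc byte[16];
        Span<byte> m = stackalloc byte[16];
        p.Clear();
        m.Clear();
        int n = Encoding.ASCII.GetBytes(resp, p);   // throws if resp is longer than 16 bytes
        m.Slice(0, n).Fill(0xFF);
        return (Vector128.Create<byte>(p), Vector128.Create<byte>(m));
    }

    private static readonly (Vector128<byte> Pattern, Vector128<byte> Mask) GetPattern =
        Build("*2\r\n$3\r\nGET\r\n");

    public static bool IsGet(ReadOnlySpan<byte> buffer)
    {
        // A 16-byte load needs 16 readable bytes; shorter buffers take a scalar path instead.
        if (buffer.Length < 16) return false;
        Vector128<byte> input = Vector128.Create<byte>(buffer.Slice(0, 16));
        // Masking zeroes the bytes past the pattern, so trailing argument bytes are ignored.
        return Vector128.EqualsAll(input & GetPattern.Mask, GetPattern.Pattern);
    }
}
```

A complete `GET k` request is 20 bytes on the wire, so the 16-byte load is always in bounds for well-formed single-key commands; anything shorter would fall back to the scalar switch.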
…add PING to parsing benchmark
Benchmarks ParseRespCommandBuffer directly to measure pure parsing throughput. Commands categorized by their position in the OLD parser:
- Tier 1a SIMD: PING, GET, SET, INCR, EXISTS
- Tier 1b Scalar: SETEX, EXPIRE
- FastParseArrayCommand top: HSET, LPUSH, ZADD
- FastParseArrayCommand deep: ZRANGEBYSCORE, ZREMRANGEBYSCORE, HINCRBYFLOAT
- SlowParseCommand: SUBSCRIBE, GEORADIUS, SETIFMATCH

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2-entry MRU cache sits after SIMD patterns but before scalar switch in FastParseCommand. Caches the last 2 matched command patterns as Vector128 + mask, enabling 3-op cache hits for repeated Tier 1b/2 commands (HSET, LPUSH, ZADD etc.) that would otherwise fall through to the scalar switch or hash table. Cache is populated on successful ArrayParseCommand resolution and excludes: synthetic ParseRespCommandBuffer calls (ACL checks), subcommand results, and custom commands. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
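A two-slot cache of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the type `MruCacheSketch`, the `Slot` fields, the integer command ids, the explicit `Length == 0` emptiness check, and the promote-on-hit policy are all invented for the example, and it assumes .NET 7 or later.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical per-session cache of the last two matched RESP headers.
public sealed class MruCacheSketch
{
    private struct Slot
    {
        public Vector128<byte> Pattern, Mask;
        public int Command, Length;   // Length == 0 marks an empty (zero-initialized) slot
    }

    private Slot slot0, slot1;        // slot0 is the most recently used entry

    public bool TryGet(ReadOnlySpan<byte> buffer, out int command, out int consumed)
    {
        command = 0;
        consumed = 0;
        if (buffer.Length < 16) return false;
        Vector128<byte> input = Vector128.Create<byte>(buffer.Slice(0, 16));
        if (Hit(in slot0, input)) { command = slot0.Command; consumed = slot0.Length; return true; }
        if (Hit(in slot1, input))
        {
            (slot0, slot1) = (slot1, slot0);   // promote the hit to most-recent
            command = slot0.Command;
            consumed = slot0.Length;
            return true;
        }
        return false;
    }

    // Without the Length check, an all-zero mask and pattern would match any input.
    private static bool Hit(in Slot s, Vector128<byte> input)
        => s.Length != 0 && Vector128.EqualsAll(input & s.Mask, s.Pattern);

    // Records a header of at most 16 bytes, evicting the older of the two entries.
    public void Remember(ReadOnlySpan<byte> header, int command)
    {
        if (header.Length == 0 || header.Length > 16) return;
        Span<byte> p = stackalloc byte[16];
        Span<byte> m = stackalloc byte[16];
        p.Clear();
        m.Clear();
        header.CopyTo(p);
        m.Slice(0, header.Length).Fill(0xFF);
        slot1 = slot0;
        slot0 = new Slot
        {
            Pattern = Vector128.Create<byte>(p),
            Mask = Vector128.Create<byte>(m),
            Command = command,
            Length = header.Length
        };
    }
}
```

A hit costs one load, one AND, and one compare per slot, which is the "3-op" shape the commit message refers to.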
…to badrishc/fast-parses
Eliminate SlowParseCommand for all subcommand routing. HandleSubcommandLookup now uses per-parent hash tables for CLUSTER, CLIENT, ACL, CONFIG, COMMAND, SCRIPT, LATENCY, SLOWLOG, MODULE, PUBSUB, MEMORY subcommands.

Key fixes:
- Fix CLUSTER SET-CONFIG-EPOCH hash entry (was SETCONFIGEPOCH, missing hyphens)
- Handle edge cases: COMMAND with 0 args, case-insensitive GETKEYS/USAGE
- Error message formatting: GenericErrUnknownSubCommand for CLUSTER/LATENCY, GenericErrUnknownSubCommandNoHelp for others
- Remove writeErrorOnFailure guard from MRU cache (unnecessary)
- Use consumedBytes (readHead - cmdStartOffset) for cache entry sizing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…le parsing

Replace references to FastParseArrayCommand/SlowParseCommand with hash table instructions. New commands now just need one Add() call in PopulatePrimaryTable(). Document subcommand table wiring and warn about wire-protocol spelling (hyphens etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fixes

- Add Debug.Assert for command name length/positivity in hash table ops
- Add startup ValidateSubTable: verifies every subcommand entry round-trips correctly through the hash table (catches typos like SET-CONFIG-EPOCH)
- Clean up InsertIntoTable: remove redundant double-assignment of NameWord1/2, add explicit zero-init and clear comments on word layout contract
- Fix comment in HashLookupCommand: document that MakeUpperCase only uppercases the first token, subcommands need GetUpperCaseCommand
- Add comment documenting MRU cache zero-initialization safety

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR upgrades Garnet’s RESP command parsing pipeline by introducing SIMD-accelerated matching for the hottest commands, plus a cache-friendly hash table (with subcommand tables) to replace the previous deep switch/linear-scan parsing logic. It also adds benchmark coverage and updates contributor documentation to reflect the new recommended parsing extension points.
Changes:
- Added `RespCommandHashLookup` (primary + subcommand hash tables) and integrated it into `ArrayParseCommand`.
- Reworked `FastParseCommand` to add SIMD Vector128 pattern matching and a per-session 2-slot MRU cache.
- Added a dedicated BenchmarkDotNet benchmark for parser-only throughput and updated docs/guides for adding commands.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| website/docs/dev/fast-parsing-plan.md | Adds a detailed parsing optimization/design document. |
| libs/server/Resp/Parser/RespCommandHashLookup.cs | New static hash-table-based command/subcommand lookup implementation. |
| libs/server/Resp/Parser/RespCommand.cs | Integrates SIMD fast path + MRU cache + hash lookup parsing; removes legacy slow parsing paths. |
| benchmark/BDN.benchmark/Operations/CommandParsingBenchmark.cs | Adds parsing-only microbenchmarks across tiers. |
| .github/skills/add-garnet-command/SKILL.md | Updates contributor guidance to use the new hash lookup path. |
| .github/copilot-instructions.md | Updates “add parsing logic” instructions to reference the new hash lookup table. |
Add new commands from dev to RespCommandHashLookup:
- Vector Set: VADD, VCARD, VDIM, VEMB, VGETATTR, VINFO, VISMEMBER, VLINKS, VRANDMEMBER, VREM, VSETATTR, VSIM
- Range Index (dot-prefixed wire names): RI.CREATE, RI.SET, RI.GET, RI.DEL, RI.RANGE, RI.SCAN, RI.EXISTS, RI.CONFIG, RI.METRICS
- String: SETWITHETAG
- Cluster subcommands: rename AOFSYNC -> ADVANCE_TIME, add MLOG_KEY_TIME, RESERVE
- Internal-only RIPROMOTE, RIRESTORE never come from wire (no hash entry needed)
Keeps FastParseCommand small enough for AggressiveInlining to take effect, reducing call overhead on the common short-buffer scalar path.
private static readonly CommandEntry[] bitopSubTable;
private static readonly int bitopSubTableMask;

static RespCommandHashLookup()
Static initializers can cause the JIT to inject initialization checks around static methods. Since this path is so hot, let's use a module initializer instead.
A small example of the difference in generated code:
WithStaticInit.Bar:
L0000 push rbx
L0001 sub rsp, 0x20
L0005 mov ebx, ecx
L0007 mov rcx, 0x7ffe7cccccc8
L0011 mov edx, 2
L0016 call 0x00007ffedade5e20
L001b mov rcx, 0x7ffe7cccccc8
L0025 mov edx, 2
L002a call 0x00007ffedae04ef0
L002f mov eax, [rax+8]
L0032 cdq
L0033 idiv ebx
L0035 mov eax, edx
L0037 add rsp, 0x20
L003b pop rbx
L003c ret
WithModuleInit.Bar:
L0000 push rbx
L0001 sub rsp, 0x20
L0005 mov ebx, ecx
L0007 mov rcx, 0x7ffe7cccccc8
L0011 mov edx, 3
L0016 call 0x00007ffedae04ef0
L001b mov eax, [rax+8]
L001e cdq
L001f idiv ebx
L0021 mov eax, edx
L0023 add rsp, 0x20
L0027 pop rbx
L0028 ret
.NET 8, x64
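For readers unfamiliar with the suggestion, the module-initializer variant could be sketched as follows. The class `LookupTableSketch` and its contents are hypothetical and unrelated to the actual table; the point is only that the type has no static constructor and no static field initializers, so setup moves into a `[ModuleInitializer]` method (C# 9 / .NET 5 or later) that the runtime runs when the module is initialized.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical sketch: populate static data from a module initializer
// instead of a static constructor or field initializer.
public static class LookupTableSketch
{
    // Deliberately left without an initializer; assigned in Init() below.
    private static int[] table;

    // Must be static, parameterless, void-returning, non-generic,
    // and at least internal in accessibility.
    [ModuleInitializer]
    internal static void Init()
    {
        var t = new int[512];
        for (var i = 0; i < t.Length; i++) t[i] = i * i;
        table = t;
    }

    // Hot-path accessor: masks the index into the 512-entry table.
    public static int Get(int i) => table[i & 511];
}
```

Whether this removes the extra helper call in a given method still depends on the runtime and JIT version, so it is worth confirming against the disassembly as the reviewer did.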
@@ -0,0 +1,503 @@
// Copyright (c) Microsoft Corporation.
nit: there's a mix of explicit local types and var in here - let's prefer var everywhere in this change.
// SIMD Vector128 patterns for FastParseCommand.
// Each encodes the full RESP header + command: *N\r\n$L\r\nCMD\r\n
// Masks zero out trailing bytes for patterns shorter than 16 bytes.
private static readonly Vector128<byte> s_mask13 = Vector128.Create(
These three .AsByte() calls are superfluous.
Description of Change
Replaces the two legacy command parsing methods (`FastParseArrayCommand`, ~950 lines of nested switch/if chains, and `SlowParseCommand`, ~815 lines of sequential `SequenceEqual` comparisons) with a tiered architecture:

- … (`RespCommandHashLookup`): O(1) lookup for all 400 built-in commands. 512-entry table (16KB, L1-resident) with linear probing and per-parent subcommand tables.

Also unifies BITOP subcommand handling, adds debug assertions, hardens the hash table against edge cases (empty names, oversized names), and adds `ValidatePrimaryTable()` for startup integrity checks.

Key files:
- `RespCommand.cs`: Refactored `FastParseCommand` (SIMD + scalar + MRU), removed `FastParseArrayCommand` and `SlowParseCommand`
- `RespCommandHashLookup.cs`: New. Hash table engine (lookup, insert, validate, subcommand dispatch)
- `RespCommandHashLookupData.cs`: New. Command registration (`PopulatePrimaryTable`, subcommand arrays)
- `RespCommandSimdPatterns.cs`: New. SIMD Vector128 patterns and `RespPattern()` helper
- `CommandParsingBenchmark.cs`: New. BDN benchmark covering all parser tiers

CommandParsingBenchmark Results (Params=None, batch of 100):