perf: improve WABinary encoder and decoder (+3.7x faster) #2513
Santosl2 wants to merge 20 commits into WhiskeySockets:master from
Conversation
…e IDs" This reverts commit c03f9d8.
…and improve class structure
📝 Walkthrough
Token storage refactors from object to Map structure across the constants file. Decode and encode modules are restructured into dedicated classes.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Thanks for opening this pull request and contributing to the project! The next step is for the maintainers to review your changes. If everything looks good, it will be approved and merged into the main branch. In the meantime, anyone in the community is encouraged to test this pull request and provide feedback.
✅ How to confirm it works: if you've tested this PR, please comment below. This helps us speed up the review and merge process.
📦 To test this PR locally: if you encounter any issues or have feedback, feel free to comment as well.
…anced JID handling Co-authored-by: Copilot <copilot@github.com>
…ance Co-authored-by: Copilot <copilot@github.com>
@jlucaso1 or @purpshell can you run real stress test? |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/WABinary/constants.ts`:
- Around line 1295-1305: Restore the old exported symbol TOKEN_MAP as a
compatibility alias to the new TOKEN_MAP_REFACTOR so external callers and
BinaryNodeCodingOptions keep the same shape; add a re-export or alias that makes
TOKEN_MAP refer to the same Map instance as TOKEN_MAP_REFACTOR (e.g., export
TOKEN_MAP aliasing TOKEN_MAP_REFACTOR) near where TOKEN_MAP_REFACTOR is defined
so both names point to the identical Map.
In `@src/WABinary/decode.ts`:
- Around line 91-99: The readInt implementation in readInt(n, littleEndian) uses
bitwise << and | which coerce to signed 32-bit and can produce negative values
for 4-byte lengths (BINARY_32), allowing checkEOS to be bypassed; update readInt
to compute unsigned values (either by delegating to Buffer.readUIntBE/readUIntLE
when n <= 6 or by using arithmetic accumulation: val = val * 256 + byte) so
lengths remain unsigned, and keep references to checkEOS, readInt, next,
readBytes and readStringFromChars to verify callers still receive a non-negative
length. Ensure littleEndian handling is preserved and tests for 4-byte high-bit
cases are added.
In `@src/WABinary/encode.ts`:
- Around line 247-263: The attrCount computation currently increments for any
non-null/undefined attribute but encoder only serializes string values in the
later loop (see attrs, attrCount, encoder.writeListStart, and the string-write
loop), causing an incorrect list size when non-strings (e.g., numbers) are
present; fix by either (A) changing the counting loop to only increment when
typeof v === 'string', or (B) validate attrs before encoding and throw an error
if any attr value is non-string so encoder.writeListStart is sized correctly;
apply the same string-only rule consistently before writing tag/content.
📒 Files selected for processing (3)
src/WABinary/constants.ts, src/WABinary/decode.ts, src/WABinary/encode.ts
```diff
 export const TOKEN_MAP_REFACTOR: Map<string, { dict?: number; index: number }> = new Map()

 for (const [i, SINGLE_BYTE_TOKEN] of SINGLE_BYTE_TOKENS.entries()) {
-  TOKEN_MAP[SINGLE_BYTE_TOKEN] = { index: i }
+  TOKEN_MAP_REFACTOR.set(SINGLE_BYTE_TOKEN, { index: i })
 }

 for (const [i, DOUBLE_BYTE_TOKEN] of DOUBLE_BYTE_TOKENS.entries()) {
   for (const [j, element] of DOUBLE_BYTE_TOKEN.entries()) {
-    TOKEN_MAP[element] = { dict: i, index: j }
+    TOKEN_MAP_REFACTOR.set(element, { dict: i, index: j })
   }
 }
```
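The diff above replaces plain-object token indexing with a Map. A quick sketch (standalone, not the project's actual module) of why Map lookups are more predictable than indexing a plain object:

```typescript
// Sketch: a plain-object token index can resolve inherited keys such as
// "toString" via the prototype chain; a Map only contains keys it was
// explicitly given, and lookups stay predictable O(1).
const objTokens: Record<string, { index: number }> = { message: { index: 1 } }
const mapTokens = new Map<string, { index: number }>([['message', { index: 1 }]])

// Plain object: inherited keys leak through an unchecked lookup
console.log('toString' in objTokens) // true, found on Object.prototype

// Map: only explicitly set keys exist
console.log(mapTokens.has('toString')) // false
console.log(mapTokens.get('message')?.index) // 1
```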
Preserve the old TOKEN_MAP export for compatibility.
Removing TOKEN_MAP changes the public BinaryNodeCodingOptions shape and breaks existing imports/custom opts immediately. Unless this PR is intentionally shipping a semver break, please re-export the same Map under the old name.
Compatibility alias:

```diff
 export const TOKEN_MAP_REFACTOR: Map<string, { dict?: number; index: number }> = new Map()
+export const TOKEN_MAP = TOKEN_MAP_REFACTOR
```
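A minimal sketch of how the suggested alias behaves (the `export` keywords are omitted here so the snippet runs standalone): both names reference the same Map instance, so a write through either name is visible to both.

```typescript
// Sketch of the compatibility alias: assigning a Map to a second const
// copies the reference, not the contents, so callers importing either
// name observe the identical token table.
const TOKEN_MAP_REFACTOR: Map<string, { dict?: number; index: number }> = new Map()
const TOKEN_MAP = TOKEN_MAP_REFACTOR // in the real module these would be exports

TOKEN_MAP_REFACTOR.set('message', { index: 42 })
console.log(TOKEN_MAP.get('message')) // { index: 42 }, the alias sees the write
console.log(TOKEN_MAP === TOKEN_MAP_REFACTOR) // true, identical instance
```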
```diff
 private readInt(n: number, littleEndian = false): number {
   this.checkEOS(n)
   let val = 0
   for (let i = 0; i < n; i++) {
     const shift = littleEndian ? i : n - 1 - i
-    val |= next()! << (shift * 8)
+    val |= this.next() << (shift * 8)
   }

   return val
```
🧩 Analysis chain
🌐 Web query: In JavaScript, do bitwise shift/or operators coerce numbers to signed 32-bit integers, and does Node.js `Buffer.readUIntBE(offset, 4)` return an unsigned value up to `0xffffffff`?
💡 Result:
Yes: bitwise operators (`<<`, `>>`, `&`, `|`, `^`, `~`) coerce their operands to signed 32-bit integers via the ToInt32 abstract operation, which maps numbers into the range [-2^31, 2^31 - 1] in two's complement. The exception is the unsigned right shift (`>>>`), whose left operand is converted with ToUint32. And yes: `Buffer.readUIntBE(offset, 4)` reads 4 bytes at the given offset as an unsigned big-endian integer, returning a Number from 0 to 0xffffffff (4294967295). This is distinct from signed methods like `readInt32BE` and matches the specialized `readUInt32BE`. The returned value is a safe JavaScript Number, well within the 2^53 - 1 integer range.
Stop building 4-byte lengths with bitwise operators.
<< and | coerce to signed 32-bit ints. For any BINARY_32 length with the high bit set, readInt(4) becomes negative, so checkEOS() can be bypassed and the subsequent readBytes() / readStringFromChars() calls run with an invalid length. Switch to Buffer.readUIntBE/LE() or arithmetic accumulation.
Use Buffer's unsigned readers:

```diff
 private readInt(n: number, littleEndian = false): number {
   this.checkEOS(n)
-  let val = 0
-  for (let i = 0; i < n; i++) {
-    const shift = littleEndian ? i : n - 1 - i
-    val |= this.next() << (shift * 8)
-  }
-
-  return val
+  const val = littleEndian
+    ? this.buffer.readUIntLE(this.index, n)
+    : this.buffer.readUIntBE(this.index, n)
+  this.index += n
+  return val
 }
```
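The coercion pitfall is easy to demonstrate standalone: for the bytes `80 00 00 01` (high bit set), the bitwise accumulation goes negative, while arithmetic accumulation and `Buffer.readUIntBE` both stay unsigned.

```typescript
// Demonstrates the signed-coercion bug: building a 4-byte length with
// << and | coerces to a signed 32-bit integer, so 0x80000001 comes out
// negative. Arithmetic accumulation (val * 256 + byte) stays unsigned.
const buf = Buffer.from([0x80, 0x00, 0x00, 0x01])

// Bitwise version (buggy): ToInt32 coercion on every | step
let signed = 0
for (let i = 0; i < 4; i++) {
  signed |= buf[i] << ((4 - 1 - i) * 8)
}

// Arithmetic accumulation (safe): never touches 32-bit bitwise ops
let unsigned = 0
for (let i = 0; i < 4; i++) {
  unsigned = unsigned * 256 + buf[i]
}

console.log(signed) // -2147483647
console.log(unsigned) // 2147483649
console.log(buf.readUIntBE(0, 4)) // 2147483649
```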
```diff
 let attrCount = 0
 if (attrs) {
   for (const k in attrs) {
     const v = attrs[k]
     if (v !== undefined && v !== null) attrCount++
   }
 }

-writeListStart(2 * validAttributes.length + 1 + (typeof content !== 'undefined' ? 1 : 0))
-writeString(tag)
+encoder.writeListStart(2 * attrCount + 1 + (content !== undefined ? 1 : 0))
+encoder.writeString(tag)

-for (const key of validAttributes) {
-  if (typeof attrs[key] === 'string') {
-    writeString(key)
-    writeString(attrs[key])
+if (attrs) {
+  for (const key in attrs) {
+    const val = attrs[key]
+    if (typeof val === 'string') {
+      encoder.writeString(key)
+      encoder.writeString(val)
```
Make attribute counting match attribute serialization.
Lines 247-252 count any non-null attribute, but Lines 259-264 only write string values. If a JS caller passes something like { foo: 1 }, the list header is too large and the rest of the node is encoded at the wrong offsets. Count only strings here, or throw before writing.
Guard non-string attrs before sizing the list:

```diff
 let attrCount = 0
 if (attrs) {
-  for (const k in attrs) {
-    const v = attrs[k]
-    if (v !== undefined && v !== null) attrCount++
+  for (const k in attrs) {
+    const v = attrs[k]
+    if (v === undefined || v === null) {
+      continue
+    }
+    if (typeof v !== 'string') {
+      throw new Error(`invalid attribute value for "${k}"`)
+    }
+    attrCount++
   }
 }
```
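The size-mismatch hazard can be sketched in isolation (simplified, assumed shapes; not the project's actual encoder): the list header must count exactly the attribute pairs the write loop will emit, so counting and writing must share the same string-only predicate.

```typescript
// Sketch: count only string-valued attributes so the list header size
// matches what the serializer actually writes. A numeric attribute here
// is skipped by both the count and the (hypothetical) write loop.
type Attrs = Record<string, unknown>

function countStringAttrs(attrs: Attrs): number {
  let count = 0
  for (const k in attrs) {
    if (typeof attrs[k] === 'string') count++
  }
  return count
}

const attrs: Attrs = { id: 'abc', t: 1700000000, from: 'user@s.whatsapp.net' }
// Header size: 2 slots per attribute pair + 1 for the tag
const listSize = 2 * countStringAttrs(attrs) + 1
console.log(listSize) // 5, since only 'id' and 'from' are strings
```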
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let attrCount = 0 | |
| if (attrs) { | |
| for (const k in attrs) { | |
| const v = attrs[k] | |
| if (v !== undefined && v !== null) attrCount++ | |
| } | |
| } | |
| writeListStart(2 * validAttributes.length + 1 + (typeof content !== 'undefined' ? 1 : 0)) | |
| writeString(tag) | |
| encoder.writeListStart(2 * attrCount + 1 + (content !== undefined ? 1 : 0)) | |
| encoder.writeString(tag) | |
| for (const key of validAttributes) { | |
| if (typeof attrs[key] === 'string') { | |
| writeString(key) | |
| writeString(attrs[key]) | |
| if (attrs) { | |
| for (const key in attrs) { | |
| const val = attrs[key] | |
| if (typeof val === 'string') { | |
| encoder.writeString(key) | |
| encoder.writeString(val) | |
| let attrCount = 0 | |
| if (attrs) { | |
| for (const k in attrs) { | |
| const v = attrs[k] | |
| if (v === undefined || v === null) { | |
| continue | |
| } | |
| if (typeof v !== 'string') { | |
| throw new Error(`invalid attribute value for "${k}"`) | |
| } | |
| attrCount++ | |
| } | |
| } | |
| encoder.writeListStart(2 * attrCount + 1 + (content !== undefined ? 1 : 0)) | |
| encoder.writeString(tag) | |
| if (attrs) { | |
| for (const key in attrs) { | |
| const val = attrs[key] | |
| if (typeof val === 'string') { | |
| encoder.writeString(key) | |
| encoder.writeString(val) |
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/WABinary/decode.ts">
<violation number="1" location="src/WABinary/decode.ts:96">
P1: `readInt(4)` uses `<<` and `|` which coerce operands to **signed** 32-bit integers. When a BINARY_32 length has its high bit set (byte ≥ 0x80 shifted left by 24), the result becomes negative. A negative length passes `checkEOS` (since `index + negative < buffer.length`) and then feeds an invalid value into `readBytes`/`readStringFromChars`. Use `Buffer.readUIntBE`/`readUIntLE` or arithmetic multiplication instead of bitwise shifts.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```diff
 for (let i = 0; i < n; i++) {
   const shift = littleEndian ? i : n - 1 - i
-  val |= next()! << (shift * 8)
+  val |= this.next() << (shift * 8)
 }
```
P1: readInt(4) uses << and | which coerce operands to signed 32-bit integers. When a BINARY_32 length has its high bit set (byte ≥ 0x80 shifted left by 24), the result becomes negative. A negative length passes checkEOS (since index + negative < buffer.length) and then feeds an invalid value into readBytes/readStringFromChars. Use Buffer.readUIntBE/readUIntLE or arithmetic multiplication instead of bitwise shifts.
Improved Issue
Refactors the binary encoder/decoder to a class-based, allocation-efficient architecture with strict bounds checking, stable object shapes, and optimized buffer usage. Eliminates per-call closure allocation overhead, improves JIT optimization (inline caching), and reduces GC pressure under deep recursion.
Also introduces explicit JID decoding paths, correct empty string handling, and preallocated structures for predictable performance.
Core Changes
Architecture
- ByteEncoder/ByteDecoder classes
Memory & Performance
- Preallocated structures → minimizes reallocations and copies
- Buffer.subarray used instead of copies → zero-copy slicing
- Preallocated arrays (readList) → avoids push/resize overhead
- → eliminates GC churn in recursive trees
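The zero-copy point above is a standard Node.js Buffer property and can be shown directly: `subarray` returns a view over the same backing memory, so slicing during decode allocates no new buffer.

```typescript
// Sketch: Buffer.subarray shares memory with its parent, so a slice is
// just a view. Mutating the parent is visible through the view, which
// is what makes decode-time slicing allocation-free.
const buf = Buffer.from([1, 2, 3, 4, 5])
const view = buf.subarray(1, 4) // bytes [2, 3, 4], no copy made

buf[2] = 99
console.log(view[1]) // 99, the view observes the parent's mutation
console.log(view.length) // 3
```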
Token System
- Token lookups via Map (TOKEN_MAP_REFACTOR) → predictable O(1), no prototype pollution risk; populated via Map#set
Encoding Improvements
- Bounds-checked writes (ensure) for all operations
- Unified integer writing (writeUInt*)
- Correct empty string handling ('' !== LIST_EMPTY)
- Packed encoding (nibble/hex)
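For illustration of the packed-encoding idea, here is a minimal nibble packer (an assumed, simplified scheme for digit strings; the actual WhatsApp packed formats include details such as odd-length padding that are not modeled here): two 4-bit digits fit in one byte, halving the size of numeric strings.

```typescript
// Sketch (assumed scheme): pack decimal digit characters two per byte,
// high nibble first. '1234' becomes the two bytes 0x12 0x34.
function packNibbles(digits: string): Buffer {
  const out = Buffer.alloc(Math.ceil(digits.length / 2))
  for (let i = 0; i < digits.length; i++) {
    const nibble = digits.charCodeAt(i) - 48 // '0'..'9' -> 0..9
    if (i % 2 === 0) out[i >> 1] = nibble << 4
    else out[i >> 1] |= nibble
  }
  return out
}

console.log(packNibbles('1234')) // <Buffer 12 34>
```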
Decoding Improvements
checkEOS) on every operationPerformance Rationale
- Before (closure-based)
- After (class-based)
- Observed effect
[Bench BEFORE / Bench AFTER: benchmark screenshots not preserved in this capture]