Remove Compression & Decompression from RLPMemo #456
Conversation
I think I either don't follow or there's a mistake in the approach of removing the Compress. Let's assume that we have a Branch whose children have the following first nibbles {A, B, C}. How does this approach distinguish cases where:

1. A and B are encoded to Keccak while C's RLP is less than 31 bytes
2. A and C are encoded to Keccak while B's RLP is less than 31 bytes

Where is the information stored now?
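To make the ambiguity concrete, an illustrative layout (not actual code): without per-child presence information, both cases produce an identical 64-byte memo.

// case 1: [ keccak(A) | keccak(B) ]  // C's RLP is short and kept inline
// case 2: [ keccak(A) | keccak(C) ]  // B's RLP is short and kept inline
// Two 32-byte slots either way; nothing records which child each slot belongs to.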
Force-pushed from 253eef4 to eeec2e7
A few remarks, but almost ready to merge.
{
    var bits = leftover[^emptyBytesLength..];
    NibbleSet.Readonly.ReadFrom(bits, out empty);
    var bits = _buffer[^indexLength..];
This might be nitpicking, but if we stored the index at the beginning, no span slice would be required and we could get it from the start, just by passing the buffer.
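For illustration, a sketch of the two layouts (names mirror the diff above; ReadFrom usage as quoted there):

// Index at the end (current): a slice from the back is needed first.
var bits = _buffer[^indexLength..];
NibbleSet.Readonly.ReadFrom(bits, out var index);

// Index at the beginning (the suggestion): the buffer could be passed as-is.
NibbleSet.Readonly.ReadFrom(_buffer, out var indexAtFront);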
I tried this particular change and for some reason it started breaking the CalculateStateRootHash. I tried a bunch of ways to test/debug, but it wasn't clear what exactly was causing the test failure. Moving back to the original index at the end resolved the issues.
Do you have any pointers on whether that is expected? Either way, I prefer to keep the underlying data consistent with the original representation after the Compress operation, by keeping the index at the end, and not deal with state root mismatch issues 😓
As discussed offline, I added new tests with grow/shrink which were able to catch the issue with header placement/incorrect implementation. I also modified the Large_random_operations test, which caught failures too. The current coverage for the RlpMemo class is 100%. All the tests pass with the current implementation.
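A rough shape such a grow/shrink test could take (SetRaw and Exists are taken from the diff in this PR; CreateEmptyMemo and Clear are hypothetical stand-ins for the actual construction and removal APIs):

[Test]
public void Grow_and_shrink()
{
    var memo = CreateEmptyMemo(); // hypothetical helper; the real setup is project-specific

    Span<byte> keccak = stackalloc byte[Keccak.Size];
    keccak.Fill(0xAB);

    memo.SetRaw(keccak, 0x3);          // grow: one 32-byte entry plus the 2-byte index
    Assert.That(memo.Exists(0x3), Is.True);

    memo.Clear(0x3);                   // shrink: hypothetical removal API
    Assert.That(memo.Exists(0x3), Is.False);
}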
Massive.
if (value.Length == Keccak.Size)
{
    memoizedUpdated = true;
    memo.SetRaw(value, i);
    if (memo.Exists(i))
I like the clear approach here, where different cases are split.
// Optimization, omitting some of the branches to memoize.
// It omits only these with two children where the cost of the recompute is not big.
// To prevent an attack of spawning multiple levels of such branches, only even are skipped
if (children.SetCount == 2 && (key.Path.Length + key.StoragePath.Length) % 2 == 0)
This shaved off some memory for small branches. It wasn't the biggest saving but it did something. Is it intentionally removed?
Good catch, I moved this code to InspectBeforeApply, since this optimization was done during compression, which we don't do now. But note that for this particular check, we now have to parse the branch data in InspectBeforeApply and then find out the children.SetCount value. Is the extra parsing in InspectBeforeApply still worth it for the savings on small branches?
The reason why it was applied for small branches is exactly this. A small branch that has only 2 children will have 50% of its Keccaks recalculated on any update. Why bother keeping the other one if we can save 64 + 2 bytes for them? Could you run a db import with this PR and provide the size it uses?
I didn't fully get you... we can save that space, but I just wanted to point out that here we are doing an additional bit of parsing to find out the children count. I think it is fine to do for every branch, since we would potentially be saving 64 + 2 bytes with the smaller branches.
We are. This is due to the nature of unmaterialized data that needs to be reparsed on each pass. I cannot provide you with numbers (#286 has none), but saving 66 bytes out of 68 for small branches (the two 32-byte Keccaks plus the 2-byte index) seems to be a good deal 👀
{
    var bits = leftover[^emptyBytesLength..];
A potential compression opportunity is lost here. Let's consider a branch that has only one Keccak memoized in RlpMemo. In such a case, we could encode this one on a single byte. Even more, we could encode up to 2 on a single byte. For cases where only 1 or 2 Keccaks are memoized, this would make them lighter by one byte: for 2 Keccaks it would be a ~1.5% saving, while for one it would be 3%. It would make things a bit more complicated though. Another option would be a special 1-byte case that keeps the not-set nibbles (which would save some space for 15 and 14 set, but would be much less impactful size-wise, I think). We could have it set in NibbleSet so that it's capable of writing/reading 1-2 bytes instead of only 2.
I know that the original decompress/compress had no such feature:
Paprika/src/Paprika/Merkle/RlpMemo.cs, lines 171 to 176 in 61c326a:

if (empty.SetCount > 0)
{
    var dest = writeTo.Slice(at * Keccak.Size, NibbleSet.MaxByteSize);
    new NibbleSet.Readonly(empty).WriteToWithLeftover(dest);
    return at * Keccak.Size + NibbleSet.MaxByteSize;
}
Paprika/src/Paprika/Merkle/Node.cs, lines 405 to 416 in 61c326a:

if (Children.SetCount == ChildCount2)
{
    leftover[0] = (byte)(Children.BiggestNibbleSet | (Children.BiggestNibbleSet << NibblePath.NibbleShift));
}
else
{
    Debug.Assert(Children.SetCount == ChildCount3);
    // remove the smallest nibble set and then get the smallest
    var mid = Children.Remove(Children.SmallestNibbleSet).SmallestNibbleSet;
    leftover[0] = (byte)(mid | (Children.BiggestNibbleSet << NibblePath.NibbleShift));
}
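A sketch of how the proposed single-byte encoding could reuse the same nibble-shift trick as the Node.cs snippet above (helper names are hypothetical; when only one index is set, it would be repeated in both nibbles, mirroring the ChildCount2 case):

// Pack two memoized indices (each 0..15) into one byte.
static byte Pack(byte smaller, byte bigger) =>
    (byte)(smaller | (bigger << NibblePath.NibbleShift));

// Recover the two indices from the packed byte.
static (byte Smaller, byte Bigger) Unpack(byte packed) =>
    ((byte)(packed & 0x0F), (byte)(packed >> NibblePath.NibbleShift));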
That's a good point! I was also thinking of another approach and wanted to run it by you to check which one might be better (before completing the code changes):
To make maximum use of one byte, it could store a partial bitset index. We can basically divide the 16 bits into two halves and use only 1 byte to represent the index for cases where all the set indices fall in the same half.
Original (2 bytes): [0, 1, 2, ..., 15]
Modified (1 byte): [x, 0, 1, 2, 3, 4, 5, 6] or [x, 7, 8, 9, 10, 11, 12, 13]
x can be 0 or 1 depending on which half it is representing.
As long as the children fall within one half, we would only need 1 byte. Disadvantages of this approach:
a) even if only 2 children exist but in two different halves, we would have to fall back on 2 bytes of storage.
b) nibbles 14 and 15 can never be represented with one byte.
Does this make sense? Do you think it is better/worse than your proposed approach?
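A minimal sketch of this half-range encoding under the layout described above (the helper name and the choice of bit 7 as the half selector x are assumptions):

static bool TryPackHalfRange(ushort fullSet, out byte packed)
{
    const ushort LowHalf = 0x007F;   // nibbles 0..6
    const ushort HighHalf = 0x3F80;  // nibbles 7..13

    if ((fullSet & ~LowHalf) == 0)
    {
        packed = (byte)fullSet;                  // x = 0: all indices in the low half
        return true;
    }
    if ((fullSet & ~HighHalf) == 0)
    {
        packed = (byte)(0x80 | (fullSet >> 7));  // x = 1: all indices in the high half
        return true;
    }

    packed = 0;                                  // spans both halves, or uses 14/15: needs 2 bytes
    return false;
}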
What an interesting idea! One would need to count the collisions and probabilities 😅
- for 2 nibbles, the chance of picking both from one of the halves is 7/15
- for 3 nibbles, we get 1/5
- for 4 nibbles, we get 1/13
As we're trying to optimize for the small branches (the big ones won't benefit from removing 1 byte), it's a bet. With the encoding proposed before, we get 1 byte for up to 2 nibbles and always 2 bytes for other cases. Here we optimize up to any number, but with diminishing returns, right? With 4 Keccaks it'll be only 1% saved.
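For reference, these fractions follow from counting unordered picks of k set nibbles out of 16, split into two halves of 8 each (a verification sketch, not from the thread):

$$P(k) = \frac{2\binom{8}{k}}{\binom{16}{k}}: \quad P(2) = \frac{2 \cdot 28}{120} = \frac{7}{15}, \quad P(3) = \frac{2 \cdot 56}{560} = \frac{1}{5}, \quad P(4) = \frac{2 \cdot 70}{1820} = \frac{1}{13}$$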
What do you think then?
I would like to take up this optimization in another PR, since it would require extra logic and additional testing to ensure it works fine.
Having said this, should this PR no longer be labeled with Breaking?
Yes, removed the label.
}

[Test]
public void All_children_set_with_all_Keccaks_set()
public void Random_delete()
The random tests are nice for fuzzing, but it would be beneficial to have some explicit A, B, C cases where A, B or A, C are set, so that we cover what was previously missing. Or maybe such tests do exist and I'm just not seeing them 👀
Yes, that test is missing, but I am not quite sure how to capture this in a test - a child with RLP < 31 bytes while the others have Keccaks? I looked at existing tests to figure it out (e.g. Three_accounts_sharing_start_nibble, Branch_two_leafs), but I couldn't find a way to simulate this scenario:
[Test]
public void Insert_get_operation()
{
    var commit = new Commit();

    const string key0 = "A0000001";
    const string key1 = "B0000002";
    const string key2 = "C0000003";

    commit.Set(key0);
    commit.Set(key1);
    commit.Set(key2);

    NibbleSet.Readonly children = new NibbleSet(0xA, 0xB, 0xC);
    commit.SetBranch(Key.Merkle(NibblePath.Empty), children);

    var merkle = new ComputeMerkleBehavior();
    merkle.BeforeCommit(commit, CacheBudget.Options.None.Build());
}
I am probably missing some obvious way to complete this test scenario 🫤
I think I'd maybe replace the Branch-based test with a direct KeccakMemo test? Not sure if covering the compounded Branch + memo is doable in a nice way.
Added basic new tests in RlpMemo with fixed children. The Branch + memo test couldn't be done without exposing a few internal details as public, so I'm sticking to independent unit tests.
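For the record, such a fixed-children test could take roughly this shape, covering the A/B vs A/C case from the top of the review (CreateEmptyMemo is the same hypothetical stand-in as in the grow/shrink sketch above):

[Test]
public void Two_keccaks_one_inlined_child()
{
    var memo = CreateEmptyMemo(); // hypothetical helper

    Span<byte> keccak = stackalloc byte[Keccak.Size];
    keccak.Fill(0x42);

    // children at nibbles A, B, C; only A and B have their Keccaks memoized
    memo.SetRaw(keccak, 0xA);
    memo.SetRaw(keccak, 0xB);

    Assert.That(memo.Exists(0xA), Is.True);
    Assert.That(memo.Exists(0xB), Is.True);
    Assert.That(memo.Exists(0xC), Is.False); // the stored index disambiguates this case
}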
Force-pushed from 100d382 to abafbc9
A few more clarifications.
Currently the memoized RLPs are stored in a compressed form when flushing to the PageDb, but when stored in memory, they are in the decompressed form.
This PR removes compression & decompression altogether by always storing the memoized RLPs in compressed form. To achieve this, a NibbleSet index (2 bytes) is stored at the end of the RlpMemo buffer, indicating whether each element is currently stored in the memo. For example, [k1 | k2 | k5] in RlpMemo would also store NibbleSet{1, 2, 5}, or 0b0000_0000_0010_0110, to show that only the children at indices 1, 2 & 5 have their Keccaks currently memoized.
Also note that due to an existing optimization done during the DB write operation to not store memoized RLPs for top-level state branches (SkipRlpMemoizationForTopLevelsCount), the RlpMemo data structure supports storing an empty buffer even if there are children for that branch.
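A sketch of the resulting buffer layout for the example above (entry sizes taken from the description; the bit math is illustrative only):

// [ k1 (32B) | k2 (32B) | k5 (32B) | index (2B) ]
//
// index bits for NibbleSet{1, 2, 5}:
ushort index = (1 << 1) | (1 << 2) | (1 << 5); // 0b0000_0000_0010_0110 == 0x0026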