Commit 81cf8a7
Signed-off-by: Jacob Murphy <[email protected]>
1 parent 89e136a
Showing 1 changed file with 128 additions and 49 deletions.
81cf8a7
I finally figured out why this is so confusing to me. This describes all of the activities as saving/restoring protobufs, but Valkey doesn't know anything about protobufs and certainly doesn't know how to save/restore them. It may seem obvious, but to Valkey developers, RDB I/O is a set of functions that operate on a totally different set of datatypes, i.e., ValkeyModule_SaveString, ValkeyModule_SaveUnsigned, etc., not protobufs (sketched below).
I think the description needs to be written in the language of the Valkey RDB I/O functions.
Also, I don't think it makes any sense for individual chunks of a supplemental section to be required to be encoded as protobufs. The entire point of the chunking is that you can't afford to store the uber-object as a protobuf, so a chunk is only a piece of what you're storing. Why force it to be a protobuf when we know we're going to discard the protobuf-ness immediately? That's unnecessary run-time overhead, not to mention extra coding time, etc.
I also can't quite tell whether this proposal supports multiple supplemental data sections per RDB section. If not, I think that's a serious problem.
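To make the vocabulary concrete, this is roughly what an rdb_save callback looks like from the module side: a sequence of ValkeyModule_* RDB I/O calls, no protobufs anywhere. The struct and field names below are made up for illustration, and SaveStringBuffer is assumed to behave like its Redis module counterpart.

```cpp
// Minimal sketch of an rdb_save callback as a module author writes it:
// typed ValkeyModule_* calls against the ValkeyModuleIO handle.
#include <cstdint>
#include <string>
#include "valkeymodule.h"

struct SupplementalEntry {        // hypothetical per-key payload
  uint64_t version;
  std::string payload;
};

void MyType_RdbSave(ValkeyModuleIO *io, void *value) {
  auto *entry = static_cast<SupplementalEntry *>(value);
  ValkeyModule_SaveUnsigned(io, entry->version);
  ValkeyModule_SaveStringBuffer(io, entry->payload.data(), entry->payload.size());
}
```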
81cf8a7
Thanks for figuring out the disconnect. Let me see if I can add some clarity by closing the gap between the two. TL;DR - each protobuf is serialized as a single ValkeyModule_SaveString call
We need some EOF marker for the payload. Originally, I considered VM_SaveString("") as the EOF, but I was worried about accidentally emitting the EOF when saving. E.g., imagine you are serializing a key-value-pair dict and one of the values is an empty string; you might just iterate over the key-value pairs and do VM_SaveString(key); VM_SaveString(val);. That VM_SaveString(val) could accidentally match the EOF marker in the case where val == "".
With protobuf, I thought it was a creative way to signal EOF by having a protobuf with empty content at the end: "not present" is an explicit state different from "empty". Overhead per chunk would be <10 bytes (a byte for the field tag, plus a variable number of bytes for the length of the chunk). Alternatively, if you have an idea for a better EOF marker, let me know and I'm happy to update it.
The goal with EOF is to not force precomputing the number of chunks ahead of time, which will simplify the saving logic. Otherwise, we need to know how much we are saving ahead of time to emit a length, which may be difficult for complicated data types.
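For concreteness, a sketch of what the save side could look like under this scheme. The Chunk message and helper names are hypothetical, it assumes an optional bytes field so that "not present" is distinguishable from an empty string, and ValkeyModule_SaveStringBuffer is assumed to behave like its Redis module counterpart.

```cpp
// Each chunk is wrapped in a (hypothetical) Chunk protobuf and written with
// one SaveStringBuffer call; a trailing Chunk with `content` unset is the EOF
// marker. A chunk whose content is "" still serializes with a field tag, so
// it can never collide with the EOF.
#include <string>
#include "valkeymodule.h"
#include "chunk.pb.h"   // hypothetical: message Chunk { optional bytes content = 1; }

void SaveChunk(ValkeyModuleIO *io, const std::string &payload) {
  Chunk chunk;
  chunk.set_content(payload);                    // payload may legitimately be empty
  std::string wire = chunk.SerializeAsString();  // field tag + length + payload
  ValkeyModule_SaveStringBuffer(io, wire.data(), wire.size());
}

void SaveEof(ValkeyModuleIO *io) {
  Chunk eof;                                     // content left unset: "not present"
  std::string wire = eof.SerializeAsString();    // serializes to zero bytes
  ValkeyModule_SaveStringBuffer(io, wire.data(), wire.size());
}
```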
81cf8a7
I agree there are times when not having to pre-compute the number of things makes the code easier. For those cases, an EOF marker isn't a bad way to handle it. Conversely, there are many cases where the pre-computation is trivial and having that value in hand greatly simplifies the code (like handling a vector -- a large vector probably REQUIRES you to pre-compute so that you're not paying the dynamic vector-expansion penalty on large data structures).
I feel like we ought to be able to handle both cases without much complexity. I don't see why some clever encoding in the header can't solve both cases.
Suppose we have three kinds of headers.
We could easily use a single signed number for the opcode: positive numbers are type 1, negative numbers are type 2, and zero is type 3.
For easy pre-computation you use type 1, which is always followed by a length value and then that many blobs.
For hard pre-computation you just keep dumping type 2 headers (each followed by its blob) into the output, and finish with a type 3. It would be illegal for the opcode of the type 2s to change midstream.
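A rough sketch of the save side under that encoding (the opcode value, the helper shapes, and the availability of ValkeyModule_SaveSigned are assumptions for illustration, not part of the proposal):

```cpp
// Positive opcode = type 1 (a count follows, then that many blobs);
// negative opcode = type 2 (one blob follows, repeated); zero = type 3
// terminator.
#include <cstdint>
#include <string>
#include <vector>
#include "valkeymodule.h"

constexpr int64_t kVectorSection = 1;   // hypothetical section opcode

// Easy pre-computation: the count is cheap to get, so emit it up front and
// let the loader reserve the vector once instead of growing it dynamically.
void SaveVectorSection(ValkeyModuleIO *io, const std::vector<std::string> &blobs) {
  ValkeyModule_SaveSigned(io, kVectorSection);              // type 1: positive
  ValkeyModule_SaveUnsigned(io, blobs.size());              // blob count
  for (const auto &b : blobs)
    ValkeyModule_SaveStringBuffer(io, b.data(), b.size());
}

// Hard pre-computation: emit type 2 headers as chunks are produced, then a
// zero terminator. The (negated) opcode stays fixed for the whole section.
template <typename ChunkProducer>                           // bool next(std::string*)
void SaveStreamedSection(ValkeyModuleIO *io, int64_t opcode, ChunkProducer next) {
  std::string chunk;
  while (next(&chunk)) {
    ValkeyModule_SaveSigned(io, -opcode);                   // type 2: negative
    ValkeyModule_SaveStringBuffer(io, chunk.data(), chunk.size());
  }
  ValkeyModule_SaveSigned(io, 0);                           // type 3: terminator
}
```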