Introduce RFC for RDB format #28

rfc/rdb-format.md (new file, 264 additions)

---
RFC: 23
Status: Proposed
---

# RDB Format for ValkeySearch

## Abstract

To support rolling upgrade and downgrade, ValkeySearch needs to start with a robust RDB format that will support forwards and backwards compatibility.

## Motivation

Our existing RDB format is a good start, but it is fairly rigid and won't accommodate our needs for the foreseeable future. Here are some examples of cases that would be difficult to implement with forwards and backwards compatibility in the current format:

* Saving in-flight operations on index serialization
* Saving non-vector contents (Tag, Numeric)
* Saving "externalized" vector contents (those shared with the engine)
* Supporting changes to the serialized index contents
* Supporting new, large payloads alongside existing index payloads (i.e. chunked/streamed reading from RDB)

## Design considerations

- If no new features are used, we should support full forwards compatibility
- If we cannot support full forwards compatibility, we will need to bump the major version in order to follow semantic versioning
- Downgrade across semantic versions should be supported, but may require an index rebuild
- If new features are used or the encoding of the index has changed, we should ensure this lack of compatibility is detected by previous versions, and fail the RDB load fast
- A configuration should be exposed to toggle marking index contents as required in order to support downgrade-with-rebuild
- For upgrade, we need full backwards compatibility in all situations
- Large payloads that may not fit in memory need to be supported by chunking the contents into the RDB

**Review comment (Member):**

> If no new features are used, we should support full forwards compatibility

This statement has an implication that we shouldn't change the encoding outside of the new features and would preclude future occurrences of ziplist to listpack conversion.

**Review comment (Member):** Just noticed that the RFC is filed under the valkey-search repo, so I assume the scope is limited to this particular module? Or are you thinking about other modules, or even perhaps RDB in general?

**Review comment (Collaborator):** With a thoughtful design of the RDB format, we can support both backward and forward compatibility even when new features are introduced. For example, we can save data section by section. Each section is identified by a unique opcode, and each section's data format could be opcode-count-blobs, where count is the number of following blobs. There will be one loader function per opcode. Backward compatibility can be achieved by always maintaining the past opcodes and loader functions. Forward compatibility can be achieved by skipping any section whose opcode is unknown.

**Reply (Collaborator, Author):**

> Just noticed that the RFC is filed under the valkey-search repo, so I assume the scope is limited to this particular module?

Right - this is just for ValkeySearch. The Valkey engine RDB format is not included in this proposal.

> This statement has an implication that we shouldn't change the encoding outside of the new features

Right - I think for ValkeySearch this will be the default stance. If we need to change the encoding in a way that would prevent a clean downgrade, we will have to bump the major version under semantic versioning.

## Specification

### RDB Integration

- We will store all module data in aux metadata to prevent leaking indexes into the keyspace
- We will use `aux_save2`-style callbacks so that, if the module is completely unused, the RDB can be loaded on a Valkey server that has no ValkeySearch module installed (a registration sketch follows this list).
- We will store all schema contents in a single piece of aux metadata for the module type "SchMgr-VS".
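
As an illustration, registration of such an aux-only type might look like the following sketch. It assumes the Valkey module API mirrors the Redis modules aux-type interface (`ValkeyModuleTypeMethods`, `aux_save2`, `aux_save_triggers`); the callback and function names here are hypothetical.

```
/* Sketch only: registers an aux-only module type named "SchMgr-VS" (module type
 * names must be exactly 9 characters) with aux_save2-style callbacks. With
 * aux_save2, a callback that writes nothing produces no aux field at all, so
 * the RDB stays loadable on a server without ValkeySearch installed. */
#include "valkeymodule.h"

void SchemaManagerAuxSave(ValkeyModuleIO *rdb, int when) {
  /* Serialize all schema contents into the aux section (see "RDB Payload" below). */
}

int SchemaManagerAuxLoad(ValkeyModuleIO *rdb, int encver, int when) {
  /* Deserialize all schema contents; return VALKEYMODULE_OK on success. */
  return VALKEYMODULE_OK;
}

int RegisterSchemaManagerType(ValkeyModuleCtx *ctx) {
  ValkeyModuleTypeMethods tm = {};
  tm.version = VALKEYMODULE_TYPE_METHOD_VERSION;
  tm.aux_load = SchemaManagerAuxLoad;
  tm.aux_save2 = SchemaManagerAuxSave;
  tm.aux_save_triggers = VALKEYMODULE_AUX_BEFORE_RDB;
  ValkeyModuleType *t =
      ValkeyModule_CreateDataType(ctx, "SchMgr-VS", /*encver=*/0, &tm);
  return t != nullptr ? VALKEYMODULE_OK : VALKEYMODULE_ERR;
}
```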

### RDB Payload

Below is a diagram of the proposed payload design:

```
Example Binary Dump
┌─────────────────────────────────────┐
Example Header │ Chunk 1 Chunk 2 Chunk 3 EOF │
┌──────────────────────────┐ │┌───────┐┌───────┐┌───────┐┌────────┐│
Unknown types│ Type Header │ ││... ││... ││... ││ ││
are skipped │ (enum) Content │ ┌──────────│└───────┘└───────┘└───────┘└────────┘│
│ │ ┌───────┐┌──────────────┐│ │ └─────────────────────────────────────┘
│ │ │ Index ││E.g, attribute││ │
└───────┼►│Content││name... ││ ┌──────────────────┼────────────────────────────┐
│ └───────┘└──────────────┘│ │ Header Binary │ Header Binary │
└──────────────────────────┘ │ Proto 1 Dump 1 ▼ Proto 2 Dump 2 │
│ │┌────────┐┌───────────┐┌────────┐┌───────────┐ │
└─────────►│ ││ ... ││ ││ ... │ │
│└────────┘└───────────┘└────────┘└───────────┘ │
└───────────────────────────────────────────────┘
Example Supplemental Content
RDB Aux Section │
┌─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┐
│ Module Type OpCode When Private Module Data │ │
│┌───────────┐┌────┐┌────┐┌───────────────────────────────────┼───────────────────────────────────────────────────────┐│
││ ││ ││ ││ Min. Sem. Section VSRDBSection │ Supplemental VSRDBSection Supplemental ││
││ ││ ││ ││ Version Count Proto Payload 1 │ Content for #1 Proto Payload 2 Content for #2 ││
││"SchMgr-VS"││ 2 ││ 2 ││┌────────┐┌─────┐┌───────────────┐┌▼─────────────────┐┌───────────────┐┌──────────────────┐││
││ ││ ││ │││ 1.0.0 ││ 2 ││ ││ ││ ││ │││
││ ││ ││ ││└────────┘└─────┘└──────▲────────┘└──────────────────┘└───────────────┘└──────────────────┘││
│└───────────┘└────┘└────┘└────────────────────────┼──────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────┘
Example VSRDBSection
┌───────────────────────────────────┐
│ Type Supplemental │
│ (enum) Count │
│ ┌────────┐┌─────────┐┌──────────┐ │
┌─┼─► Schema ││<content>││ 2 │ │
│ │ └────────┘└─────────┘└──────────┘ │
│ └───────────────────────────────────┘
Unknown types
are skipped
```

#### RDBSection

The primary unit of the RDB Payload is the RDBSection, which will have a proto definition similar to:

```
enum RDBSectionType {
  RDB_SECTION_INDEX_SCHEMA = 0;
  ...
}

message RDBSection {
  RDBSectionType type = 1;
  oneof contents {
    IndexSchema index_schema_definition = 2;
    ...
  }
  uint32 supplemental_count = 3;
}
```

**Review comment (Collaborator):** Regarding `supplemental_count`: should it be uint64?

(`message IndexSchema` follows from the existing definition in src/index_schema.proto)

Before the `RDBSection`s, a single integer will be emitted using `ValkeyModule_SaveUnsigned` to denote the number of `RDBSection`s emitted. Each `RDBSection` is then saved into the RDB using a single `ValkeyModule_SaveString` API call.

The goal of breaking the payload into sections is to support skipping optional sections that are not understood. New sections should be introduced in a manner where a loader that does not understand the new section can still generally load without loss. Any time failure to load a section would result in some form of lossiness or inconsistency, we will need to bump the minimum semantic version to the lowest version capable of reading it. When loading an RDB, if the minimum semantic version is greater than our own version, we can fail fast with a message. This is desirable in cases where operators have used new features and therefore need to think about the downgrade more critically, potentially removing (or, once supported, altering) indexes that will not downgrade gracefully.
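
For illustration, a hedged sketch of the load-side loop under this scheme follows. The `data_model` namespace refers to the proto definitions above, `LoadIndexSchemaSection` and `SkipSupplementalContents` are hypothetical helpers, and the Valkey module load APIs are assumed to mirror the Redis modules API.

```
/* Sketch only: reads the section count written by ValkeyModule_SaveUnsigned,
 * parses each RDBSection, and skips unknown (optional) section types to
 * preserve forward compatibility. */
int SchemaManagerAuxLoad(ValkeyModuleIO *rdb, int encver, int when) {
  uint64_t section_count = ValkeyModule_LoadUnsigned(rdb);
  for (uint64_t i = 0; i < section_count; ++i) {
    ValkeyModuleString *raw = ValkeyModule_LoadString(rdb);
    size_t len = 0;
    const char *buf = ValkeyModule_StringPtrLen(raw, &len);

    data_model::RDBSection section;
    bool parsed = section.ParseFromArray(buf, static_cast<int>(len));
    ValkeyModule_FreeString(nullptr, raw);
    if (!parsed) {
      return VALKEYMODULE_ERR; /* Corrupt payload. */
    }

    switch (section.type()) {
      case data_model::RDB_SECTION_INDEX_SCHEMA:
        LoadIndexSchemaSection(rdb, section); /* Hypothetical per-type loader. */
        break;
      default:
        /* Unknown, optional section: consume its supplemental contents so the
         * stream stays aligned, then continue. */
        SkipSupplementalContents(rdb, section.supplemental_count());
        break;
    }
  }
  return VALKEYMODULE_OK;
}
```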

**Review comment (Member):** I don't see the advantage in differentiating between protobuf content and supplemental content when it's trivially easy to have one uber-format that easily supports both types of data.

**Reply (Collaborator, Author):** We could flatten the two, but my assumption here is that 90% of our data will belong to the index schema. What I like about the current proposal is that adding new sections to the index schema is supported natively by adding supplemental content sections. If we flatten this and have everything at the top level, we need some way to map payloads to index schemas. It's not hard to just dump the index schema name and DB number (which together compose a unique key for looking up the index schema), but given that the majority of changes are likely to be part of the index schema contents, having first-class support for composition in the RDB format through supplemental sections will reduce the complexity.

An example might help. To me, the nested example seems less complex, but it may be a matter of preference.

Nested:

    RDBSection {
       type: RDB_SECTION_INDEX_SCHEMA
       required: true
       encoding_version: 1
       index_schema_definition: {
          name: "my_index",
          db_num: 0,
          ...
       }
       supplemental_count: 1
    }
    SupplementalContentHeader {
       type: SUPPLEMENTAL_CONTENT_MY_NEW_PAYLOAD
       required: ...
       encoding_version: 1
       my_new_payload_header: {}
    }
    SupplementalContentChunk {
       contents: "[my_new_payload_chunk_1]"
    }
    SupplementalContentChunk {
       contents: "[my_new_payload_chunk_2]"
    }
    SupplementalContentChunk {
       contents: "[my_new_payload_chunk_3]"
    }
    ...
    SupplementalContentChunk {
       contents: "" (EOF)
    }

Flattened:

    RDBSectionHeader {
       type: RDB_SECTION_INDEX_SCHEMA
       required: true
       encoding_version: 1
    }
    /* Count of RDB strings */
    1
    /* Serialized index schema proto */
    "IndexSchema{name:\"my_index\", db_num: 0, ...}"
    RDBSectionHeader {
       type: RDB_SECTION_MY_NEW_PAYLOAD
       required: ...
       encoding_version: 1
    }
    /* Count of RDB strings */
    10
    /* Reference to what index this belongs to */
    "my_index"  /* name */
    "0"         /* db_num */
    /* Now the contents begin */
    "[my_new_payload_chunk_1]"
    "[my_new_payload_chunk_2]"
    "[my_new_payload_chunk_3]"
    ...

#### Supplemental Contents

In addition to the contents in the `RDBSection`, we may have contents that are too large to serialize into a protocol buffer in memory, namely contents such as the "key to ID" mappings for the index, or the index contents themselves. To support forwards and backwards compatibility characteristics similar to those of the `RDBSection`, we will include a supplemental count in the `RDBSection`, and upon load, we will load the supplemental contents immediately after the `RDBSection` itself.

Supplemental content is composed of a supplemental header (in protocol buffer format) and a raw chunked binary dump of the supplemental contents. The supplemental content header will be represented by a proto similar to:

```
enum SupplementalContentType {
  SUPPLEMENTAL_CONTENT_INDEX_CONTENT = 0;
  SUPPLEMENTAL_CONTENT_KEY_TO_ID_MAP = 1;
  ...
}

message IndexContentHeader {
  Attribute attribute = 1;
}

message SupplementalContentHeader {
  SupplementalContentType type = 1;
  oneof header {
    IndexContentHeader index_content_header = 2;
    ...
  }
}
```

Each `SupplementalContentHeader` will be emitted using a single `ValkeyModule_SaveString` call, similar to the `RDBSection` proto above. Following the `SupplementalContentHeader`, the binary payload will be emitted in a chunked manner, with each chunk individually saved via a call to `ValkeyModule_SaveString` and the sequence terminated with an EOF marker. See below for more details on the dump format.

There may also be additional non-chunked header contents needed by different supplemental content types, e.g. for index contents, we will need to know which attribute the index corresponds to (represented by `IndexContentHeader` above).

#### Binary Dump

With the current Valkey RDB APIs, modules can only perform a complete read or write of a given type to the RDB; there are no streaming capabilities. If a module were to attempt to write a gigabyte of data, the full gigabyte would have to be serialized in memory and then passed to the RDB APIs to be saved into the RDB.

To prevent memory overhead for large binary dumps, we will implement chunking of binary data to reduce the size of the individual RDB write API calls. We will use a simple protocol buffer with the following format to represent a chunk in a binary dump:

```
message SupplementalContentChunk {
  bytes binary_content = 1;
}
```

Many `SupplementalContentChunk`s will follow a single `SupplementalContentHeader`. Each chunk will be emitted with a single call to `ValkeyModule_SaveString`. Finally, after all `SupplementalContentChunk`s are written for the header, a single empty `SupplementalContentChunk` is emitted to denote EOF. We choose an EOF marker instead of a prefixed length, since precomputing the number of chunks for a large data type is difficult.

When a previous version sees an optional `SupplementalContentHeader` it doesn't understand, it will repeatedly call `ValkeyModule_LoadString` and deserialize each result into a `SupplementalContentChunk`. The loading process can check the length of each chunk's content and identify the EOF without understanding the contents of the dump itself. After EOF, the next item is either the next `SupplementalContentHeader`, or the next `RDBSection` if no more `SupplementalContentHeader`s exist for the current `RDBSection`.
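
A hedged sketch of that skip path follows (the helper name is illustrative, matching the one referenced in the loading sketch earlier):

```
/* Sketch only: skips `count` supplemental payloads that follow an RDBSection
 * without interpreting them. Each payload is one SupplementalContentHeader
 * followed by SupplementalContentChunks terminated by an empty (EOF) chunk. */
void SkipSupplementalContents(ValkeyModuleIO *rdb, uint32_t count) {
  for (uint32_t i = 0; i < count; ++i) {
    /* Consume (and ignore) the SupplementalContentHeader. */
    ValkeyModuleString *header = ValkeyModule_LoadString(rdb);
    ValkeyModule_FreeString(nullptr, header);

    /* Consume chunks until the EOF marker. */
    while (true) {
      ValkeyModuleString *raw = ValkeyModule_LoadString(rdb);
      size_t len = 0;
      const char *buf = ValkeyModule_StringPtrLen(raw, &len);
      data_model::SupplementalContentChunk chunk;
      bool parsed = chunk.ParseFromArray(buf, static_cast<int>(len));
      ValkeyModule_FreeString(nullptr, raw);
      if (!parsed || chunk.binary_content().empty()) {
        break; /* Empty chunk denotes EOF (treat a parse failure as terminal too). */
      }
    }
  }
}
```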


### Semantic Versioning and Downgrade

Whenever the contents of the RDB are changed in a manner such that the new release could produce an RDB that the immediately previous release cannot load, we will need to bump the major version and call this out in our release notes.

When we dump our RDB payload, we will also compute the minimum semantic version required to understand it. Typically, this will be a function of feature usage and the versions in which those features were introduced. But it also leaves open the possibility of changing the RDB format over time, either by breaking downgrade or by employing strategies like skip versions (e.g. version 4.X.X may not understand 6.X.X, but 5.X.X can understand 6.X.X and output a format readable by 4.X.X).

If an RDB is loaded whose minimum semantic version is higher than the current version, the following log will be emitted and loading will fail:

```
ValkeySearch RDB contents require minimum version X.X.X. If you are downgrading, ensure all feature usage on the new version of ValkeySearch is supported by this version and retry.
```
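
A minimal sketch of this fail-fast check, assuming the minimum version is carried as three integers (the struct and function names are illustrative):

```
#include <tuple>

/* Sketch only: compares the RDB's recorded minimum semantic version against
 * the running module's version and aborts the load if the RDB is too new. */
struct SemVer {
  int major_v;
  int minor_v;
  int patch_v;
};

bool CanLoadRdb(const SemVer &rdb_min, const SemVer &current) {
  /* Lexicographic comparison over (major, minor, patch). */
  return std::tie(rdb_min.major_v, rdb_min.minor_v, rdb_min.patch_v) <=
         std::tie(current.major_v, current.minor_v, current.patch_v);
}

int CheckMinimumVersion(ValkeyModuleCtx *ctx, const SemVer &rdb_min,
                        const SemVer &current) {
  if (!CanLoadRdb(rdb_min, current)) {
    ValkeyModule_Log(ctx, "warning",
                     "ValkeySearch RDB contents require minimum version %d.%d.%d. "
                     "If you are downgrading, ensure all feature usage on the new "
                     "version of ValkeySearch is supported by this version and retry.",
                     rdb_min.major_v, rdb_min.minor_v, rdb_min.patch_v);
    return VALKEYMODULE_ERR; /* Abort the RDB load. */
  }
  return VALKEYMODULE_OK;
}
```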

### Coding Interface

#### SupplementalRDBWriter

The `SupplementalRDBWriter` interface provides methods to write and flush data; a sketch of one possible implementation follows the method descriptions below.

##### `void setChunkSize(size_t chunk_size)`

Sets the default chunk size (i.e. the chunk size used if no explicit calls to `flush` are made).

- **Parameters:**
- `chunk_size` (`size_t`): The size of each chunk; if not specified, defaults to 1MiB.

##### `void write(absl::string_view buf)`

Writes a chunk of data.

- **Parameters:**
- `buf` (`absl::string_view`): The data buffer to write.
- **Behavior:**
- The function copies the provided buffer, and might flush it to the RDB.

##### `void flush()`

Flushes any buffered data.

- **Behavior:**
- Ensures that all written data is committed to the RDB.
- Contents are automatically flushed on destruction.

##### `void close()`

Closes the writer.

- **Behavior:**
- Emits an EOF marker to the RDB
- Also called implicitly on destruction
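
A possible shape for this interface, as a non-authoritative sketch: it assumes the writer wraps a `ValkeyModuleIO` handle, that `ValkeyModule_SaveStringBuffer` is the buffer variant of the save-string API, and that the `SupplementalContentChunk` proto above is generated into the `data_model` namespace.

```
#include <cstddef>
#include <string>

#include "absl/strings/string_view.h"

/* Sketch only: buffers writes until chunk_size_ bytes accumulate, emits each
 * full buffer as a SupplementalContentChunk, and emits an empty chunk as the
 * EOF marker on close(). */
class SupplementalRDBWriter {
 public:
  explicit SupplementalRDBWriter(ValkeyModuleIO *rdb) : rdb_(rdb) {}
  ~SupplementalRDBWriter() { close(); }

  void setChunkSize(size_t chunk_size) { chunk_size_ = chunk_size; }

  void write(absl::string_view buf) {
    buffer_.append(buf.data(), buf.size());
    if (buffer_.size() >= chunk_size_) flush();
  }

  void flush() {
    if (buffer_.empty()) return;
    EmitChunk(buffer_);
    buffer_.clear();
  }

  void close() {
    if (closed_) return;
    flush();
    EmitChunk(""); /* An empty chunk acts as the EOF marker. */
    closed_ = true;
  }

 private:
  void EmitChunk(absl::string_view contents) {
    data_model::SupplementalContentChunk chunk;
    chunk.set_binary_content(contents.data(), contents.size());
    const std::string serialized = chunk.SerializeAsString();
    ValkeyModule_SaveStringBuffer(rdb_, serialized.data(), serialized.size());
  }

  ValkeyModuleIO *rdb_;
  size_t chunk_size_ = 1 << 20; /* Default of 1MiB, per setChunkSize above. */
  std::string buffer_;
  bool closed_ = false;
};
```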

#### SupplementalRDBReader

The `SupplementalRDBReader` interface provides a method to read chunks of data; a matching implementation sketch follows the method description below.

##### `std::string read()`

Reads a chunk of data.

- **Returns:**
- `std::string`: A chunk of data that is read from the RDB.
- **Behavior:**
- Returns the next available chunk of data.
- If no data is available, the return value may be an empty `std::string`.
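
A matching reader sketch under the same assumptions as the writer sketch above:

```
/* Sketch only: each read() loads one SupplementalContentChunk from the RDB and
 * returns its bytes; an empty return value corresponds to the EOF marker. */
class SupplementalRDBReader {
 public:
  explicit SupplementalRDBReader(ValkeyModuleIO *rdb) : rdb_(rdb) {}

  std::string read() {
    ValkeyModuleString *raw = ValkeyModule_LoadString(rdb_);
    size_t len = 0;
    const char *buf = ValkeyModule_StringPtrLen(raw, &len);
    data_model::SupplementalContentChunk chunk;
    std::string contents;
    if (chunk.ParseFromArray(buf, static_cast<int>(len))) {
      contents = chunk.binary_content();
    }
    ValkeyModule_FreeString(nullptr, raw);
    return contents;
  }

 private:
  ValkeyModuleIO *rdb_;
};
```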

#### Example Usage

```

void IndexSchema::RDBSave() {
  data_model::RDBSection section;
  section.set_type(data_model::RDB_SECTION_INDEX_SCHEMA);
  *section.mutable_index_schema_definition() = this->ToProto();
  section.set_supplemental_count(this->GetAttributeCount() * 2);
  ValkeyModule_RDBSaveString(section.SerializeAsString());

  for (auto &attribute : attributes_) {
    data_model::SupplementalContentHeader index_content;
    index_content.set_type(data_model::SUPPLEMENTAL_CONTENT_INDEX_CONTENT);
    data_model::IndexContentHeader index_content_header;
    *index_content_header.mutable_attribute() = attribute.second.ToProto();
    *index_content.mutable_index_content_header() = index_content_header;
    ValkeyModule_RDBSaveString(index_content.SerializeAsString());

    auto index_writer = std::make_unique<SupplementalRDBWriter>();
    attribute.second.GetIndex()->SaveIndex(std::move(index_writer));
    /* SaveIndex writes all index contents as binary, then the writer is destructed once done */

    data_model::SupplementalContentHeader key_mapping;
    key_mapping.set_type(data_model::SUPPLEMENTAL_CONTENT_KEY_TO_ID_MAP);
    data_model::KeyToIdHeader key_to_id_header;
    *key_to_id_header.mutable_attribute() = attribute.second.ToProto();
    *key_mapping.mutable_key_to_id_header() = key_to_id_header;
    ValkeyModule_RDBSaveString(key_mapping.SerializeAsString());

    auto key_to_id_writer = std::make_unique<SupplementalRDBWriter>();
    attribute.second.GetIndex()->SaveMapping(std::move(key_to_id_writer));
    /* SaveMapping writes all key to id mappings as binary, then the writer is destructed once done */
  }
}
```
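
For symmetry with the save example, a hypothetical load-side counterpart is sketched below. All of the helper and method names (`RDBLoadSupplemental`, `LoadSupplementalHeader`, `GetAttribute`, `LoadIndex`, `LoadMapping`, `DrainUntilEof`) are illustrative, not part of this proposal.

```
void IndexSchema::RDBLoadSupplemental(ValkeyModuleIO *rdb,
                                      uint32_t supplemental_count) {
  for (uint32_t i = 0; i < supplemental_count; ++i) {
    /* Wraps ValkeyModule_LoadString plus a proto parse of the header. */
    data_model::SupplementalContentHeader header = LoadSupplementalHeader(rdb);

    auto reader = std::make_unique<SupplementalRDBReader>(rdb);
    switch (header.type()) {
      case data_model::SUPPLEMENTAL_CONTENT_INDEX_CONTENT:
        GetAttribute(header.index_content_header().attribute())
            .GetIndex()->LoadIndex(std::move(reader));
        break;
      case data_model::SUPPLEMENTAL_CONTENT_KEY_TO_ID_MAP:
        GetAttribute(header.key_to_id_header().attribute())
            .GetIndex()->LoadMapping(std::move(reader));
        break;
      default:
        /* Unknown optional content: read and discard chunks until EOF. */
        DrainUntilEof(std::move(reader));
        break;
    }
  }
}
```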

### Testing

Cross-version testing will be a must to ensure that we don't get this wrong. We should expand the existing integration tests to cover it.