Add version byte to ClusterConfig and ReplicationHistory serialization #1778

Merged: vazois merged 16 commits into dev from vazois/cluster-versioning on May 8, 2026

vazois commented May 7, 2026

Summary

Add serialization versioning to the ClusterConfig and ReplicationHistory binary formats. A leading version byte lets a node detect incompatible payloads during gossip and disk recovery before it attempts a full deserialization.


Motivation

Without version markers, nodes running different Garnet versions could exchange incompatible gossip payloads, leading to deserialization failures or silent data corruption. The version byte enables:

  • Early rejection of incompatible gossip messages with clear warnings
  • Safe disk recovery with graceful fallback on format changes

Changes

1️⃣ ClusterConfig Versioning

| File | Change |
| --- | --- |
| libs/cluster/Server/ClusterConfig.cs | Added ClusterConfigVersion constant (version 1, byte) |
| libs/cluster/Server/ClusterConfigSerializer.cs | Version byte as first byte in ToByteArray/FromByteArray; added TryPeekVersion helper; fixed SerializeSlotMap to use relative stream position |
| libs/cluster/Session/RespClusterBasicCommands.cs | Version check in NetworkClusterGossip (receive path) |
| libs/cluster/Server/Gossip/GarnetServerNode.cs | Version check in GossipAsync (response path) |
| libs/cluster/Server/Gossip/Gossip.cs | Version check in TryMeetAsync (meet response path) |
| libs/cluster/Server/Failover/ReplicaFailoverSession.cs | Version check in failover gossip response path |
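
For readers who want the shape of the change without opening the diff, here is a minimal sketch of a version-byte header with a peek helper. It is an illustration under assumptions, not the code in ClusterConfigSerializer.cs: the class name `VersionedPayload` and the opaque `body` parameter are hypothetical, and only the method names TryPeekVersion, ToByteArray and FromByteArray come from the table above.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a versioned serializer (not Garnet's actual class).
public static class VersionedPayload
{
    // Assumed current format version; the PR describes version 1 stored as a byte.
    public const byte CurrentVersion = 1;

    // Prepend the version byte to an opaque, already-serialized body.
    public static byte[] ToByteArray(ReadOnlySpan<byte> body)
    {
        var result = new byte[1 + body.Length];
        result[0] = CurrentVersion;
        body.CopyTo(result.AsSpan(1));
        return result;
    }

    // Read the version without parsing the rest of the payload.
    // Returns false when there is no byte to read.
    public static bool TryPeekVersion(ReadOnlySpan<byte> data, out byte version)
    {
        version = 0;
        if (data.IsEmpty) return false;
        version = data[0];
        return true;
    }

    // Validate the header, then return the body that follows it.
    public static byte[] FromByteArray(byte[] data)
    {
        if (!TryPeekVersion(data, out var version))
            throw new InvalidDataException("Payload is empty; no version byte.");
        if (version != CurrentVersion)
            throw new InvalidDataException(
                $"Unsupported version {version}; expected {CurrentVersion}.");
        return data.AsSpan(1).ToArray();
    }
}
```

A gossip path would typically call `TryPeekVersion` first and log a warning when the result is false or the version differs, so that an incompatible payload is dropped without throwing. A disk-recovery path can call `FromByteArray` directly and let the InvalidDataException surface.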

2️⃣ ReplicationHistory Versioning

| File | Change |
| --- | --- |
| libs/cluster/Server/Replication/ReplicationHistoryManager.cs | Added ReplicationHistoryVersion constant (version 1, byte); version byte in ToByteArray/FromByteArray; graceful recovery on mismatch (reinitializes fresh state) |

Note: ReplicationHistory is only persisted to/from local disk (not exchanged over gossip), so recovery simply reinitializes — the node will re-negotiate with its primary on next sync.
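
The recovery behaviour described in the note can be sketched as follows. This is a simplified model under assumptions: `HistorySketch`, its single `PrimaryReplId` field, and the `Recover` helper are hypothetical and do not mirror the real ReplicationHistory layout. The catch clause also covers IOException, which a later commit in this PR adds for truncated or corrupted files.

```csharp
using System;
using System.IO;

// Hypothetical, simplified model of a versioned on-disk record.
public sealed class HistorySketch
{
    public const byte Version = 1;

    // Illustrative field only; the real type stores more state.
    public string PrimaryReplId { get; init; } = string.Empty;

    public byte[] ToByteArray()
    {
        using var ms = new MemoryStream();
        using var writer = new BinaryWriter(ms);
        writer.Write(Version);        // version byte first
        writer.Write(PrimaryReplId);  // length-prefixed string
        writer.Flush();
        return ms.ToArray();
    }

    public static HistorySketch FromByteArray(byte[] data)
    {
        using var reader = new BinaryReader(new MemoryStream(data));
        var version = reader.ReadByte(); // throws EndOfStreamException on empty input
        if (version != Version)
            throw new InvalidDataException($"Unsupported history version {version}.");
        return new HistorySketch { PrimaryReplId = reader.ReadString() };
    }

    // Recovery does not propagate a bad payload: it logs and falls back to fresh state.
    public static HistorySketch Recover(byte[] onDisk, Action<string> logWarning)
    {
        try
        {
            return FromByteArray(onDisk);
        }
        catch (Exception ex) when (ex is InvalidDataException or IOException)
        {
            // EndOfStreamException derives from IOException, so truncated
            // payloads are handled by the same filter.
            logWarning($"Discarding replication history: {ex.Message}");
            return new HistorySketch();
        }
    }
}
```

Returning a fresh instance is only reasonable here because, as the note says, the state is local and is re-negotiated with the primary on the next sync.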

3️⃣ Tests

| Test | Description |
| --- | --- |
| ClusterConfigVersionRoundTripTest | Round-trip serialization preserves version header and data |
| ClusterConfigVersionMismatchThrowsTest | Corrupt version byte → InvalidDataException |
| ClusterConfigTryPeekVersionEmptyDataTest | Empty payload → TryPeekVersion returns false |
| ReplicationHistoryVersionRoundTripTest | Round-trip serialization preserves version and fields |
| ReplicationHistoryVersionMismatchThrowsTest | Corrupt version byte → InvalidDataException |

All deserialization paths covered

| Call site | Guard |
| --- | --- |
| ClusterManager.cs (disk recovery) | FromByteArray throws InvalidDataException |
| NetworkClusterGossip (gossip receive) | TryPeekVersion pre-check |
| GarnetServerNode.GossipAsync (gossip response) | TryPeekVersion pre-check |
| Gossip.TryMeetAsync (meet response) | TryPeekVersion pre-check |
| ReplicaFailoverSession (failover gossip) | TryPeekVersion pre-check |
| RecoverReplicationHistory (disk recovery) | ✅ Graceful catch + reinitialize |

vazois and others added 2 commits May 7, 2026 12:02
Add ClusterConfigVersion constant (version 1) to ClusterConfig that is
serialized as the first byte of the config payload. On deserialization,
the version is validated and an InvalidDataException is thrown on
mismatch.

All gossip paths now peek the version byte before full deserialization
to reject incompatible payloads early with a warning:
- NetworkClusterGossip (receive path)
- GarnetServerNode.GossipAsync (response path)
- TryMeetAsync caller (meet response path)
- ReplicaFailoverSession (failover gossip response path)

Also fix SerializeSlotMap to use relative stream position instead of
hardcoded position 0 for the segment count header.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add ReplicationHistoryVersion constant (version 1) to ReplicationHistory
that is serialized as the first byte of the payload. On deserialization,
the version is validated and an InvalidDataException is thrown on
mismatch.

RecoverReplicationHistory handles version mismatches gracefully by
catching the exception, logging a warning, and reinitializing fresh
state. This is safe because replication history is re-negotiated on
the next primary/replica sync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 20:29
Copilot AI left a comment

Pull request overview

This PR introduces explicit serialization version markers for ClusterConfig and ReplicationHistory to detect incompatible payloads during cluster gossip and replication-history disk recovery, and updates gossip/meet/failover paths to validate versions before deserializing.

Changes:

  • Add a leading version byte to ClusterConfig and ReplicationHistory binary formats, with validation on deserialization and a TryPeekVersion fast-path for ClusterConfig.
  • Add version checks to gossip receive/response paths (including MEET and failover gossip responses) to reject incompatible payloads early with warnings.
  • Fix ClusterConfig.SerializeSlotMap to write the segment-count header at the correct reserved stream position (relative to current stream offset, not hardcoded to 0).
  • Add new cluster tests covering round-trip, version mismatch, and empty payload for TryPeekVersion.
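
The SerializeSlotMap item in the list above is a back-patching fix: the count header is written at the offset the stream had when the map started, not at offset 0. The sketch below shows that pattern under assumptions. `SlotMapSketch` and `WriteSegments` are hypothetical names, and the ushort count followed by ushort entries is an illustrative layout, not Garnet's actual slot-map encoding.

```csharp
using System;
using System.IO;

// Hypothetical illustration of back-patching a count header at a relative offset.
public static class SlotMapSketch
{
    // Writes [count:ushort][entries...] starting at the writer's current offset.
    public static void WriteSegments(BinaryWriter writer, ushort[] segments)
    {
        var stream = writer.BaseStream;

        // Remember where the header lives instead of assuming offset 0.
        var headerPosition = stream.Position;
        writer.Write((ushort)0); // placeholder, patched below

        ushort count = 0;
        foreach (var segment in segments)
        {
            writer.Write(segment);
            count++;
        }

        // Seek back, patch the real count, then restore the end position
        // so later writes append after the entries.
        var endPosition = stream.Position;
        stream.Position = headerPosition;
        writer.Write(count);
        stream.Position = endPosition;
    }
}
```

With a version byte written first, a header patched at a hardcoded position 0 would presumably overwrite that byte, which is why the relative position matters once a prefix precedes the slot map.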

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| test/Garnet.test.cluster/ClusterConfigTests.cs | Adds tests for versioned ClusterConfig/ReplicationHistory serialization behavior. |
| libs/cluster/Session/RespClusterBasicCommands.cs | Validates ClusterConfig version before gossip deserialization on the network receive path. |
| libs/cluster/Server/Replication/ReplicationHistoryManager.cs | Adds versioning to replication history serialization and attempts graceful recovery on version mismatch. |
| libs/cluster/Server/Gossip/Gossip.cs | Validates MEET response version before deserializing/merging cluster config. |
| libs/cluster/Server/Gossip/GarnetServerNode.cs | Validates gossip response version before deserializing/merging. |
| libs/cluster/Server/Failover/ReplicaFailoverSession.cs | Validates failover gossip response version before deserializing/merging. |
| libs/cluster/Server/ClusterConfigSerializer.cs | Implements TryPeekVersion, writes/validates version byte, and fixes slot-map segment-count header position. |
| libs/cluster/Server/ClusterConfig.cs | Defines ClusterConfigVersion constant. |
Comments suppressed due to low confidence (1)

libs/cluster/Server/Gossip/GarnetServerNode.cs:205

  • This deserializes the same returnedConfigArray twice (once into 'other' and again inside TryMerge). Reuse the already-deserialized 'other' when merging to avoid extra allocations and parsing on the gossip hot path.
                    var other = ClusterConfig.FromByteArray(returnedConfigArray);
                    var current = clusterProvider.clusterManager.CurrentConfig;
                    // Check if gossip is from a node that is known and trusted before merging
                    if (current.IsKnown(other.LocalNodeId))
                        clusterProvider.clusterManager.TryMerge(ClusterConfig.FromByteArray(returnedConfigArray));
                    else

Comment thread libs/cluster/Server/ClusterConfigSerializer.cs
Comment thread libs/cluster/Server/ClusterConfigSerializer.cs
Comment thread libs/cluster/Server/Gossip/Gossip.cs
Comment thread libs/cluster/Server/Failover/ReplicaFailoverSession.cs Outdated
Comment thread libs/cluster/Server/Replication/ReplicationHistoryManager.cs Outdated
Comment thread libs/cluster/Server/Replication/ReplicationHistoryManager.cs
vazois and others added 8 commits May 7, 2026 13:53
Replace the single leading version byte with a 2-byte magic prefix
followed by the version byte. This eliminates ambiguity with legacy
(pre-version) payloads:

- ClusterConfig uses 'GC' (0x47 0x43) magic. As a little-endian
  UInt16 this equals 17223 (0x4347), which exceeds the maximum possible
  legacy segmentCount (16384), so it can never collide with an old payload.
- ReplicationHistory uses 'GR' (0x47 0x52) magic, distinguishable
  from the legacy format which starts with a 7-bit encoded string
  length.

TryPeekVersion now validates both magic bytes before extracting the
version, ensuring legacy payloads are reliably rejected rather than
silently misinterpreted.

Add ClusterConfigLegacyPayloadRejectedTest to verify old-format
payloads are properly rejected by both TryPeekVersion and
FromByteArray.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
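
The collision argument in this commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only (`MagicCheck` is a hypothetical helper, not part of the PR). It reads the two magic bytes in both byte orders, since the result depends on which one the reader uses.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for checking the magic-versus-legacy-header argument.
public static class MagicCheck
{
    // Largest segmentCount a legacy payload could start with, per the commit message.
    public const ushort MaxLegacySegmentCount = 16384;

    // Interpret the first two bytes the way BinaryReader.ReadUInt16 would.
    public static ushort AsLittleEndian(ReadOnlySpan<byte> magic) =>
        BinaryPrimitives.ReadUInt16LittleEndian(magic);

    // Interpret the same bytes in network (big-endian) order.
    public static ushort AsBigEndian(ReadOnlySpan<byte> magic) =>
        BinaryPrimitives.ReadUInt16BigEndian(magic);
}
```

For the bytes 0x47 0x43, `AsLittleEndian` returns 17223 (0x4347) and `AsBigEndian` returns 18243 (0x4743). Both values are above 16384, so the no-collision conclusion holds in either byte order.
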

When a MEET response has an incompatible config version, dispose the
newly-created GarnetServerNode to avoid leaking sockets/resources,
and count the meet request as failed in gossip stats.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Catch EndOfStreamException and IOException in addition to
InvalidDataException during replication history recovery. This
handles truncated or corrupted on-disk payloads gracefully by
reinitializing fresh state instead of crashing the server.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Pass the already-deserialized ClusterConfig object directly to
TryMerge instead of calling FromByteArray a second time. Fixes
both ReplicaFailoverSession and GarnetServerNode.GossipAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace explicit Encoding.ASCII with the default (UTF-8) in
ClusterConfigSerializer and ReplicationHistory BinaryWriters.

The encoding parameter only affects string serialization (ReadString/
Write(string)), not integer types. ASCII silently replaces non-ASCII
characters with '?' causing data corruption. The BinaryReader in
ClusterConfigSerializer was already using the default UTF-8, creating
a writer/reader encoding asymmetry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
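
The writer/reader asymmetry described in this commit is easy to reproduce in isolation. The following sketch is not taken from the PR (`EncodingAsymmetry` and `RoundTrip` are hypothetical names). It writes a string with a chosen encoding and reads it back with the BinaryReader default, which is UTF-8.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper demonstrating a BinaryWriter/BinaryReader encoding mismatch.
public static class EncodingAsymmetry
{
    // Writes `value` with `writerEncoding`, then reads it back with the
    // default BinaryReader encoding (UTF-8).
    public static string RoundTrip(string value, Encoding writerEncoding)
    {
        using var ms = new MemoryStream();
        using (var writer = new BinaryWriter(ms, writerEncoding, leaveOpen: true))
        {
            writer.Write(value); // length-prefixed, encoded with writerEncoding
        }
        ms.Position = 0;
        using var reader = new BinaryReader(ms);
        return reader.ReadString();
    }
}
```

`RoundTrip("caf\u00e9", Encoding.ASCII)` returns `"caf?"`, because the ASCII encoder substitutes `?` for the non-ASCII character. `RoundTrip("caf\u00e9", Encoding.UTF8)` returns the original string. Integer fields are unaffected by either choice, as the commit message notes.
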

Expand XML doc comments on ClusterConfigMagic and
ReplicationHistoryMagic to explain:
- The human-readable prefix values ('GC' = Garnet Config,
  'GR' = Garnet Replication)
- Why a magic prefix is needed for backwards compatibility
  with the legacy pre-versioned format
- How each magic value is guaranteed to never collide with
  valid legacy payload headers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace static readonly byte[] with static ReadOnlySpan<byte>
properties backed by UTF-8 string literals ('GC'u8 / 'GR'u8).
This avoids heap allocation — the data is embedded directly in
the assembly as a compile-time constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
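
A before/after sketch of the pattern this commit describes, assuming C# 11 or later for the `u8` suffix. The name ClusterConfigMagic comes from the commit messages above; `ClusterConfigMagicArray`, `StartsWithMagic`, and the enclosing `Magic` class are hypothetical and exist only to make the comparison concrete. Later commits in this PR drop the magic prefix altogether.

```csharp
using System;

// Hypothetical container contrasting the two ways of declaring constant bytes.
public static class Magic
{
    // Before: a heap-allocated array created at type initialization;
    // callers could also mutate its contents.
    public static readonly byte[] ClusterConfigMagicArray = { 0x47, 0x43 };

    // After: a UTF-8 string literal. The returned span refers to constant
    // data stored in the assembly, so no array is allocated.
    public static ReadOnlySpan<byte> ClusterConfigMagic => "GC"u8;

    // Example use: check whether a payload begins with the magic prefix.
    public static bool StartsWithMagic(ReadOnlySpan<byte> payload) =>
        payload.StartsWith(ClusterConfigMagic);
}
```

The property form is needed because a ReadOnlySpan<byte> cannot be stored in a static field. Each access returns a new span over the same embedded bytes.
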
vazois changed the title from "Add version byte to ClusterConfig and ReplicationHistory serialization" to "Add version int to ClusterConfig and ReplicationHistory serialization" on May 7, 2026
vazois and others added 5 commits May 7, 2026 16:37
Update serialization version fields from byte (1 byte) to int (4 bytes)
for both ClusterConfig and ReplicationHistory. This changes the binary
layout to: 2-byte magic word + 4-byte int version.

- ClusterConfigVersion: byte -> int
- ReplicationHistoryVersion: byte -> int
- TryPeekVersion: reads 4 bytes via BitConverter.ToInt32
- FromByteArray: uses ReadInt32() with updated length checks
- Tests: updated to use BitConverter for version corruption/validation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…zation

Drop the 2-byte magic prefix ("GC"/"GR") from both formats since legacy
(pre-versioned) format support is not needed. The binary layout is now
simply: version int (4 bytes) + payload. This reduces gossip message
overhead by 2 bytes per exchange.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…onHistory

Revert version type from int (4 bytes) to byte (1 byte) for both
ClusterConfig and ReplicationHistory serialization. The binary layout
is now: version byte (1 byte) + payload, minimizing wire overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vazois changed the title from "Add version int to ClusterConfig and ReplicationHistory serialization" to "Add version byte to ClusterConfig and ReplicationHistory serialization" on May 8, 2026
vazois merged commit b0eea3d into dev on May 8, 2026
22 of 23 checks passed
vazois deleted the vazois/cluster-versioning branch on May 8, 2026 23:29