Skip to content

Conversation

@carlhoerberg
Copy link
Member

@carlhoerberg carlhoerberg commented Dec 7, 2025

Summary

  • Track replicas using individual etcd keys instead of comma-separated ISR list
  • Show all known replicas in Nodes page with Connected/In-sync status columns
  • Add ability to forget disconnected replicas via UI and CLI
  • Generate sequential cluster node IDs using atomic etcd transactions

Changes

Use individual etcd keys for ISR tracking

Instead of maintaining a comma-separated list at a single "isr" key, each replica now has its own key at {prefix}/replica/{id}/insync with value "1" for in-sync and "0" for out-of-sync.

Show all known replicas in nodes page

  • Add known_replicas method to Clustering::Server that queries etcd
  • Merge etcd replicas with connected followers in NodesController
  • Add Connected and In-sync columns to followers table
  • Show leader role in replicas table, sorted first

Add forget button for disconnected replicas

  • Allow removing disconnected replicas via Nodes page UI or DELETE /api/nodes/:id
  • Prevent forgetting leader node
  • Add lavinmqctl forget_cluster_node <id> CLI command

Generate sequential cluster node IDs

  • New nodes get sequential IDs (1, 2, 3, ...) instead of random IDs
  • Uses atomic etcd transactions to ensure IDs don't collide
  • Handles manually created nodes by checking for existing replica keys

Test plan

  • Start a 3-node cluster and verify nodes get IDs 1, 2, 3
  • Verify Nodes page shows all replicas with correct Connected/In-sync status
  • Stop a follower and verify it shows as disconnected
  • Use Forget button to remove disconnected replica
  • Verify lavinmqctl forget_cluster_node works
  • Restart forgotten node and verify it gets a new sequential ID

🤖 Generated with Claude Code

@carlhoerberg carlhoerberg requested a review from a team as a code owner December 7, 2025 04:37
@claude
Copy link

claude bot commented Dec 7, 2025

Claude finished @carlhoerberg's task —— View job


Code Review Complete

  • Read and analyze changed files for bugs, security issues, performance problems, and Crystal anti-patterns

Issues Found

1. Missing Error Handling in Etcd Operations

File: src/lavinmq/clustering/server.cr:149-161

  • The known_replicas method calls @etcd.get_prefix(prefix) but doesn't handle potential etcd communication failures
  • Impact: Network failures or etcd outages could cause unhandled exceptions
  • Recommendation: Add exception handling around etcd operations

2. Potential Integer Overflow in Sequential ID Generation

File: src/lavinmq/clustering/controller.cr:81-89

  • The next_cluster_node_id method increments id indefinitely without bounds checking
  • Impact: In theory, could overflow after Int32::MAX iterations (though unlikely in practice)
  • Recommendation: Add reasonable upper bound check or use Int64

3. Missing Validation in Node Forgetting API

File: src/lavinmq/http/controller/nodes.cr:163-184

  • The delete "/api/nodes/:id" endpoint converts string to Int32 but only checks for nil, not parsing errors
  • Invalid base-36 strings could cause exceptions
  • Recommendation: Add proper error handling for malformed ID strings

4. Race Condition in Follower Connection Handling

File: src/lavinmq/clustering/server.cr:194-202

  • When a follower with existing ID connects, the old follower is removed from @followers array without synchronization with etcd state updates
  • Impact: Brief window where etcd may still show old follower as in-sync while new follower is connecting
  • Recommendation: Ensure etcd state is updated atomically with follower replacement

5. Performance: Inefficient String Operations

File: src/lavinmq/clustering/server.cr:152-158

  • Multiple string splits and repeated array operations when parsing etcd keys
  • Impact: Performance degradation with many replicas
  • Recommendation: Use more efficient parsing or cache parsed results

No Issues Found For:

  • Security vulnerabilities
  • Major Crystal anti-patterns
  • Missing memory management issues

@carlhoerberg
Copy link
Member Author

Fixes #1425

@carlhoerberg carlhoerberg changed the title Show all known replicas in nodes page with Connected/In-sync columns Improved cluster replica tracking and management Dec 8, 2025
carlhoerberg and others added 7 commits December 9, 2025 00:06
Instead of maintaining a comma-separated list at a single "isr" key,
each replica now has its own key at `{prefix}/replica/{id}/insync`
with value "1" for in-sync and "0" for out-of-sync. Followers are
only marked out-of-sync when a change arrives that they can't receive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add known_replicas method to Clustering::Server that queries etcd
- Merge etcd replicas with connected followers in NodesController
- Add Connected and In-sync columns to followers table in nodes.ecr
- Update nodes.js to render new columns with checkmarks
- Simplify followers API: remove syncing_followers/all_followers, keep single followers method
- Fix etcd.get_prefix to use map instead of map! (type mismatch fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Prevents a reconnecting follower from being incorrectly marked as
out-of-sync. Also moves full_sync into begin block so ensure runs
on sync failure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Allow administrators to remove disconnected replicas from etcd via the
Nodes page UI or DELETE /api/nodes/:id endpoint. The button only appears
for disconnected followers and includes a confirmation dialog.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add role field (leader/follower) to replica API response
- Display "Leader" in Connected column for leader node
- Sort replicas with leader first
- Prevent forgetting leader node via API and hide Forget button

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Allows removing disconnected nodes from the cluster via CLI:
  lavinmqctl forget_cluster_node <id>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Instead of generating random cluster node IDs, new nodes now get
sequential IDs (1, 2, 3, ...) by atomically creating the replica
insync key in etcd. This ensures IDs don't collide with existing
nodes, including manually created ones.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Support rolling upgrades from v2.6.x by checking both the new individual
replica keys and the old comma-separated ISR key format. When a new leader
takes over, it deletes the legacy key to clean up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copy link
Member

@snichme snichme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the code in Clustering::Server but haven't worked with that code so cannot give a good review there. Will run the code locally as well.

Possible to add some more specs, feel like there is a lot of logic added but not much specs around that

run_queue: 0,
sockets_used: @amqp_server.vhosts.sum { |_, v| v.connections.size },
followers: @amqp_server.followers,
followers: merged_replicas,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we change the response? Meaning the keys for each follower is not the same? If so this is a breaking change. We could handle it like that or add merged_replicas as additional data in the response like so

Suggested change
followers: merged_replicas,
followers: @amqp_server.followers,
merged_replicas: merged_replicas

end
end

# Add any connected followers not yet in etcd (shouldn't happen but be safe)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can a connected follower not be in etcd?

@id.to_s(str, 36)
private def mark_insync(id : Int32)
key = "#{@config.clustering_etcd_prefix}/replica/#{id.to_s(36)}/insync"
@etcd.put(key, "1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rescue Etcd::Error and re-raise with something more specific here to improve the error handling.

if isr = @etcd.get(old_key)
legacy_isr_exists = true
if isr.split(",").map(&.to_i(36)).includes?(@id)
Log.info { "In sync via legacy ISR key" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are insync vid legacy ISR key, shouldn't we write that we are insync in with the new key as well?

@snichme snichme linked an issue Dec 19, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OSR(out of sync replicas) for better nodes interface

3 participants