Join with context cancelation #291
Description
The existing (*Memberlist).Join method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs and we end up waiting the TCPTimeout duration on each of them.
For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs may take as long as 25 minutes. Nodes that are in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.
Proposal
Add a context.Context argument to (*Memberlist).Join and check it between the push/pull exchanges with each node.
Alternatively, if you don't want to break existing clients, we can create a new method, JoinContext, which does the above.
I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.