Join with context cancelation  #291

@dimitarvdimitrov

Description

The existing (*Memberlist).Join method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs, because we end up waiting out the full TCPTimeout on each of them.

For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs may take as long as 25 minutes. Nodes that are in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.

Proposal

Add a context.Context argument to (*Memberlist).Join and check it between the pushPull exchanges with each node.

Alternatively, if you don't want to break existing clients, we can add a new method, JoinContext, which does the above.

I'm creating this issue to get feedback on the idea. After discussion, I'm happy to open a PR.
