fix: Improve cluster connection pool logic when disconnecting #1864

martinslota · 2024-03-06T22:10:04Z

Motivation and Background

This is an attempt to fix errors occurring when a connect() call is made shortly after a disconnect(), which is something that the Bull library does when pausing a queue.

Here's a relatively minimal way to reproduce an error:

import IORedis from "ioredis";

const cluster = new IORedis.Cluster([{ host: "localhost", port: 6380 }]);

await cluster.set("foo", "bar");

const endPromise = new Promise((resolve) => cluster.once("end", resolve));
await cluster.quit();
cluster.disconnect();
await endPromise;

cluster.connect();
console.log(await cluster.get("foo"));
cluster.disconnect();

Running that script in a loop using

#!/bin/bash

set -euo pipefail

while true
do
    DEBUG=ioredis:cluster node cluster-error.mjs
done

against the main branch of ioredis quickly results in this output:

/Code/ioredis/built/cluster/index.js:124
                    reject(new redis_errors_1.RedisError("Connection is aborted"));
                           ^

RedisError: Connection is aborted
    at /Code/ioredis/built/cluster/index.js:124:28

Node.js v20.11.0

My debugging led me to believe that the existing node cleanup logic in the ConnectionPool class leads to race conditions: upon disconnect(), the this.connectionPool.reset() call removes nodes from the pool without cleaning up their event listeners, which may subsequently cause more than one drain event to be emitted. Depending on timing, one of the extra drain events may fire after connect() and change the status to close, interfering with the connection attempt and leading to the error above.
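To make the suspected sequence easier to follow, here is a heavily simplified model of it. It is only a sketch and not the actual ioredis code: NaivePool, the status variable and the node emitters are hypothetical stand-ins, and the assumption is that each pooled node carries an end listener which emits drain once the pool is empty.

```javascript
// Simplified model of the suspected race; NOT the actual ioredis code.
// NaivePool, status and the node emitters are hypothetical stand-ins.
const { EventEmitter } = require("node:events");

class NaivePool extends EventEmitter {
  constructor() {
    super();
    this.nodes = new Map();
  }

  add(key, node) {
    this.nodes.set(key, node);
    // This listener stays attached to the node even after reset() drops it.
    node.once("end", () => {
      this.nodes.delete(key);
      if (this.nodes.size === 0) this.emit("drain");
    });
  }

  reset() {
    // Removes nodes from the pool but leaves their "end" listeners in place.
    this.nodes.clear();
  }
}

const pool = new NaivePool();
const nodeA = new EventEmitter();
const nodeB = new EventEmitter();
pool.add("a", nodeA);
pool.add("b", nodeB);

let status = "ready";
const drainLog = []; // records the status seen by each drain event
pool.on("drain", () => {
  drainLog.push(status);
  status = "close";
});

// disconnect(): the pool is reset, then the sockets end one by one.
pool.reset();
nodeA.emit("end"); // pool is already empty, so this emits the first drain

// connect() is called before the second socket has finished closing.
status = "connecting";
nodeB.emit("end"); // stale listener emits a second drain during the connect

console.log(drainLog); // [ 'ready', 'connecting' ]
console.log(status); // close
```

In this model the second drain arrives while the status is connecting and flips it back to close, which is the kind of interference described above.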

Changes

Keep track of node listeners in the ConnectionPool class and remove them from the nodes whenever they are removed from the pool.
Emit -node / drain events regardless of whether nodes disconnected on their own or were removed through a reset() call.
Within reset(), add nodes before removing old ones to avoid unwanted drain events.
Fix one of the listeners by using an arrow function to make this point to the connection pool instance.
Try to fix the script for running cluster tests and attempt to enable them on CI. If this doesn't work out or isn't useful, I'm happy to revert the changes.
Add a test around this issue. The error thrown in the test on main is seemingly different from the error shown above but it still seems related to the disconnection logic and still gets fixed by the changes in this PR.
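The first four changes can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: TrackingPool, records, endListener and the shape of reset() are hypothetical names chosen for the example, and plain EventEmitter instances stand in for Redis clients.

```javascript
// Sketch of the listed changes; all names are hypothetical, not ioredis internals.
const { EventEmitter } = require("node:events");

class TrackingPool extends EventEmitter {
  constructor() {
    super();
    // key -> { node, endListener }: the listener is stored so it can be removed later.
    this.records = new Map();
  }

  add(key, node) {
    if (this.records.has(key)) return;
    // Arrow function: `this` is the pool, not the node that emitted the event.
    const endListener = () => this.remove(key);
    node.on("end", endListener);
    this.records.set(key, { node, endListener });
    this.emit("+node", key);
  }

  remove(key) {
    const record = this.records.get(key);
    if (!record) return; // unknown key: nothing to do
    record.node.removeListener("end", record.endListener);
    this.records.delete(key);
    // Same events whether the node ended by itself or was removed by reset().
    this.emit("-node", key);
    if (this.records.size === 0) this.emit("drain");
  }

  reset(nextNodes) {
    // Add first, remove second: the pool is never empty in between,
    // so replacing nodes does not produce a drain event.
    for (const [key, node] of nextNodes) this.add(key, node);
    for (const key of [...this.records.keys()]) {
      if (!nextNodes.has(key)) this.remove(key);
    }
  }
}

const fixedPool = new TrackingPool();
let drains = 0;
fixedPool.on("drain", () => drains++);

const oldNode = new EventEmitter();
fixedPool.reset(new Map([["a", oldNode]]));
fixedPool.reset(new Map([["b", new EventEmitter()]])); // swap a -> b
const drainsAfterSwap = drains;

fixedPool.reset(new Map()); // disconnect: the pool becomes empty
oldNode.emit("end"); // its listener was removed, so no extra drain
console.log(drainsAfterSwap, drains); // 0 1
```

Swapping one node for another produces no drain in this sketch, and a removed node that ends later cannot emit a second one because its listener is gone.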

martinslota · 2024-06-10T11:39:14Z

I have now created a separate repository that (hopefully) makes it easy to reproduce the bug.

We have been running the fix from this branch in production for roughly the last 3 months, and it has considerably reduced the error rates we see when shutting down Bull queue clients.

martinslota · 2024-08-15T21:12:31Z

I just pushed the fixes identified in valkey-io/iovalkey#5.

martinslota added 16 commits March 6, 2024 21:47

Tell Redis cluster to disable protected mode before running tests

0d2a416

Try to enable Redis cluster tests on CI

5385c8e

Add a failing test around Redis cluster disconnection logic

f67d73a

Rename function parameter

66efb1a

Turn node error listener into an arrow function so that this points to the connection pool instance

f0cadc5

Extract node listeners into separate constants

49e9edd

Keep track of listeners along with each Redis client

dee2623

Remove node listeners when the node is being removed

4d519a6

Emit node removal events whenever a node is removed

ed3e190

When resetting, add nodes before removing old ones

cd3b74a

Rename Node type to NodeRecord for clarity

2b47e94

Also rename the field holding node records

e835185

Rename variable to nodeRecord

eb2d07c

Rename another variable to nodeRecord

62427a8

Fix a reference to connection pool nodes

2979176

Do not fail when retrieving a node by non-existing key

4d4ba69

martinslota mentioned this pull request Mar 25, 2024

fix: avoid race condition when reconnecting OptimalBits/bull#2716

Closed

Merge branch 'main' into martinslota/clean-up-node-listeners-upon-disconnect

1c3df53

martinslota mentioned this pull request Jun 10, 2024

fix: Improve cluster connection pool logic when disconnecting valkey-io/iovalkey#5

Merged

martinslota added 4 commits August 15, 2024 22:40

Revert "Fix a reference to connection pool nodes"

5a490c1

This reverts commit 2979176.

Fix a reference to connection pool nodes, this time a bit more correctly

d67b1b6

Add a valid slots table to mock server in tests that expect to be able to connect using the Cluster client

d779cb4

Do not assume that node removal will occur *after* refreshSlotsCache() is finished

3165ff0
