
fix: Improve cluster connection pool logic when disconnecting #1864

Open · wants to merge 21 commits into main

Conversation


@martinslota commented Mar 6, 2024

Motivation and Background

This PR attempts to fix errors that occur when a connect() call is made shortly after disconnect(), which is something the Bull library does when pausing a queue.

Here's a relatively minimal way to reproduce an error:

import IORedis from "ioredis";

const cluster = new IORedis.Cluster([{ host: "localhost", port: 6380 }]);

await cluster.set("foo", "bar");

const endPromise = new Promise((resolve) => cluster.once("end", resolve));
await cluster.quit();
cluster.disconnect();
await endPromise;

cluster.connect();
console.log(await cluster.get("foo"));
cluster.disconnect();

Running that script in a loop using

#!/bin/bash

set -euo pipefail

while true
do
    DEBUG=ioredis:cluster node cluster-error.mjs
done

against the main branch of ioredis quickly results in this output:

/Code/ioredis/built/cluster/index.js:124
                    reject(new redis_errors_1.RedisError("Connection is aborted"));
                           ^

RedisError: Connection is aborted
    at /Code/ioredis/built/cluster/index.js:124:28

Node.js v20.11.0

My debugging led me to believe that the existing node cleanup logic in the ConnectionPool class leads to race conditions: upon disconnect(), the this.connectionPool.reset() call removes nodes from the pool without cleaning up their event listeners, which may then emit more than one drain event. Depending on timing, one of the extra drain events may fire after connect() and change the status to close, interfering with the connection attempt and leading to the error above.
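
To make the timing concrete, here is a minimal, self-contained sketch of that failure mode in TypeScript. LeakyPoolSketch and its methods are hypothetical stand-ins for illustration only and do not mirror the real ConnectionPool API:

import { EventEmitter } from "node:events";

// Hypothetical pool used only to illustrate the race; not ioredis code.
class LeakyPoolSketch extends EventEmitter {
  private nodes = new Map<string, EventEmitter>();

  addNode(key: string, node: EventEmitter): void {
    this.nodes.set(key, node);
    // This listener is never detached when the node is removed via reset(),
    // so a late "end" from an already-removed node can still emit "drain".
    node.once("end", () => {
      this.nodes.delete(key);
      if (this.nodes.size === 0) {
        this.emit("drain");
      }
    });
  }

  reset(): void {
    // The pool forgets its nodes and reports "drain", but the "end"
    // listeners attached above stay alive on the old node objects.
    this.nodes.clear();
    this.emit("drain");
  }
}

const pool = new LeakyPoolSketch();
let drains = 0;
pool.on("drain", () => console.log(`drain #${++drains}`));

const oldNode = new EventEmitter();
pool.addNode("a", oldNode);

pool.reset();          // disconnect(): first "drain"
oldNode.emit("end");   // stale listener fires: second "drain"

A second drain like this, arriving only after a subsequent connect(), is what flips the status back to close and aborts the new connection.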

Changes

  • Keep track of node listeners in the ConnectionPool class and remove them from the nodes whenever the nodes are removed from the pool (see the sketch after this list).
  • Emit -node / drain events regardless of whether nodes disconnected on their own or were removed through a reset() call.
  • Within reset(), add new nodes before removing old ones to avoid unwanted drain events.
  • Fix one of the listeners by using an arrow function so that this points to the connection pool instance.
  • Try to fix the script for running cluster tests and attempt to enable them on CI. If this doesn't work out or isn't useful, I'm happy to revert those changes.
  • Add a test around this issue. The error thrown by the test on main looks different from the error shown above, but it still appears related to the disconnection logic and is still fixed by the changes in this PR.
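
Here is a rough sketch of the listener-tracking approach described in the list above, again in TypeScript against a simplified pool; the class name, method names, and event payloads are illustrative assumptions rather than the real ConnectionPool API:

import { EventEmitter } from "node:events";

type NodeClient = EventEmitter;

class PoolFixSketch extends EventEmitter {
  private nodes = new Map<string, NodeClient>();
  // Remember each node's listener so it can be detached when the node leaves.
  private endListeners = new Map<NodeClient, () => void>();

  private addNode(key: string, node: NodeClient): void {
    // An arrow function keeps `this` bound to the pool rather than the node.
    const onEnd = () => this.removeNode(key);
    this.endListeners.set(node, onEnd);
    node.on("end", onEnd);
    this.nodes.set(key, node);
  }

  private removeNode(key: string): void {
    const node = this.nodes.get(key);
    if (!node) return;
    this.nodes.delete(key);
    // Detach the listener so a removed node can never emit into the pool again.
    const onEnd = this.endListeners.get(node);
    if (onEnd) {
      node.removeListener("end", onEnd);
      this.endListeners.delete(node);
    }
    // Emit -node / drain in one place, whether the node ended on its own
    // or was removed through reset().
    this.emit("-node", key);
    if (this.nodes.size === 0) {
      this.emit("drain");
    }
  }

  reset(next: Array<{ key: string; node: NodeClient }>): void {
    // Add new nodes first so the pool is never momentarily empty,
    // which would otherwise produce an unwanted "drain" event.
    for (const { key, node } of next) {
      if (!this.nodes.has(key)) {
        this.addNode(key, node);
      }
    }
    const keep = new Set(next.map((n) => n.key));
    for (const key of [...this.nodes.keys()]) {
      if (!keep.has(key)) {
        this.removeNode(key);
      }
    }
  }
}

The arrow-function bullet corresponds to the onEnd handler above: a plain function expression passed to node.on("end", ...) would have its this rebound to the emitting node, while the arrow function closes over the pool instance.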

@martinslota (Author)

I have now created a separate repository that (hopefully) makes it easy to reproduce the bug.

We have been using the fix in this branch in production for roughly the last three months, and it has considerably reduced the error rates we see when shutting down Bull queue clients.

@martinslota (Author)

I just pushed the fixes identified in valkey-io/iovalkey#5.
