Fred version - 10.1.0
Redis version - Redis 7.1 (AWS MemoryDB)
Platform - linux
Deployment type - cluster
Describe the bug
When failover happens in AWS MemoryDB, fred is unable to reconnect to the cluster for ~10 minutes, although the failover happens within seconds.
All requests return "Timeout Error: Request timed out."
To Reproduce
I've prepared a minimal reproducer at https://github.com/dmitryvk/fred-repro-failover
Steps to reproduce the behavior:
- clone the repo https://github.com/dmitryvk/fred-repro-failover
- run
docker compose -f deps/docker-compose.yml up -d && ./print-addrs.sh && cargo run
- After 10 seconds, it prints
thread 'main' panicked at src/main.rs:32:37:
called `Result::unwrap()` on an `Err` value: Error { details: "Request timed out.", kind: Timeout }
Logs
(If possible set RUST_LOG=fred=trace and run with --features debug-ids)
connecting
2025-07-29T12:52:39.326481Z DEBUG fred::router::commands: fred-ZrJNA4t5uH: Initializing router with policy: None
2025-07-29T12:52:39.326542Z DEBUG fred::router::centralized: fred-ZrJNA4t5uH: Initializing centralized connection.
2025-07-29T12:52:39.326585Z TRACE fred::protocol::connection: fred-ZrJNA4t5uH: Checking connection type. Native-tls: false, Rustls: false
2025-07-29T12:52:39.326808Z DEBUG hickory_proto::xfer::dns_handle: querying: redis.redis. A
2025-07-29T12:52:39.326873Z DEBUG hickory_resolver::name_server::name_server_pool: sending request: [Query { name: Name("redis.redis."), query_type: A, query_class: IN }]
2025-07-29T12:52:39.326932Z DEBUG hickory_resolver::name_server::name_server: reconnecting: NameServerConfig { socket_addr: 127.0.0.1:1053, protocol: Udp, tls_dns_name: None, http_endpoint: None, trust_negative_responses: true, bind_addr: None }
2025-07-29T12:52:39.326996Z DEBUG hickory_proto::xfer: enqueueing message:QUERY:[Query { name: Name("redis.redis."), query_type: A, query_class: IN }]
2025-07-29T12:52:39.327084Z DEBUG hickory_proto::udp::udp_client_stream: final message: ; header 23987:QUERY:RD:NoError:QUERY:0/0/0
; query
;; redis.redis. IN A
2025-07-29T12:52:39.327200Z TRACE hickory_proto::udp::udp_stream: binding UDP socket port=1028
2025-07-29T12:52:39.327267Z DEBUG hickory_proto::udp::udp_stream: created socket successfully
2025-07-29T12:52:39.327340Z TRACE hickory_proto::udp::udp_client_stream: creating UDP receive buffer with size 512
2025-07-29T12:52:39.327825Z TRACE hickory_proto::rr::record_data: reading A
2025-07-29T12:52:39.327859Z TRACE hickory_proto::rr::record_data: reading A
2025-07-29T12:52:39.327885Z DEBUG hickory_proto::udp::udp_client_stream: received message id: 23987
2025-07-29T12:52:39.327927Z DEBUG hickory_proto::error: response: ; header 23987:RESPONSE:RD,AA,RA:NoError:QUERY:2/0/0
; query
;; redis.redis. IN A
; answers 2
redis.redis. 0 IN A 172.29.0.2
redis.redis. 0 IN A 172.29.0.3
; nameservers 0
; additionals 0
2025-07-29T12:52:39.327974Z DEBUG hickory_proto::error: response: ; header 23987:RESPONSE:RD,AA,RA:NoError:QUERY:2/0/0
; query
;; redis.redis. IN A
; answers 2
redis.redis. 0 IN A 172.29.0.2
redis.redis. 0 IN A 172.29.0.3
; nameservers 0
; additionals 0
resolved [172.29.0.2:6379, 172.29.0.3:6379]
2025-07-29T12:52:39.328099Z DEBUG fred::protocol::connection: fred-ZrJNA4t5uH: Creating TCP connection to redis.redis at 172.29.0.2:6379
2025-07-29T12:52:39.328387Z TRACE fred::protocol::codec: fred-ZrJNA4t5uH: Encoded 14 bytes to redis.redis:6379. Buffer len: 14 (RESP2)
2025-07-29T12:52:49.329987Z DEBUG fred::modules::inner: fred-ZrJNA4t5uH: No `on_error` listener. The error was: Error { details: "Request timed out.", kind: Timeout }
2025-07-29T12:52:49.330043Z TRACE fred::runtime::_tokio: fred-ZrJNA4t5uH: Ending connection task with Err(Error { details: "Request timed out.", kind: Timeout })
thread 'main' panicked at src/main.rs:32:37:
called `Result::unwrap()` on an `Err` value: Error { details: "Request timed out.", kind: Timeout }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-07-29T12:52:49.330495Z DEBUG hickory_proto::xfer::dns_exchange: io_stream is done, shutting down
My analysis
When a failover happens in AWS MemoryDB, the DNS name for the MemoryDB endpoint resolves to 2 A records: one for the failed cluster and one for the functioning (promoted) cluster. Both endpoints accept TCP connections, but the endpoint for the failed cluster never sends anything back after accepting the connection.
I have simulated that situation using docker and dnsmasq - there's one DNS entry for the functioning cluster and one for the failed cluster.
I can see that fred establishes a TCP connection to the first resolved address and discards the other entries. The endpoint it connected to is non-functioning, yet fred keeps using that endpoint and gets only timeouts.
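To illustrate the behavior I would expect instead, here is a minimal sketch using only the standard library. It is not fred's code and not a proposed patch; `connect_first_responsive` and `probe` are hypothetical names I made up for this example. The idea is to try every resolved address in order and accept a connection only after the server has answered a RESP `PING` within a timeout, so an endpoint that accepts the TCP connection but stays silent is skipped.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper (not part of fred): returns a stream to the first
/// address that accepts the connection AND replies to `PING` within `timeout`.
pub fn connect_first_responsive(
    addrs: &[SocketAddr],
    timeout: Duration,
) -> io::Result<TcpStream> {
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "no addresses to try");
    for addr in addrs {
        match probe(addr, timeout) {
            Ok(stream) => return Ok(stream),
            // Remember the failure and fall through to the next A record.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

/// Hypothetical helper: connect, send `PING`, and wait for any reply bytes.
fn probe(addr: &SocketAddr, timeout: Duration) -> io::Result<TcpStream> {
    let mut stream = TcpStream::connect_timeout(addr, timeout)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;

    // RESP encoding of PING (14 bytes).
    stream.write_all(b"*1\r\n$4\r\nPING\r\n")?;

    // Any reply (even an error such as an auth failure) proves the server is
    // alive; a silent endpoint makes this read fail with a timeout error.
    let mut buf = [0u8; 64];
    let n = stream.read(&mut buf)?;
    if n == 0 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "connection closed before any reply",
        ));
    }

    // Restore blocking reads and writes for the caller.
    stream.set_read_timeout(None)?;
    stream.set_write_timeout(None)?;
    Ok(stream)
}
```

With the addresses from the log above, this would be called with `[172.29.0.2:6379, 172.29.0.3:6379]`; the first address would fail the probe after the timeout and the second would be tried next. The sketch assumes plain TCP without TLS and does not cover cluster topology discovery.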