
grpc-js servers not sending keepalives #2734

Closed
davidfiala opened this issue May 2, 2024 · 9 comments

@davidfiala
Contributor

I've noticed that grpc-js servers do not seem to be sending keepalives to gRPC clients connected over a long-lived stream (though so far I've only observed this within my larger applications).

I'm happy to create a tiny PoC and upload it here, but before I build it, I wanted to clarify whether the grpc-js ChannelOptions are in fact intended to make gRPC servers send PINGs to clients, or whether they were only meant for clients and I misunderstood the docs. I can confirm that clients are happily sending PINGs as expected in 1.10.6.

For example, I've instrumented the grpc-js server.ts file to log when pings should happen, but I don't see any. The server has long-lived bidi streams from a client and is configured with:

createServer({
  'grpc.keepalive_time_ms': args.grpc_server_keepalive_ms,
  'grpc.keepalive_timeout_ms': args.grpc_server_keepalive_timeout_ms,
  'grpc.keepalive_permit_without_calls': 1,
})

... yet no pings are sent and no keepalive disconnects occur in my app so far.
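
For comparison, here is a minimal sketch (not my actual app) of the same configuration against plain @grpc/grpc-js, with placeholder values in place of my args:

import * as grpc from '@grpc/grpc-js';

// Hypothetical standalone repro: a plain grpc-js server configured with the
// keepalive options above (values are placeholders).
const server = new grpc.Server({
  'grpc.keepalive_time_ms': 6 * 60 * 1000,    // send a PING every 6 minutes
  'grpc.keepalive_timeout_ms': 20 * 1000,     // wait 20s for the PING ack
  'grpc.keepalive_permit_without_calls': 1,   // ping even when no calls are active
});

server.bindAsync(
  '0.0.0.0:50051',
  grpc.ServerCredentials.createInsecure(),
  (err, port) => {
    if (err) throw err;
    console.log(`listening on ${port}`);
    // On recent grpc-js versions, I believe start() is no longer needed after bindAsync.
  }
);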

From https://grpc.io/docs/guides/keepalive/ it looks like the Availability column lists both Client and Server for keepalives. And https://github.com/grpc/grpc/blob/master/doc/keepalive.md describes closing the connection "if the ping is not acknowledged by the peer".

I haven't ruled out a bug in my own setup yet, but I would like to verify that the intention is that the client and server can each independently set keepalives and transmit their own PINGs. If so, I'll be happy to make a PoC and continue this bug with my findings.

Thank you for your time!

@murgatroid99
Member

Server-side keepalives were added recently in grpc-js, but they should be working in version 1.10.6. However, they're not logged on the server the way they are on the client, so you can't verify that they're being sent in the same way.

What are your observations that suggest that the server is not sending pings?

@davidfiala
Contributor Author

davidfiala commented May 2, 2024

Originally I noticed because GCP Cloud NAT has a 10-minute limit for idle connections with no TCP traffic: it just drops the TCP connection for any VM running on GCE if no packets flow. I had a long-lived gRPC bidi stream that fit this profile, and grpc-js was not reporting a disconnect on either the client or the server. The connection just hangs forever, and grpc-js doesn't realize it's dead even when I later try to use the channel.

On the server: I set grpc.keepalive_time_ms to 6 minutes and grpc.keepalive_timeout_ms to 20 seconds, but it didn't have any effect: the connections still just died without any indication. Then I added logging to server.ts in _sessionHandler, but I didn't observe any logged statements (maybe I did it wrong), and it didn't help the problem.

Moving on to the client:

I then added the same keepalive channel options to my gRPC clients and added logging in transport.ts. The logging works, and the keepalives now keep my GCE TCP connections alive even when idle.
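
For reference, the client-side setup is roughly the following sketch; EchoClient and its import path are placeholders for my real generated client, and the values are illustrative:

import * as grpc from '@grpc/grpc-js';
// EchoClient is a placeholder for whatever generated service client is in use.
import { EchoClient } from './generated/echo_grpc_pb';

const client = new EchoClient(
  'my-host.example.com:443',
  grpc.credentials.createSsl(),
  {
    'grpc.keepalive_time_ms': 6 * 60 * 1000,   // well under the 10-minute idle cutoff
    'grpc.keepalive_timeout_ms': 20 * 1000,
    'grpc.keepalive_permit_without_calls': 1,
  }
);

// With GRPC_TRACE=keepalive and GRPC_VERBOSITY=DEBUG set, the client-side
// pings show up in the trace output.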

As an aside, I often wrap grpc with the nice-grpc package, so I haven't ruled out that dependency as a cause yet.

EDIT: in all cases, there was a long-lived bidi stream. This wasn't a situation where I had an open channel without an ongoing RPC.

@davidfiala
Contributor Author

davidfiala commented May 27, 2024

Haven't forgotten about this. I just made a PoC with unary and streaming RPCs using base grpc-js from head, and keepalives originating from both the client and the server do seem to transmit. I'm next trying to see whether my usage through nice-grpc is somehow causing the options to not get set correctly.

I've identified a few areas for improvement, hoping you'll consider a PR for them:

  • session.ping should probably disconnect if it detects an error. Right now, the code simply clears the keepalive timeout regardless of the arguments passed to the session.ping callback:
    clearTimeout(timeoutTImer);
    This makes me think that error cases on the ping are being ignored as if they were successes.
    (edit): not only should the err argument be checked, but also whether session.ping returns false, which indicates the ping was not sent (per https://nodejs.org/api/http2.html#http2sessionpingpayload-callback). See the sketch after this list.
  • add trace messaging to both client and server; as you mentioned, it's only partially present in the server-side code right now.
  • optional: unify how the keepalive code works in client and server. The server uses a setInterval, while the client's code repeatedly uses setTimeout IIRC. The setInterval strategy on the server allows multiple pings to potentially overlap, whereas the client appears to protect against this at first glance.
  • the server-side code doesn't appear to allow you to turn off keepalives.
  • clean up a couple of unnecessary non-null assertion operators: https://github.com/search?q=repo%3Agrpc%2Fgrpc-node%20session!&type=code
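
To illustrate the first bullet, here's a rough sketch of the ping handling I have in mind, written against Node's http2 API rather than the actual grpc-js code; both the err callback argument and a false return value are treated as failures:

import type * as http2 from 'http2';

// Sketch only: treat both an error passed to the ping callback and a `false`
// return from session.ping() (ping not sent) as keepalive failures.
function sendKeepalivePing(session: http2.ServerHttp2Session, timeoutMs: number): void {
  const timeoutTimer = setTimeout(() => {
    // No ack within the timeout: assume the connection is dead.
    session.close();
  }, timeoutMs);

  const pingSent = session.ping((err) => {
    clearTimeout(timeoutTimer);
    if (err) {
      // A ping error should not be treated as a successful keepalive.
      session.close();
    }
  });

  if (!pingSent) {
    // Per the Node docs, ping() returns false when the ping could not be sent
    // (e.g. the session is destroyed or maxOutstandingPings is exceeded).
    clearTimeout(timeoutTimer);
    session.close();
  }
}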

@davidfiala
Contributor Author

I've drafted up some fixes in #2756 for what looks (to my eyes) like the most obvious potential cause; let me know if you like the changes.

@davidfiala
Contributor Author

davidfiala commented May 28, 2024

I'm kind of surprised that no one else has noticed this bug, especially since GCP's firewall auto-drops connections after 10 minutes of inactivity.

Hypothesis:

  • gRPC users who focus on unary RPCs would either receive a timeout error or the channel would silently reconnect on the next RPC, so they'd never know they were interrupted beyond a hang until timeout.
  • not many people use long-lived streams that happen to sit inactive for periods longer than 10 minutes.

OTOH, the code inspection plus nodejs/node#18447 seem like strong evidence that the behaviors I was observing are consistent with how the code currently works.

If you have any thoughts, let me know. I'm happy to submit a second PR to improve keepalives on the client side as well, including logging. Getting this right is pretty important for my use case.

@murgatroid99
Member

Here's my guess for why this bug wasn't noticed:

  • The keepalive functionality was added to this server implementation relatively recently. On top of that, keepalives are less impactful on the server than on the client, because the client is responsible for reconnecting when a connection drops. As a result, it's likely that few users have keepalives enabled at all.
  • Node can detect and report connection drops on its own, so failing to detect keepalive errors only matters when the connection drop is also not detected in any other way.
  • The persistence of stale connections and even stale streams should be relatively unimpactful on the server. You might see a bit of memory leakage, or a stream that has become inactive, but often no visible failures. This is in contrast to the client, where a stale connection can prevent the client from making requests and/or being responsive.

@davidfiala
Contributor Author

FWIW, in my original bug I observed that neither the client nor the server was detecting the disconnects. This is because when GCP decided to drop a connection, it did not send any kind of TCP RST to either party. So both sides simply thought the TCP connection was still alive, while the GCP router in the middle silently dropped packets past the 10-minute idle mark with zero notification.

I took a stab at this in #2760, which I hope is easier for you to review and merge.

@murgatroid99
Member

Right, but the client keepalives should detect the disconnect independently of what the server does. And if only the server detects the disconnect, it can't do anything about it, because it can't reestablish the connection or inform the client of the error.

@murgatroid99
Member

I think this is resolved by your changes in #2760, which are now out in version 1.10.10.
