
grpc-js servers not sending keepalives #2734

Closed
davidfiala opened this issue May 2, 2024 · 9 comments

@davidfiala
Contributor

I've noticed that grpc-js servers do not seem to be sending keepalives to gRPC clients connected over a long-lived stream (though so far I've only observed this within my larger applications).

I'm happy to create a tiny PoC and upload it here, but before I build it, I wanted to clarify whether the grpc-js ChannelOptions are in fact intended to make gRPC servers send PINGs to clients, or whether they were only meant for clients and I misunderstood the docs. I can confirm that clients are happily sending PINGs as expected in 1.10.6.

For example, I've instrumented the grpc-js server.ts file to log when pings should happen, but I don't see any. The server has long-lived bidi streams from a client and is configured with:

createServer({
  'grpc.keepalive_time_ms': args.grpc_server_keepalive_ms,
  'grpc.keepalive_timeout_ms': args.grpc_server_keepalive_timeout_ms,
  'grpc.keepalive_permit_without_calls': 1,
})

... yet no pings are sent and no keepalive disconnects occur in my app so far.
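
For comparison, here is a minimal sketch (not my actual app) of the same configuration against plain @grpc/grpc-js, with placeholder values in place of my args:

import * as grpc from '@grpc/grpc-js';

// Hypothetical standalone repro: a plain grpc-js server configured with the
// keepalive options above (values are placeholders).
const server = new grpc.Server({
  'grpc.keepalive_time_ms': 6 * 60 * 1000,    // send a PING every 6 minutes
  'grpc.keepalive_timeout_ms': 20 * 1000,     // wait 20s for the PING ack
  'grpc.keepalive_permit_without_calls': 1,   // ping even when no calls are active
});

server.bindAsync(
  '0.0.0.0:50051',
  grpc.ServerCredentials.createInsecure(),
  (err, port) => {
    if (err) throw err;
    console.log(`listening on ${port}`);
    // On recent grpc-js versions, I believe start() is no longer needed after bindAsync.
  }
);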

From https://grpc.io/docs/guides/keepalive/ it looks like the Availability column lists both Client and Server for keepalives. And https://github.com/grpc/grpc/blob/master/doc/keepalive.md describes closing the connection "if the ping is not acknowledged by the peer".

I haven't ruled out a bug in my own setup yet, but I would like to verify that the intention is that the client and server can each independently set keepalives and transmit their own PINGs. If so, I'll be happy to make a PoC and continue this bug with my findings.

Thank you for your time!

@murgatroid99
Member

Server-side keepalives were added recently in grpc-js, but they should be working in version 1.10.6. However, they're not logged on the server the way they are on the client, so you can't verify that they're being sent in the same way.

What are your observations that suggest that the server is not sending pings?

@davidfiala
Contributor Author

davidfiala commented May 2, 2024

Originally I noticed because GCP Cloud NAT has a 10-minute limit for idle connections with no TCP traffic: it just drops the TCP connection for any VM running on GCE if no packets flow. I had a long-lived gRPC bidi stream that fit this profile, and grpc-js was not reporting a disconnect on either the client or the server. The connection just hangs forever, and grpc-js doesn't realize it's dead even when I later try to use the channel.

On the server: I set grpc.keepalive_time_ms to 6 minutes and grpc.keepalive_timeout_ms to 20 seconds, but it didn't have any effect: the connections still just died without any indication. Then I added logging to server.ts in _sessionHandler, but I didn't observe any logged statements (maybe I did it wrong), and it didn't help the problem.

Moving on to the client:

I then added the same keepalive channel options to my gRPC clients and added logging in transport.ts. The logging works, and the keepalives now keep my GCE TCP connections alive even when idle.
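
For reference, the client-side setup is roughly the following sketch; EchoClient and its import path are placeholders for my real generated client, and the values are illustrative:

import * as grpc from '@grpc/grpc-js';
// EchoClient is a placeholder for whatever generated service client is in use.
import { EchoClient } from './generated/echo_grpc_pb';

const client = new EchoClient(
  'my-host.example.com:443',
  grpc.credentials.createSsl(),
  {
    'grpc.keepalive_time_ms': 6 * 60 * 1000,   // well under the 10-minute idle cutoff
    'grpc.keepalive_timeout_ms': 20 * 1000,
    'grpc.keepalive_permit_without_calls': 1,
  }
);

// With GRPC_TRACE=keepalive and GRPC_VERBOSITY=DEBUG set, the client-side
// pings show up in the trace output.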

As an aside, I often wrap grpc with the nice-grpc package, so I haven't ruled out that dependency as a cause yet.

EDIT: in all cases, there was a long-lived bidi stream. This wasn't a situation where I had an open channel without an ongoing RPC.

@davidfiala
Contributor Author

davidfiala commented May 27, 2024

Haven't forgotten about this. I just made a PoC with unary and streaming RPCs using base grpc-js from head, and keepalives originating from both the client and the server do seem to transmit. I'm next trying to see whether my usage through nice-grpc is somehow causing the options to not get set correctly.

I've identified a few areas for improvement, hoping you'll consider a PR for them:

  • session.ping should probably disconnect if it detects an error. Right now, the code simply clears the keepalive timeout regardless of the arguments passed to the session.ping callback:
    clearTimeout(timeoutTImer);
    This makes me think that error cases on the ping are being ignored as if they were successes.
    (edit): not only should the err argument be checked, but also whether session.ping returns false, which indicates the ping was not sent (per https://nodejs.org/api/http2.html#http2sessionpingpayload-callback). See the sketch after this list.
  • add trace messaging to both client and server; as you mentioned, it's only partially present in the server-side code right now.
  • optional: unify how the keepalive code works in client and server. The server uses a setInterval, while the client's code repeatedly uses setTimeout IIRC. The setInterval strategy on the server allows multiple pings to potentially overlap, whereas the client appears to protect against this at first glance.
  • the server-side code doesn't appear to allow you to turn off keepalives.
  • clean up a couple of unnecessary non-null assertion operators: https://github.com/search?q=repo%3Agrpc%2Fgrpc-node%20session!&type=code
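
To illustrate the first bullet, here's a rough sketch of the ping handling I have in mind, written against Node's http2 API rather than the actual grpc-js code; both the err callback argument and a false return value are treated as failures:

import type * as http2 from 'http2';

// Sketch only: treat both an error passed to the ping callback and a `false`
// return from session.ping() (ping not sent) as keepalive failures.
function sendKeepalivePing(session: http2.ServerHttp2Session, timeoutMs: number): void {
  const timeoutTimer = setTimeout(() => {
    // No ack within the timeout: assume the connection is dead.
    session.close();
  }, timeoutMs);

  const pingSent = session.ping((err) => {
    clearTimeout(timeoutTimer);
    if (err) {
      // A ping error should not be treated as a successful keepalive.
      session.close();
    }
  });

  if (!pingSent) {
    // Per the Node docs, ping() returns false when the ping could not be sent
    // (e.g. the session is destroyed or maxOutstandingPings is exceeded).
    clearTimeout(timeoutTimer);
    session.close();
  }
}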

@davidfiala
Contributor Author

I've drafted up some fixes in #2756 for what looks (to my eyes) like the most obvious potential cause; let me know if you like the changes.

@davidfiala
Contributor Author

davidfiala commented May 28, 2024

I'm kind of surprised that no one else has noticed this bug, especially since GCP's firewall auto-drops connections after 10 minutes of inactivity.

Hypothesis:

  • gRPC users who focus on unary RPCs would either receive a timeout error or the channel would silently reconnect on the next RPC, so they'd never know they were interrupted beyond a hang until timeout.
  • not many people use long-lived streams that happen to sit inactive for periods longer than 10 minutes.

OTOH, the code inspection plus nodejs/node#18447 seem like strong evidence that the behaviors I was observing are consistent with how the code currently works.

If you have any thoughts, let me know. I'm happy to submit a second PR to improve keepalives on the client side as well, including logging. Getting this right is pretty important for my use case.

@murgatroid99
Member

Here's my guess for why this bug wasn't noticed:

  • The keepalive functionality was added to this server implementation relatively recently. On top of that, keepalives are less impactful on the server than on the client, because the client is responsible for reconnecting when a connection drops. As a result, it's likely that few users have keepalives enabled at all.
  • Node can detect and report connection drops on its own, so failing to detect keepalive errors only matters when the connection drop is also not detected in any other way.
  • The persistence of stale connections and even stale streams should be relatively unimpactful on the server. You might see a bit of memory leakage, or a stream that has become inactive, but often no visible failures. This is in contrast to the client, where a stale connection can prevent the client from making requests and/or being responsive.

@davidfiala
Contributor Author

FWIW, in my original bug I observed that neither the client nor the server was detecting the disconnects. This is because when GCP decided to drop a connection, it did not send any kind of TCP RST to either party. So both sides simply thought the TCP connection was still alive, while the GCP router in the middle silently dropped packets past the 10-minute idle mark with zero notification.

I took a stab at this in #2760, which I hope is easier for you to review and merge.

@murgatroid99
Member

Right, but the client keepalives should detect the disconnect independently of what the server does. And if only the server detects the disconnect, it can't do anything about it, because it can't reestablish the connection or inform the client of the error.

@murgatroid99
Member

I think this is resolved by your changes in #2760, which are now out in version 1.10.10.
