Serverside keepalive error detection and cleanups #2756

davidfiala · 2024-05-28T18:22:51Z

Improve server-side keepalives, possibly resolve bug where keepalive errors were not being treated as errors.

Ensure that http2 ping failures are detected at request time and in the callback. Code inspection indicates both cases may have silently been ignored before.
Ensure keepalive intervals and timeouts are always cleared.
Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue)
Rename timers/intervals for clarity
Ensure all paths do contain a channelz keepalive trace message with unique error strings.

Note: I think that in nodejs/node#18447 half of the issue was clarified already, but the code just wasn't updated to reflect the answer there. The other half is checking the return value of session.ping for a non-true result, too. Given my experience of keepalives not working in #2734 my current conclusion is that although keepalives are being sent (I inspected via manual logging), failures may not have been properly detected.

…errors were not being treated as errors. - Ensure that http2 ping failures are detected at request time and in the callback. Code inspection indicates both cases may have silently been ignored before. - Ensure keepalive intervals and timeouts are always cleared. - Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue) - Rename timers/intervals for clarity - Ensure all paths do contain a channelz keepalive trace message with unique error strings.

…ace and session destroyed. transport.ts has a more comprehensive approach, but this ensures that an overly ping-happy server will at least take action based on a client's expectations. AIUI, before this patch the server did not take any action on a goaway frame.

murgatroid99 · 2024-05-28T20:14:39Z

Unfortunately, the master branch is actually behind the release branch on this, so your changes are going to conflict with that. How about I merge those changes into master and then we look at this again?

On the plus side, that version does already handle the ping error, so you won't need to do that again. On the minus side, there are two parallel implementations of keepalives for performance reasons, so any changes will need to be made twice.

davidfiala · 2024-05-28T20:41:53Z

Thank you, Michael. Happy to come back to this after the merge. If OTOH you would be happy with me making a PR(s) to fix this in one or more release branches, I can do that too. Whichever you think is easier for you to get it out faster. LMK which to ones target if you think it's better for me to PR to branches.

murgatroid99 · 2024-05-28T20:48:33Z

The current release branch is @grpc/[email protected]. You can make the change on that branch.

- Bugfix: Ensure that if session.ping returns false we correctly identify fail the keepalive and connection - Bugfix: Ensure that if the interval between keepalives being sent occurs faster than the prior keepalive's timeout that we do not overwrite the reference to the prior timeout. Prior implementation could have in theory prevented a valid keepalive timeout from clearing itself. This rewrite keeps every timeout as a local (vs a shared state per session). Even if the timeout outlives the lifetime of a session, we still guard against errors by checking that the parent interval is not false-y. I reckon this could result in a short-term memory leak per session which is bounded for a maximum of keepaliveTimeoutMs. On the other hand even with that potential for a short reference hold, this implementation proposed here is more correct I think. One alternative we could do is keep a list of pending timeouts.. which is complex for a rare situation that will self resolve anyhow when keepaliveTimeoutMs is reached. - Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue) - Rename variables for clarity, to prevent future bugs like swapping clearInterval vs clearTimeout. - Implementation is repeated in two places, per warning from grpc#2756 (comment) - This commit supercedes the prior PR on a master branch which was out of date. grpc#2756

davidfiala · 2024-05-28T21:57:49Z

Thank you- I've created #2760 -- I think we can close this PR here then if it is too stale to use? Closing now.

davidfiala mentioned this pull request May 28, 2024

grpc-js servers not sending keepalives #2734

Closed

davidfiala mentioned this pull request May 28, 2024

Keepalive bugfixes and unify timers strategies between client and server #2760

Merged

davidfiala closed this May 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serverside keepalive error detection and cleanups #2756

Serverside keepalive error detection and cleanups #2756

davidfiala commented May 28, 2024

murgatroid99 commented May 28, 2024

davidfiala commented May 28, 2024

murgatroid99 commented May 28, 2024

davidfiala commented May 28, 2024 •

edited

Loading

Serverside keepalive error detection and cleanups #2756

Serverside keepalive error detection and cleanups #2756

Conversation

davidfiala commented May 28, 2024

murgatroid99 commented May 28, 2024

davidfiala commented May 28, 2024

murgatroid99 commented May 28, 2024

davidfiala commented May 28, 2024 • edited Loading

davidfiala commented May 28, 2024 •

edited

Loading