-
Notifications
You must be signed in to change notification settings - Fork 649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serverside keepalive error detection and cleanups #2756
Conversation
…errors were not being treated as errors. - Ensure that http2 ping failures are detected at request time and in the callback. Code inspection indicates both cases may have silently been ignored before. - Ensure keepalive intervals and timeouts are always cleared. - Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue) - Rename timers/intervals for clarity - Ensure all paths do contain a channelz keepalive trace message with unique error strings.
…ace and session destroyed. transport.ts has a more comprehensive approach, but this ensures that an overly ping-happy server will at least take action based on a client's expectations. AIUI, before this patch the server did not take any action on a goaway frame.
Unfortunately, the master branch is actually behind the release branch on this, so your changes are going to conflict with that. How about I merge those changes into master and then we look at this again? On the plus side, that version does already handle the ping error, so you won't need to do that again. On the minus side, there are two parallel implementations of keepalives for performance reasons, so any changes will need to be made twice. |
Thank you, Michael. Happy to come back to this after the merge. If OTOH you would be happy with me making a PR(s) to fix this in one or more release branches, I can do that too. Whichever you think is easier for you to get it out faster. LMK which to ones target if you think it's better for me to PR to branches. |
The current release branch is |
- Bugfix: Ensure that if session.ping returns false we correctly identify fail the keepalive and connection - Bugfix: Ensure that if the interval between keepalives being sent occurs faster than the prior keepalive's timeout that we do not overwrite the reference to the prior timeout. Prior implementation could have in theory prevented a valid keepalive timeout from clearing itself. This rewrite keeps every timeout as a local (vs a shared state per session). Even if the timeout outlives the lifetime of a session, we still guard against errors by checking that the parent interval is not false-y. I reckon this could result in a short-term memory leak per session which is bounded for a maximum of keepaliveTimeoutMs. On the other hand even with that potential for a short reference hold, this implementation proposed here is more correct I think. One alternative we could do is keep a list of pending timeouts.. which is complex for a rare situation that will self resolve anyhow when keepaliveTimeoutMs is reached. - Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue) - Rename variables for clarity, to prevent future bugs like swapping clearInterval vs clearTimeout. - Implementation is repeated in two places, per warning from grpc#2756 (comment) - This commit supercedes the prior PR on a master branch which was out of date. grpc#2756
Thank you- I've created #2760 -- I think we can close this PR here then if it is too stale to use? Closing now. |
Improve server-side keepalives, possibly resolve bug where keepalive errors were not being treated as errors.
Note: I think that in nodejs/node#18447 half of the issue was clarified already, but the code just wasn't updated to reflect the answer there. The other half is checking the return value of
session.ping
for a non-true result, too. Given my experience of keepalives not working in #2734 my current conclusion is that although keepalives are being sent (I inspected via manual logging), failures may not have been properly detected.