rfq+server: properly configure gRPC keep alive settings for client+server #1834
Conversation
In this commit, we add comprehensive client-side gRPC keepalive parameters to the price oracle client connections. This addresses the root cause of issue lightninglabs#1814 where connections to price oracle servers were being silently closed after idle periods, resulting in "connection reset by peer" errors during RFQ operations.

The key change is adding PermitWithoutStream set to true, which allows the client to send keepalive pings even when there are no active RPC calls. This is essential for long-lived connections that may experience extended idle periods between price queries. Without this setting, idle connections would be closed by intermediaries or the server itself, leaving the client unaware of the broken connection until the next RPC attempt.

We configure the client to ping the server every 30 seconds of inactivity and wait 20 seconds for a response. These values are conservative enough to detect connection issues quickly while avoiding excessive network traffic. The same keepalive parameters are applied to both TLS and insecure (testing-only) connection modes to ensure consistent behavior.

Fixes lightninglabs#1814
In this commit, we enhance the server-side gRPC keepalive configuration to work in coordination with the client-side keepalive settings added in the previous commit. This completes the fix for issue lightninglabs#1814 by ensuring both sides of the connection actively maintain connection health.

Previously, the server only configured MaxConnectionIdle set to 2 minutes, which would aggressively close idle connections. This caused problems for price oracle connections that could be idle for extended periods between RFQ operations. We now extend MaxConnectionIdle to 24 hours and add active health checking through Time and Timeout parameters.

The critical addition is the EnforcementPolicy with PermitWithoutStream set to true. This allows clients to send keepalive pings even when no RPC calls are active, which is essential for long-lived connections. Without this policy, the server would reject client keepalive pings on idle connections, defeating the purpose of the client-side keepalive configuration.

These settings follow the same pattern used by lnd and are based on gRPC's official keepalive recommendations. The combination of active pinging from both client and server, along with permissive policies, ensures connections remain healthy and any network issues are detected promptly rather than discovered only when the next RPC fails.

Fixes lightninglabs#1814
Summary of Changes
Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request resolves persistent "connection reset by peer" errors experienced by users of the RFQ system, particularly with price oracle servers. The core issue stemmed from a lack of client-side keepalive mechanisms and an overly aggressive server-side idle connection timeout. By implementing comprehensive bidirectional gRPC keepalive settings, the system will now actively maintain connections, detect network issues promptly, and prevent intermediaries from silently dropping idle connections, leading to a more robust and reliable user experience for RFQ operations.
Code Review
This pull request addresses a "connection reset by peer" error in the RFQ system by properly configuring gRPC keep-alive settings on both the client and server. The changes introduce bidirectional keep-alive pings to maintain long-lived idle connections and allow for faster detection of broken connections. The implementation is sound and follows best practices for gRPC keep-alive. I have one suggestion to improve code maintainability by reducing duplication in the client-side configuration.
// Configure client-side keepalive parameters. These settings ensure
// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
	// Ping server after 30 seconds of inactivity.
	Time: 30 * time.Second,
	// Wait 20 seconds for ping response.
	Timeout: 20 * time.Second,
	// Permit keepalive pings even when there are no active
	// streams. This is critical for long-lived connections with
	// infrequent RFQ requests.
	PermitWithoutStream: true,
}))
This block of code for configuring keep-alive parameters is identical to the one in serverDialOpts() (lines 200-213). To avoid duplication and improve maintainability, consider extracting these options into a package-level variable.
For example, you could define:
var clientKeepaliveDialOption = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:                30 * time.Second,
	Timeout:             20 * time.Second,
	PermitWithoutStream: true,
})
And then use it in both serverDialOpts() and insecureServerDialOpts() like this:
opts = append(opts, clientKeepaliveDialOption)
This would make the code cleaner and easier to modify in the future.
Pull Request Test Coverage Report for Build 18211832298
💛 - Coveralls
LGTM. As far as I understand, price oracles are public entities of the network that can potentially be queried by a significant number of clients. I understand the client-side keepalive, but I'm wondering what the impact on resources is for the server side.
// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
nit: Do we want this to be a pure copy-paste of the opts on lines 200-214? Can't we reuse?
+1, let's extract some consts or the whole configuration and re-use it
Failing tests are unrelated to the PR.
// stay alive during idle periods.
serverKeepalive := keepalive.ServerParameters{
	// Ping client after 1 minute of inactivity.
	Time: time.Minute,
Isn't it redundant to also have the server pinging the client? Why not just have it be the responsibility of the client?
This mirrors what lnd does, and we haven't had any issues with it. I view it as defense in depth. Can you think of a reason why the server also pinging would somehow disrupt the connection?
	PermitWithoutStream: true,
}

serverOpts = append(
This server is the tapd rpcserver. The oracle server is actually not implemented here, so this has been misplaced. We could add it to the reference implementation in docs/examples/basic-price-oracle.
I don't think this is misplaced. The intent is that we resolve this on the tapd level as well, as we have other streaming calls and no keepalive, so it's possible we run into this issue in other instances.
I can add it to the example oracle as well.
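For context, a rough sketch (not part of this PR's diff) of how the reference oracle in docs/examples/basic-price-oracle could wire the same server-side keepalive settings into its gRPC server. The function name and server construction here are illustrative assumptions, not the example's actual code:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newOracleGRPCServer builds a gRPC server with the keepalive settings
// discussed in this PR. The example oracle would register its price
// oracle service on the returned server as it does today.
func newOracleGRPCServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Ping idle clients every minute and wait 20 seconds
			// for the acknowledgement.
			Time:    time.Minute,
			Timeout: 20 * time.Second,
			// Allow long idle periods between price queries.
			MaxConnectionIdle: 24 * time.Hour,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Accept client keepalive pings even with no active
			// streams, but not more often than every 5 seconds.
			MinTime:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}

This mirrors the tapd server settings discussed above and only shows where the options would plug into the example's server construction.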
Problem
Users were experiencing "connection reset by peer" errors when using the RFQ system with price oracle servers (issue #1814). The core problem manifested when a connection to a price oracle sat idle between RFQ operations and was silently closed by the server or an intermediary, so the next price query failed.
This created a poor user experience where the first payment attempt after an idle period would always fail, requiring manual retries.
Root Cause
The issue had two contributing factors:
- Missing Client-Side Keepalive: The RFQ price oracle client (rfq/oracle.go) had no keepalive configuration at all. Without active health probing, the client couldn't detect when connections were broken and wouldn't prevent intermediaries (NATs, load balancers, firewalls) from timing out idle connections.
- Aggressive Server-Side Timeout: The tapd gRPC server only configured MaxConnectionIdle: 2 minutes, which would aggressively close any connection idle for more than 2 minutes. This is far too short for RFQ operations where price queries may be infrequent.
Solution
We implemented comprehensive bidirectional keepalive configuration following the same pattern used by lnd:
Client-Side Configuration (rfq/oracle.go)
Added to both serverDialOpts() and insecureServerDialOpts() (a sketch follows the key settings below).
Key Settings:
- Time: 30s - Client sends a ping after 30 seconds of inactivity. This is frequent enough to keep connections alive through most intermediaries while being conservative with network resources.
- Timeout: 20s - If the server doesn't respond to a ping within 20 seconds, the connection is considered dead. This allows prompt detection of network issues.
- PermitWithoutStream: true - Critical setting. Allows pings even when no RPC calls are active. Without this, the client wouldn't send pings during idle periods, defeating the entire purpose.
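A minimal sketch of the dial options described above, matching the diff hunk quoted earlier in the review. The helper name withClientKeepalive is hypothetical; in the actual change these options are appended inline in serverDialOpts() and insecureServerDialOpts():

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// withClientKeepalive appends the client-side keepalive settings from
// this PR to an existing set of dial options.
func withClientKeepalive(opts []grpc.DialOption) []grpc.DialOption {
	return append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
		// Ping the oracle server after 30 seconds of inactivity.
		Time: 30 * time.Second,
		// Treat the connection as dead if the ping is not
		// acknowledged within 20 seconds.
		Timeout: 20 * time.Second,
		// Send pings even when no RPCs are in flight.
		PermitWithoutStream: true,
	}))
}

Both the TLS and insecure dial paths would share the same settings, which is essentially the extraction the review comments above suggest.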
Server-Side Configuration (server.go)
Enhanced the existing minimal configuration with comprehensive settings (see the sketch after the key settings below).
Key Settings:
- Time: 1min - Server actively pings clients every minute when idle. This provides server-initiated health checking.
- Timeout: 20s - Server waits 20 seconds for ping responses before considering the connection dead.
- MaxConnectionIdle: 24h - Extended from 2 minutes to 24 hours. Allows long-lived connections that are crucial for RFQ operations. The active pinging mechanism (via the Time parameter) handles health checking, so we don't need aggressive idle timeouts.
- MinTime: 5s - Prevents abusive clients from sending pings too frequently (DoS protection).
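A sketch of the server-side configuration assembled from the settings above and the code fragments quoted earlier in this thread (serverKeepalive, serverOpts); the helper name serverKeepaliveOpts is hypothetical and only groups the two options together:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// serverKeepaliveOpts returns the server-side keepalive options
// described above, ready to be added to tapd's gRPC server options.
func serverKeepaliveOpts() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Actively ping clients after a minute of inactivity
			// and wait 20 seconds for the acknowledgement.
			Time:    time.Minute,
			Timeout: 20 * time.Second,
			// Let connections stay idle for up to a day; the
			// active pings handle health checking.
			MaxConnectionIdle: 24 * time.Hour,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Accept keepalive pings on idle connections, but
			// not more often than every 5 seconds.
			MinTime:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	}
}

The 5-second MinTime pairs with the client's 30-second ping interval, so well-behaved clients are never rejected by the enforcement policy.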
How It Works
With these settings in place:
- The client pings the oracle server after 30 seconds of inactivity, so intermediaries and the server no longer see the connection as idle.
- The server independently pings clients after a minute of inactivity and tolerates idle connections for up to 24 hours.
- A broken connection is detected within the 20-second ping timeout on either side, instead of surfacing as a "connection reset by peer" error on the next RFQ price query.
Fixes #1814