rfq+server: properly configure gRPC keep alive settings for client+server #1834
Conversation
In this commit, we add comprehensive client-side gRPC keepalive parameters to the price oracle client connections. This addresses the root cause of issue lightninglabs#1814 where connections to price oracle servers were being silently closed after idle periods, resulting in "connection reset by peer" errors during RFQ operations.

The key change is adding PermitWithoutStream set to true, which allows the client to send keepalive pings even when there are no active RPC calls. This is essential for long-lived connections that may experience extended idle periods between price queries. Without this setting, idle connections would be closed by intermediaries or the server itself, leaving the client unaware of the broken connection until the next RPC attempt.

We configure the client to ping the server every 30 seconds of inactivity and wait 20 seconds for a response. These values are conservative enough to detect connection issues quickly while avoiding excessive network traffic. The same keepalive parameters are applied to both TLS and insecure (testing-only) connection modes to ensure consistent behavior.

Fixes lightninglabs#1814
In this commit, we enhance the server-side gRPC keepalive configuration to work in coordination with the client-side keepalive settings added in the previous commit. This completes the fix for issue lightninglabs#1814 by ensuring both sides of the connection actively maintain connection health.

Previously, the server only configured MaxConnectionIdle set to 2 minutes, which would aggressively close idle connections. This caused problems for price oracle connections that could be idle for extended periods between RFQ operations. We now extend MaxConnectionIdle to 24 hours and add active health checking through Time and Timeout parameters.

The critical addition is the EnforcementPolicy with PermitWithoutStream set to true. This allows clients to send keepalive pings even when no RPC calls are active, which is essential for long-lived connections. Without this policy, the server would reject client keepalive pings on idle connections, defeating the purpose of the client-side keepalive configuration.

These settings follow the same pattern used by lnd and are based on gRPC's official keepalive recommendations. The combination of active pinging from both client and server, along with permissive policies, ensures connections remain healthy and any network issues are detected promptly rather than discovered only when the next RPC fails.

Fixes lightninglabs#1814
Summary of Changes
Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request resolves persistent "connection reset by peer" errors experienced by users of the RFQ system, particularly with price oracle servers. The core issue stemmed from a lack of client-side keepalive mechanisms and an overly aggressive server-side idle connection timeout. By implementing comprehensive bidirectional gRPC keepalive settings, the system will now actively maintain connections, detect network issues promptly, and prevent intermediaries from silently dropping idle connections, leading to a more robust and reliable user experience for RFQ operations.
Code Review
This pull request addresses a "connection reset by peer" error in the RFQ system by properly configuring gRPC keep-alive settings on both the client and server. The changes introduce bidirectional keep-alive pings to maintain long-lived idle connections and allow for faster detection of broken connections. The implementation is sound and follows best practices for gRPC keep-alive. I have one suggestion to improve code maintainability by reducing duplication in the client-side configuration.
// Configure client-side keepalive parameters. These settings ensure
// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
	// Ping server after 30 seconds of inactivity.
	Time: 30 * time.Second,
	// Wait 20 seconds for ping response.
	Timeout: 20 * time.Second,
	// Permit keepalive pings even when there are no active
	// streams. This is critical for long-lived connections with
	// infrequent RFQ requests.
	PermitWithoutStream: true,
}))
This block of code for configuring keep-alive parameters is identical to the one in serverDialOpts() (lines 200-213). To avoid duplication and improve maintainability, consider extracting these options into a package-level variable.
For example, you could define:
var clientKeepaliveDialOption = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:                30 * time.Second,
	Timeout:             20 * time.Second,
	PermitWithoutStream: true,
})
And then use it in both serverDialOpts() and insecureServerDialOpts() like this:
opts = append(opts, clientKeepaliveDialOption)
This would make the code cleaner and easier to modify in the future.
Pull Request Test Coverage Report for Build 18211832298
💛 - Coveralls
LGTM. As far as I understand, price oracles are public entities of the network that can potentially be queried by a significant number of clients. I understand the client-side keepalive, but I'm wondering what the impact on resources is for the server side.
// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
nit: Do we want this to be a pure copy-paste of the opts on lines 200-214? Can't we reuse?
+1, let's extract some consts or the whole configuration and re-use it
Failing tests are unrelated to the PR.
// stay alive during idle periods.
serverKeepalive := keepalive.ServerParameters{
	// Ping client after 1 minute of inactivity.
	Time: time.Minute,
Isn't it redundant to also have the server pinging the client? Why not just have it be the responsibility of the client?
This mirrors what lnd does, and we haven't had any issues with it. I view it as defense in depth. Can you think of a reason why the server also pinging would somehow disrupt the connection?
	PermitWithoutStream: true,
}

serverOpts = append(
This server is the tapd rpcserver. The oracle server is actually not implemented here, so this has been misplaced. We could add it to the reference implementation in docs/examples/basic-price-oracle.
I don't think this is misplaced. The intent is that we resolve this on the tapd level as well, as we have other streaming calls and no keepalive, so it's possible we run into this issue in other instances.
I can add it to the example oracle as well.
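For context, a rough sketch (not part of this PR's diff) of how the reference oracle in docs/examples/basic-price-oracle could wire the same server-side keepalive settings into its gRPC server. The function name and server construction here are illustrative assumptions, not the example's actual code:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newOracleGRPCServer builds a gRPC server with the keepalive settings
// discussed in this PR. The example oracle would register its price
// oracle service on the returned server as it does today.
func newOracleGRPCServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Ping idle clients every minute and wait 20 seconds
			// for the acknowledgement.
			Time:    time.Minute,
			Timeout: 20 * time.Second,
			// Allow long idle periods between price queries.
			MaxConnectionIdle: 24 * time.Hour,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Accept client keepalive pings even with no active
			// streams, but not more often than every 5 seconds.
			MinTime:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}

This mirrors the tapd server settings discussed above and only shows where the options would plug into the example's server construction.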
Problem
Users were experiencing "connection reset by peer" errors when using the RFQ system with price oracle servers (issue #1814). The core problem manifested when a connection to a price oracle sat idle between RFQ operations and was silently closed by the server or an intermediary, so the next price query failed.
This created a poor user experience where the first payment attempt after an idle period would always fail, requiring manual retries.
Root Cause
The issue had two contributing factors:
- Missing Client-Side Keepalive: The RFQ price oracle client (rfq/oracle.go) had no keepalive configuration at all. Without active health probing, the client couldn't detect when connections were broken and wouldn't prevent intermediaries (NATs, load balancers, firewalls) from timing out idle connections.
- Aggressive Server-Side Timeout: The tapd gRPC server only configured MaxConnectionIdle: 2 minutes, which would aggressively close any connection idle for more than 2 minutes. This is far too short for RFQ operations where price queries may be infrequent.
Solution
We implemented comprehensive bidirectional keepalive configuration following the same pattern used by lnd:
Client-Side Configuration (rfq/oracle.go)
Added to both serverDialOpts() and insecureServerDialOpts() (a sketch follows the key settings below).
Key Settings:
- Time: 30s - Client sends a ping after 30 seconds of inactivity. This is frequent enough to keep connections alive through most intermediaries while being conservative with network resources.
- Timeout: 20s - If the server doesn't respond to a ping within 20 seconds, the connection is considered dead. This allows prompt detection of network issues.
- PermitWithoutStream: true - Critical setting. Allows pings even when no RPC calls are active. Without this, the client wouldn't send pings during idle periods, defeating the entire purpose.
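A minimal sketch of the dial options described above, matching the diff hunk quoted earlier in the review. The helper name withClientKeepalive is hypothetical; in the actual change these options are appended inline in serverDialOpts() and insecureServerDialOpts():

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// withClientKeepalive appends the client-side keepalive settings from
// this PR to an existing set of dial options.
func withClientKeepalive(opts []grpc.DialOption) []grpc.DialOption {
	return append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
		// Ping the oracle server after 30 seconds of inactivity.
		Time: 30 * time.Second,
		// Treat the connection as dead if the ping is not
		// acknowledged within 20 seconds.
		Timeout: 20 * time.Second,
		// Send pings even when no RPCs are in flight.
		PermitWithoutStream: true,
	}))
}

Both the TLS and insecure dial paths would share the same settings, which is essentially the extraction the review comments above suggest.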
Server-Side Configuration (server.go)
Enhanced the existing minimal configuration with comprehensive settings (see the sketch after the key settings below).
Key Settings:
- Time: 1min - Server actively pings clients every minute when idle. This provides server-initiated health checking.
- Timeout: 20s - Server waits 20 seconds for ping responses before considering the connection dead.
- MaxConnectionIdle: 24h - Extended from 2 minutes to 24 hours. Allows long-lived connections that are crucial for RFQ operations. The active pinging mechanism (via the Time parameter) handles health checking, so we don't need aggressive idle timeouts.
- MinTime: 5s - Prevents abusive clients from sending pings too frequently (DoS protection).
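A sketch of the server-side configuration assembled from the settings above and the code fragments quoted earlier in this thread (serverKeepalive, serverOpts); the helper name serverKeepaliveOpts is hypothetical and only groups the two options together:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// serverKeepaliveOpts returns the server-side keepalive options
// described above, ready to be added to tapd's gRPC server options.
func serverKeepaliveOpts() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Actively ping clients after a minute of inactivity
			// and wait 20 seconds for the acknowledgement.
			Time:    time.Minute,
			Timeout: 20 * time.Second,
			// Let connections stay idle for up to a day; the
			// active pings handle health checking.
			MaxConnectionIdle: 24 * time.Hour,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Accept keepalive pings on idle connections, but
			// not more often than every 5 seconds.
			MinTime:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	}
}

The 5-second MinTime pairs with the client's 30-second ping interval, so well-behaved clients are never rejected by the enforcement policy.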
How It Works
With these settings in place:
- The client pings the oracle server after 30 seconds of inactivity, so intermediaries and the server no longer see the connection as idle.
- The server independently pings clients after a minute of inactivity and tolerates idle connections for up to 24 hours.
- A broken connection is detected within the 20-second ping timeout on either side, instead of surfacing as a "connection reset by peer" error on the next RFQ price query.
Fixes #1814