
Conversation

Roasbeef
Member

@Roasbeef Roasbeef commented Oct 3, 2025

Problem

Users were experiencing "connection reset by peer" errors when using the RFQ system with price oracle servers (issue #1814). The core problem manifested when:

  1. An edge node establishes a connection to a price oracle server
  2. The connection sits idle between RFQ price queries (potentially for minutes or hours)
  3. The network layer silently closes the idle connection without sending a proper GOAWAY frame
  4. The client remains unaware of the broken connection
  5. The next RFQ request fails immediately with "connection reset by peer"
  6. A retry succeeds because it establishes a fresh connection

This created a poor user experience where the first payment attempt after an idle period would always fail, requiring manual retries.

Root Cause

The issue had two contributing factors:

Missing Client-Side Keepalive: The RFQ price oracle client (rfq/oracle.go) had no keepalive configuration at all. Without active health probing, the client couldn't detect when connections were broken and wouldn't prevent intermediaries (NATs, load balancers, firewalls) from timing out idle connections.

Aggressive Server-Side Timeout: The tapd gRPC server configured only MaxConnectionIdle: 2 minutes, which aggressively closed any connection idle for more than two minutes. This is far too short for RFQ operations, where price queries may be infrequent.
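
For reference, the removed configuration amounts to roughly the following sketch (reconstructed from the description above, not the verbatim original code):

// Previous minimal server keepalive configuration (sketch): only an
// idle timeout, no active health checking. Any connection idle for
// more than two minutes was torn down, often without the client
// noticing.
serverKeepalive := keepalive.ServerParameters{
    MaxConnectionIdle: 2 * time.Minute,
}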

Solution

We implemented comprehensive bidirectional keepalive configuration following the same pattern used by lnd:

Client-Side Configuration (rfq/oracle.go)

Added to both serverDialOpts() and insecureServerDialOpts():

grpc.WithKeepaliveParams(keepalive.ClientParameters{
    Time:                30 * time.Second,  // Ping every 30s when idle
    Timeout:             20 * time.Second,  // Wait 20s for ping response
    PermitWithoutStream: true,              // Ping even without active RPCs
})

Key Settings:

  • Time: 30s - Client sends a ping after 30 seconds of inactivity. This is frequent enough to keep connections alive through most intermediaries while being conservative with network resources.
  • Timeout: 20s - If the server doesn't respond to a ping within 20 seconds, the connection is considered dead. This allows prompt detection of network issues.
  • PermitWithoutStream: true - Critical setting. Allows pings even when no RPC calls are active. Without this, the client wouldn't send pings during idle periods, defeating the entire purpose.
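
Put together, the relevant portion of serverDialOpts() ends up looking roughly like the sketch below. The function signature and the surrounding option assembly are illustrative assumptions; only the keepalive option itself reflects this change:

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    "google.golang.org/grpc/keepalive"
)

// serverDialOpts returns the dial options used to connect to the
// price oracle server. The creds parameter and overall shape are
// assumptions for illustration.
func serverDialOpts(creds credentials.TransportCredentials) []grpc.DialOption {
    opts := []grpc.DialOption{
        grpc.WithTransportCredentials(creds),
    }

    // Actively probe connection health so idle connections aren't
    // silently closed by intermediaries or the server.
    opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                30 * time.Second,
        Timeout:             20 * time.Second,
        PermitWithoutStream: true,
    }))

    return opts
}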

Server-Side Configuration (server.go)

Enhanced the existing minimal configuration with comprehensive settings:

serverKeepalive := keepalive.ServerParameters{
    Time:              time.Minute,      // Ping clients every 1min when idle
    Timeout:           20 * time.Second, // Wait 20s for ping response
    MaxConnectionIdle: time.Hour * 24,   // Allow 24h of idle time
}

clientKeepalive := keepalive.EnforcementPolicy{
    MinTime:             5 * time.Second, // Min time between client pings
    PermitWithoutStream: true,            // Accept client pings without active RPCs
}

Key Settings:

  • Time: 1min - Server actively pings clients every minute when idle. This provides server-initiated health checking.
  • Timeout: 20s - Server waits 20 seconds for ping responses before considering the connection dead.
  • MaxConnectionIdle: 24h - Extended from 2 minutes to 24 hours. Allows long-lived connections that are crucial for RFQ operations. The active pinging mechanism (via Time parameter) handles health checking, so we don't need aggressive idle timeouts.
  • EnforcementPolicy.PermitWithoutStream: true - Critical setting. Tells the server to accept keepalive pings from clients even when there are no active RPC calls. Without this, the server would reject client pings on idle connections with a GOAWAY, breaking the client-side keepalive mechanism.
  • MinTime: 5s - Prevents abusive clients from sending pings too frequently (DoS protection).
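
For illustration, here is roughly how both structs get attached to the gRPC server. The helper function is hypothetical, but grpc.KeepaliveParams and grpc.KeepaliveEnforcementPolicy are the standard gRPC-Go server options for these structs:

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// keepaliveServerOpts is a hypothetical helper showing how the
// parameters above become gRPC server options; the actual assembly in
// server.go may differ.
func keepaliveServerOpts() []grpc.ServerOption {
    serverKeepalive := keepalive.ServerParameters{
        Time:              time.Minute,
        Timeout:           20 * time.Second,
        MaxConnectionIdle: time.Hour * 24,
    }
    clientKeepalive := keepalive.EnforcementPolicy{
        MinTime:             5 * time.Second,
        PermitWithoutStream: true,
    }

    return []grpc.ServerOption{
        // Server-initiated pings plus the relaxed idle limit.
        grpc.KeepaliveParams(serverKeepalive),
        // Rules for accepting client-initiated pings.
        grpc.KeepaliveEnforcementPolicy(clientKeepalive),
    }
}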

How It Works

With these settings in place:

  1. During Idle Periods: Both client and server send periodic pings (client every 30s, server every 1min) to verify the connection is still alive
  2. Network Issue Detection: If a ping doesn't get a response within 20s, the connection is immediately closed and the application is notified
  3. Intermediary Keepalive: The regular pings prevent NATs, load balancers, and firewalls from timing out the connection
  4. No Silent Failures: Connections are never silently broken - either they work or they fail fast with proper notification
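
One way to observe the fail-fast behavior from the client side is to watch the connection's state transitions: with the settings above, a dead connection leaves the Ready state within roughly Time + Timeout (about 50s) instead of lingering until the next RPC fails. The watcher below is purely illustrative and not part of this PR:

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/connectivity"
)

// watchConnHealth logs every connectivity state transition on a
// client connection until it shuts down or the context is canceled.
func watchConnHealth(ctx context.Context, conn *grpc.ClientConn) {
    for {
        state := conn.GetState()
        log.Printf("oracle conn state: %v", state)
        if state == connectivity.Shutdown {
            return
        }

        // Block until the state changes or ctx is done.
        if !conn.WaitForStateChange(ctx, state) {
            return
        }
    }
}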

Fixes #1814

In this commit, we add comprehensive client-side gRPC keepalive
parameters to the price oracle client connections. This addresses the
root cause of issue lightninglabs#1814 where connections to price oracle servers
were being silently closed after idle periods, resulting in "connection
reset by peer" errors during RFQ operations.

The key change is adding PermitWithoutStream set to true, which allows
the client to send keepalive pings even when there are no active RPC
calls. This is essential for long-lived connections that may experience
extended idle periods between price queries. Without this setting, idle
connections would be closed by intermediaries or the server itself,
leaving the client unaware of the broken connection until the next RPC
attempt.

We configure the client to ping the server every 30 seconds of
inactivity and wait 20 seconds for a response. These values are
conservative enough to detect connection issues quickly while avoiding
excessive network traffic. The same keepalive parameters are applied to
both TLS and insecure (testing-only) connection modes to ensure
consistent behavior.

Fixes lightninglabs#1814

In this commit, we enhance the server-side gRPC keepalive configuration
to work in coordination with the client-side keepalive settings added
in the previous commit. This completes the fix for issue lightninglabs#1814 by
ensuring both sides of the connection actively maintain connection
health.

Previously, the server only configured MaxConnectionIdle set to 2
minutes, which would aggressively close idle connections. This caused
problems for price oracle connections that could be idle for extended
periods between RFQ operations. We now extend MaxConnectionIdle to 24
hours and add active health checking through Time and Timeout
parameters.

The critical addition is the EnforcementPolicy with PermitWithoutStream
set to true. This allows clients to send keepalive pings even when no
RPC calls are active, which is essential for long-lived connections.
Without this policy, the server would reject client keepalive pings on
idle connections, defeating the purpose of the client-side keepalive
configuration.

These settings follow the same pattern used by lnd and are based on
gRPC's official keepalive recommendations. The combination of active
pinging from both client and server, along with permissive policies,
ensures connections remain healthy and any network issues are detected
promptly rather than discovered only when the next RPC fails.

Fixes lightninglabs#1814
@Roasbeef Roasbeef added this to the v0.7 milestone Oct 3, 2025

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves persistent "connection reset by peer" errors experienced by users of the RFQ system, particularly with price oracle servers. The core issue stemmed from a lack of client-side keepalive mechanisms and an overly aggressive server-side idle connection timeout. By implementing comprehensive bidirectional gRPC keepalive settings, the system will now actively maintain connections, detect network issues promptly, and prevent intermediaries from silently dropping idle connections, leading to a more robust and reliable user experience for RFQ operations.

Highlights

  • Client-Side Keepalive Configuration: Implemented grpc.WithKeepaliveParams in rfq/oracle.go for both secure and insecure dial options, setting a 30-second ping interval, 20-second timeout, and crucially, PermitWithoutStream: true to maintain connections during idle periods.
  • Server-Side Keepalive Configuration: Updated server.go to use comprehensive keepalive.ServerParameters with a 1-minute ping interval, 20-second timeout, and an extended MaxConnectionIdle of 24 hours, replacing the previous aggressive 2-minute timeout.
  • Server-Side Client Keepalive Enforcement: Added keepalive.EnforcementPolicy to the server, allowing it to accept client pings even without active RPCs (PermitWithoutStream: true) and setting a MinTime of 5 seconds between client pings for abuse prevention.
  • Resolution of "Connection Reset by Peer" Errors: These changes collectively address the "connection reset by peer" errors by ensuring active health probing and preventing network intermediaries from silently closing idle gRPC connections, significantly improving the reliability of RFQ price queries.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a "connection reset by peer" error in the RFQ system by properly configuring gRPC keep-alive settings on both the client and server. The changes introduce bidirectional keep-alive pings to maintain long-lived idle connections and allow for faster detection of broken connections. The implementation is sound and follows best practices for gRPC keep-alive. I have one suggestion to improve code maintainability by reducing duplication in the client-side configuration.

Comment on lines +228 to +241
// Configure client-side keepalive parameters. These settings ensure
// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
    // Ping server after 30 seconds of inactivity.
    Time: 30 * time.Second,

    // Wait 20 seconds for ping response.
    Timeout: 20 * time.Second,

    // Permit keepalive pings even when there are no active
    // streams. This is critical for long-lived connections with
    // infrequent RFQ requests.
    PermitWithoutStream: true,
}))

medium

This block of code for configuring keep-alive parameters is identical to the one in serverDialOpts() (lines 200-213). To avoid duplication and improve maintainability, consider extracting these options into a package-level variable.

For example, you could define:

var clientKeepaliveDialOption = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:                30 * time.Second,
	Timeout:             20 * time.Second,
	PermitWithoutStream: true,
})

And then use it in both serverDialOpts() and insecureServerDialOpts() like this:

opts = append(opts, clientKeepaliveDialOption)

This would make the code cleaner and easier to modify in the future.

@coveralls

Pull Request Test Coverage Report for Build 18211832298

Details

  • 43 of 58 (74.14%) changed or added relevant lines in 2 files are covered.
  • 23 unchanged lines in 8 files lost coverage.
  • Overall coverage increased (+0.02%) to 49.671%

Changes Missing Coverage             Covered Lines   Changed/Added Lines   %
rfq/oracle.go                        15              30                    50.0%

Files with Coverage Reduction        New Missed Lines   %
tapdb/assets_common.go               2                  72.13%
tapdb/assets_store.go                2                  77.77%
tapgarden/custodian.go               2                  71.73%
universe/supplyverifier/manager.go   2                  64.53%
itest/multisig.go                    3                  97.94%
rpcserver.go                         4                  61.31%
tapdb/multiverse.go                  4                  76.2%
tapdb/universe.go                    4                  57.23%

Totals Coverage Status
Change from base Build 18145713853: +0.02%
Covered Lines: 55858
Relevant Lines: 112456

💛 - Coveralls

@Roasbeef Roasbeef requested review from a team, darioAnongba and GeorgeTsagk and removed request for a team October 3, 2025 21:15
Contributor

@darioAnongba darioAnongba left a comment

LGTM. As far as I understand, price oracles are public entities on the network that can potentially be queried by a significant number of clients. I understand the client-side keepalive, but I'm wondering what the impact on server-side resources is.

// the client actively probes the connection health and prevents idle
// connections from being silently closed by intermediaries or the
// server.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
Contributor

nit: Do we want this to be a pure copy-paste of the opts on lines 200-214? Can't we reuse?

Member

+1, let's extract some consts or the whole configuration and re-use it

@darioAnongba
Contributor

Failing tests are unrelated to the PR.

// stay alive during idle periods.
serverKeepalive := keepalive.ServerParameters{
    // Ping client after 1 minute of inactivity.
    Time: time.Minute,
Member

Isn't it redundant to also have the server pinging the client?

Why not just have it be the responsibility of the client?

Member Author

This mirrors what lnd does, and we haven't had any issues with it. I view it as defense in depth. Can you think of a reason why the server also pinging would somehow disrupt the connection?

    PermitWithoutStream: true,
}

serverOpts = append(
Member

This server is the tapd rpcserver. The oracle server is actually not implemented here, so this has been misplaced. We could add it to the reference implementation in docs/examples/basic-price-oracle.

Member Author

I don't think this is misplaced. The intent is that we resolve this at the tapd level as well: we have other streaming calls and no keepalive, so it's possible we run into this issue in other instances.

Member Author

I can add it to the example oracle as well.
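
For the example oracle, that could look something like the sketch below, assuming it builds its server via grpc.NewServer (mirroring the tapd settings above):

// Hypothetical: server-side keepalive for the reference oracle in
// docs/examples/basic-price-oracle, mirroring the tapd configuration.
grpcServer := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        Time:              time.Minute,
        Timeout:           20 * time.Second,
        MaxConnectionIdle: time.Hour * 24,
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             5 * time.Second,
        PermitWithoutStream: true,
    }),
)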

Successfully merging this pull request may close these issues:

[bug]: Issue with Paying Sats Invoices Using Taproot Assets via sendPayment with rfqId