Fix network boundary deadlocks #240
Conversation
I threw Claude Opus 4 at it and it came up with a pretty in-depth analysis, copy pasted below. Its TL;DR is that your "trial and error" fix is absolutely correct.

Claude Opus 4 Analysis

Network Shuffle Deadlock Analysis

Summary

This document analyzes a deadlock that occurs when multiple Flight streams sharing an HTTP/2 connection are polled through `futures::stream::select_all`.

Root Cause: HTTP/2 Connection-Level Flow Control

Background: HTTP/2 Flow Control

HTTP/2 implements flow control at two levels (RFC 9113 §5.2): per-stream windows and a single connection-wide window shared by every stream on the connection.
The h2 crate documentation describes this behavior explicitly (see the flow-control section of h2/src/lib.rs linked below).

Key Finding: Capacity Release is Explicit

The h2 library requires explicit capacity release via `FlowControl::release_capacity`; until the receiver releases capacity, no WINDOW_UPDATE is sent and the sender's window does not reopen.
Hyper calls `release_capacity` only when the incoming body is polled for data:

```rust
// From hyper-1.7.0/src/body/incoming.rs, lines 243-248
// https://github.com/hyperium/hyper/blob/v1.7.0/src/body/incoming.rs#L243-L248
match ready!(h2.poll_data(cx)) {
    Some(Ok(bytes)) => {
        let _ = h2.flow_control().release_capacity(bytes.len());
        // ...
    }
    // other arms omitted
}
```

This means capacity is only released for a stream while its body is actively being polled; data that sits unconsumed keeps occupying the flow-control windows.

The Deadlock Mechanism

When multiple Flight streams target the same server, they share a single HTTP/2 connection (due to channel caching). The deadlock occurs through this sequence:

1. The Flight streams are merged with `select_all` and drained from a single consumer task.
2. That task stops pulling data from one or more of the streams, for example because the consumer of the merged stream is slow or is waiting on something else.
3. Since hyper only releases capacity when data is polled, the unconsumed data keeps occupying the shared connection-level window.
4. Once the 65,535-byte connection window is exhausted, the server cannot send data for any stream on that connection.
5. Even the streams that are still being polled receive nothing, no capacity is ever released, and the query hangs: a deadlock.
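To make the failure mode concrete, here is a minimal, hypothetical sketch of the pre-fix pattern, with every stream merged by `select_all` and drained from one task. The function name `drain_shuffle_streams` and the exact item types are illustrative assumptions, not code from this repository:

```rust
use arrow::record_batch::RecordBatch;
use arrow_flight::error::FlightError;
use futures::stream::{select_all, Stream, StreamExt};

/// Drain several Flight streams (typically all sharing one HTTP/2 connection)
/// from a single task by merging them with `select_all`.
async fn drain_shuffle_streams<S>(streams: Vec<S>) -> Result<Vec<RecordBatch>, FlightError>
where
    S: Stream<Item = Result<RecordBatch, FlightError>> + Send + Unpin + 'static,
{
    let mut merged = select_all(streams);
    let mut batches = Vec::new();

    while let Some(item) = merged.next().await {
        // If this loop stops being polled, or one stream's data is never pulled,
        // hyper never calls `release_capacity` for it, the shared connection
        // window fills up, and every stream on the connection stalls.
        batches.push(item?);
    }
    Ok(batches)
}
```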
Why Separate Tasks Fix This

With separate tasks, tokio's scheduler ensures fair CPU time allocation to all stream-polling tasks, so no single slow stream can starve the others. Every stream keeps being polled, and therefore keeps releasing HTTP/2 capacity, regardless of how quickly its consumer drains it.
Why Unbounded Channels Are Necessary

Bounded channels would reintroduce the deadlock: if a channel fills up because its consumer is slow, the producer task blocks on the send, stops polling its HTTP/2 stream, and stops releasing capacity, so the shared connection window can be exhausted again.
Unbounded channels ensure producers never block, allowing continuous HTTP/2 consumption. Memory safety is ensured by integrating with DataFusion's `MemoryPool` through memory reservations.

The Fix

The fix combines the following properties (a sketch of the pattern follows the table):
| Property | Purpose |
|---|---|
| Separate tasks | Tokio's scheduler ensures fair polling of all streams |
| Unbounded channels | Producers never block, ensuring continuous HTTP/2 consumption |
| Memory reservations | DataFusion's MemoryPool provides backpressure at query level |
| Task lifecycle | Tasks kept alive via closure capture, proper cleanup on drop |
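For illustration, here is a minimal sketch of the task-per-stream pattern under the assumptions above. The function name `spawn_stream_reader`, the consumer name, and the use of DataFusion's `Result` as the stream item are illustrative assumptions; this is not the actual `spawn_select_all` code from this PR:

```rust
use std::sync::Arc;

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::common::Result;
use datafusion::execution::memory_pool::{MemoryConsumer, MemoryPool};
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;

/// Spawn a dedicated task that drains one Flight stream into an unbounded
/// channel, so the HTTP/2 body keeps being polled (and keeps releasing
/// flow-control capacity) no matter how fast the downstream consumer is.
fn spawn_stream_reader<S>(
    stream: S,
    pool: &Arc<dyn MemoryPool>,
) -> mpsc::UnboundedReceiver<Result<RecordBatch>>
where
    S: Stream<Item = Result<RecordBatch>> + Send + 'static,
{
    let (tx, rx) = mpsc::unbounded_channel();
    let mut reservation = MemoryConsumer::new("flight_stream_reader").register(pool);

    tokio::spawn(async move {
        let mut stream = Box::pin(stream);
        while let Some(item) = stream.next().await {
            let item = item.and_then(|batch| {
                // Account for the buffered batch so DataFusion's MemoryPool can
                // apply query-level backpressure; a real implementation would
                // shrink the reservation again once the batch is consumed.
                reservation.try_grow(batch.get_array_memory_size())?;
                Ok(batch)
            });
            // Unbounded send never blocks; if the receiver is gone, stop polling.
            if tx.send(item).is_err() {
                break;
            }
        }
        // Dropping `reservation` here returns the accounted memory to the pool.
    });

    rx
}
```

Because the returned receiver is backed by an in-memory channel rather than the HTTP/2 body, it can safely be wrapped in a stream and merged with `select_all` without starving the network-side polling.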
Relevant Dependency Code
h2 (HTTP/2 implementation)
- Flow control documentation: h2/src/lib.rs#L49-L77
- FlowControl struct: h2/src/share.rs#L142-L197
- release_capacity implementation: h2/src/share.rs#L516-L539
- Default window size constant: h2/src/proto/streams/flow_control.rs (65,535 bytes)
hyper (HTTP client/server)
- Capacity release on data poll: hyper/src/body/incoming.rs#L243-L248
- H2 body implementation: hyper/src/body/incoming.rs#L153-L171
tonic (gRPC implementation)
- Connection settings (window sizes): tonic/src/transport/channel/service/connection.rs#L33-L57
- Channel endpoint configuration: tonic/src/transport/channel/endpoint.rs
Configuration Notes
Default HTTP/2 Window Sizes
| Setting | Default Value | Source |
|---|---|---|
| Initial stream window | 65,535 bytes | RFC 9113 §6.9.2 |
| Initial connection window | 65,535 bytes | RFC 9113 §6.9.2 |
Tonic Configuration Options
Tonic allows configuring window sizes via Endpoint:
```rust
Channel::from_static("http://example.com")
    .initial_stream_window_size(some_value)
    .initial_connection_window_size(some_value)
```

However, increasing window sizes only delays the problem; it doesn't eliminate the fundamental fairness issue with `select_all`.
Conclusion
The `spawn_select_all` implementation is the correct solution, not a workaround. It follows the established pattern of isolating network I/O into independent tasks to prevent flow-control issues and ensure fair scheduling.
Key Takeaways
- HTTP/2 connection-level flow control creates implicit dependencies between streams on the same connection
- The h2/hyper stack requires explicit capacity release, which only happens when data is polled
- `futures::stream::select_all` is unsuitable for HTTP/2 streams that share a connection, due to potential starvation
- Separate tasks + unbounded channels (with memory tracking) is the robust solution
I don't fully understand why this happens, but as seen in #228, deadlocks where the queries hang forever happen very often when dealing with TPCH benchmarks in a real environment (a cluster of EC2 machines and Parquet files in S3).

By trial and error, I saw that this PR fixes them, but I don't fully understand why; I think what we were doing before should be completely fine.

There's likely something else going on in either the Arrow Flight client or server networking code, something like an async task not waking up when it should, that's causing this.

As this fixes the error, I propose shipping it and continuing to investigate why this happens.