@gavin-jeong

publish-envoy bot and others added 5 commits September 3, 2025 01:25
Created by Envoy publish bot for @yanavlasov

**Summary of changes**:

* Security fixes:
  - Fix for OAuth cookie issue
    [CVE-2025-55162](GHSA-95j4-hw7f-v2rh).
  - Fix UAF in DNS resolution
    [CVE-2025-54588](GHSA-g9vw-6pvx-7gmw).

**Docker images**:
    https://hub.docker.com/r/envoyproxy/envoy/tags?page=1&name=v1.34.6
**Docs**:
    https://www.envoyproxy.io/docs/envoy/v1.34.6/
**Release notes**:
    https://www.envoyproxy.io/docs/envoy/v1.34.6/version_history/v1.34/v1.34.6
**Full changelog**:
    envoyproxy/envoy@v1.34.5...v1.34.6

Include code formatting improvements for consistent style in trace test files.

This commit introduces QUIC/HTTP3 keylog functionality in Envoy, enabling generation of NSS Key Log Format files for Wireshark and other debugging tools.

- Keylog callback registration in OnNewSslCtx()
- Implementation of EnvoyQuicProofSource::setupQuicKeylogCallback() and quicKeylogCallback()
- TLS context-based keylog configuration with per-filter-chain caching and thread safety
- Address filtering via local/remote IP lists
- Fallback to SSLKEYLOGFILE environment variable for compatibility with existing workflows
- QuicKeylogBridge integration with Envoy’s existing TLS keylog infrastructure
- RawBufferSocket fallback fix in QuicServerTransportSocketFactory::createDownstreamTransportSocket()
- Comprehensive unit tests including edge cases
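A minimal sketch of two of the behaviors listed above: the SSLKEYLOGFILE environment-variable fallback and the line-oriented NSS Key Log Format. The helper names are hypothetical; the real implementation registers a keylog callback on the SSL_CTX in OnNewSslCtx() rather than formatting lines by hand.

```cpp
#include <cstdlib>
#include <string>

// Pick the configured keylog path, falling back to the SSLKEYLOGFILE
// environment variable for compatibility with existing workflows.
std::string keylogPath(const std::string& configured_path) {
  if (!configured_path.empty()) {
    return configured_path;
  }
  const char* env = std::getenv("SSLKEYLOGFILE");
  return env != nullptr ? std::string(env) : std::string();
}

// NSS Key Log Format is line-oriented:
// "<LABEL> <client_random_hex> <secret_hex>"
std::string nssKeylogLine(const std::string& label,
                          const std::string& client_random_hex,
                          const std::string& secret_hex) {
  return label + " " + client_random_hex + " " + secret_hex;
}
```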

Signed-off-by: Chanhun Jeong <[email protected]>
@gavin-jeong force-pushed the update_to_1_34_6 branch 2 times, most recently from 46ebf51 to d230b12 on September 30, 2025 07:52
keyolk and others added 2 commits October 1, 2025 09:52
Protect all async callbacks from accessing deallocated cluster members
during destruction by adding is_destroying_ atomic flag checks.

Affected callbacks:
- ClusterRefreshManager callbacks
- DNS resolution callbacks
- Connection event callbacks
- Timer callbacks
- Redis client response callbacks (onResponse, onFailure, onUnexpectedResponse)
- Hostname resolution callbacks

The race condition occurred when callbacks were already queued in the
event loop when cluster destruction began, causing use-after-free access
to parent cluster members like info_, redis_discovery_session_, and
resolve_timer_.

All callbacks now check is_destroying_ with memory_order_acquire before
accessing any parent members, ensuring safe termination during destruction.

Fixes segfaults that occurred when removing Redis service entries.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
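A minimal compilable sketch of the guard this commit adds (class and member names are stand-ins, not Envoy's actual types; later commits in this PR refine the pattern further):

```cpp
#include <atomic>
#include <functional>

// The destructor publishes is_destroying_ with release semantics; every
// queued callback re-checks the flag with acquire semantics before touching
// any parent member.
class RedisClusterLike {
public:
  ~RedisClusterLike() { is_destroying_.store(true, std::memory_order_release); }

  std::function<void()> makeResponseCallback() {
    return [this]() {
      if (is_destroying_.load(std::memory_order_acquire)) {
        return; // teardown has begun; do not touch members
      }
      ++responses_handled_; // stands in for touching info_, timers, etc.
    };
  }

  int responses_handled_{0};

private:
  std::atomic<bool> is_destroying_{false};
};
```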
…ter destruction

Problem: Segmentation faults occur when accessing member pointers in async
callbacks during Redis cluster destruction, even with is_destroying_ flag
checks. This happens because there's a race window between checking the
flag and accessing the pointers.

Solution: Add defensive null checks for all pointer accesses that could
become invalid during destruction:

1. ClusterInfo pointer (info_):
   - Add null checks before all configUpdateStats() calls
   - Use safe access pattern for name() in log statements
   - Locations: startResolveRedis(), updateDnsStats(), DNS callbacks,
     onResponse(), onUnexpectedResponse(), onFailure()

2. DNS Resolver pointer (dns_resolver_):
   - Add null checks in startResolveDns()
   - Add checks in resolveClusterHostnames() and resolveReplicas()
   - Prevents crashes when DNS resolution is initiated during teardown

3. Timer pointer (resolve_timer_):
   - Add null checks before enableTimer() calls
   - Locations: finishClusterHostnameResolution(), onResponse(),
     onUnexpectedResponse(), onFailure()

4. Consistency fix:
   - Line 714: Changed parent_.info() to parent_.info_ to match
     null-checked pattern used elsewhere

The pattern applied throughout:
1. Check is_destroying_ flag with memory_order_acquire
2. Verify each pointer is non-null before dereferencing
3. This dual-check handles the race window safely

This prevents use-after-free crashes during Redis cluster teardown when
async callbacks execute after partial destruction has begun.
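The flag-then-null dual check described above, as a compilable sketch (type names are illustrative, not Envoy's actual classes):

```cpp
#include <atomic>
#include <memory>
#include <string>

struct ClusterInfo {
  std::string name() const { return "redis_cluster"; }
};

struct ClusterLike {
  std::atomic<bool> is_destroying_{false};
  std::shared_ptr<ClusterInfo> info_;

  // 1. Check is_destroying_ with acquire semantics.
  // 2. Verify the pointer is non-null before dereferencing.
  std::string safeClusterName() {
    if (is_destroying_.load(std::memory_order_acquire)) {
      return ""; // teardown has begun; bail out early
    }
    if (!info_) {
      return ""; // pointer already released during partial destruction
    }
    return info_->name();
  }
};
```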
The previous fix with null checks still had a race condition window between
checking the pointer and using it. Even with the null check, the shared_ptr
could be reset to null by another thread between the check and use.

Solution: Make local copies of shared_ptr before use. This ensures the
pointer remains valid throughout its usage in the current scope.

Changes:
1. startResolveRedis(): Copy info_ to local variable before use
2. updateDnsStats(): Use local copy of info_
3. DNS callbacks: Use local copy for stats updates
4. onResponse(), onUnexpectedResponse(), onFailure(): Use local copies
5. client_factory_.create(): Check and use local copy of info_

The pattern applied:
  auto info = parent_.info_;  // Make local copy (ref count++)
  if (!info) {                // Check if null
    return;
  }
  info->method();             // Safe to use - won't become null

This prevents the crash at line 376 where info_ was becoming null
between the check and the access, even with memory_order_acquire.
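The local-copy pattern above, expanded into a compilable sketch (Parent and ClusterInfo are illustrative stand-ins for the real members):

```cpp
#include <memory>
#include <string>

struct ClusterInfo {
  std::string name() const { return "redis_cluster"; }
};

struct Parent {
  std::shared_ptr<ClusterInfo> info_;
};

// Copy the shared_ptr first; the copy keeps the ClusterInfo alive for the
// rest of this scope even if parent.info_ is reset by the destructor.
std::string safeName(const Parent& parent) {
  auto info = parent.info_; // local copy (ref count++)
  if (!info) {
    return ""; // already released during destruction
  }
  return info->name(); // safe to use: our copy cannot become null
}
```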
…ing timer callbacks

The 5% crash rate was caused by timer callbacks executing after the
RedisDiscoverySession was destroyed. Even though we checked is_destroying_,
there was a race where:

1. Timer callback fires and enters the lambda
2. Destructor runs and deletes the session (unique_ptr reset)
3. Callback tries to access parent_.is_destroying_ → CRASH (use-after-free)

Solution:
- Move timer creation from constructor to initialize() method
- Capture shared_from_this() in timer lambda instead of raw 'this'
- Call initialize() after RedisDiscoverySession construction completes

This ensures the session object stays alive as long as any timer callback
is queued or executing, preventing the use-after-free.

Pattern changed from:
  resolve_timer_ = dispatcher_.createTimer([this]() { ... });

To:
  auto self = shared_from_this();
  resolve_timer_ = dispatcher_.createTimer([self]() { ... });

This should eliminate the remaining 5% crash rate during Redis cluster
destruction.
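The before/after above can be sketched as follows; FakeTimer stands in for Envoy's dispatcher timer API, which cannot be compiled standalone:

```cpp
#include <functional>
#include <memory>

struct FakeTimer {
  std::function<void()> cb; // stands in for the real dispatcher timer
};

class Session : public std::enable_shared_from_this<Session> {
public:
  // shared_from_this() is unusable inside the constructor, which is why the
  // commit moves timer creation into a post-construction initialize() step.
  void initialize(FakeTimer& timer) {
    auto self = shared_from_this(); // the capture keeps the session alive
    timer.cb = [self]() { self->fired_ = true; };
  }
  bool fired_{false};
};
```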
…nt reference

CRITICAL FIX: The previous approach had a fatal flaw - callbacks with
shared_from_this() kept the session alive, but the session holds a
reference to the parent RedisCluster. When the parent was destroyed,
accessing parent_.is_destroying_ became use-after-free.

The race condition:
1. Timer callback fires with shared_ptr<Session> (session kept alive)
2. RedisCluster destructor runs and completes
3. Callback tries to check parent_.is_destroying_
4. CRASH - parent object destroyed, reference is dangling

Solution:
- Add parent_destroyed_ atomic flag IN THE SESSION
- Parent sets this flag BEFORE destroying session
- Callbacks check session-owned flag, never access parent directly
- Also simplify all safety checks into helper methods

This is the correct fix for the 5% crash rate when removing Redis services.
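A compilable sketch of this final design, under the same caveat as before (illustrative stand-in types, not Envoy's actual classes): the flag the callbacks consult is owned by the session itself, so a callback that keeps the session alive never has to dereference the parent.

```cpp
#include <atomic>
#include <memory>

class Session : public std::enable_shared_from_this<Session> {
public:
  void markParentDestroyed() {
    parent_destroyed_.store(true, std::memory_order_release);
  }
  bool parentAlive() const {
    return !parent_destroyed_.load(std::memory_order_acquire);
  }

private:
  std::atomic<bool> parent_destroyed_{false};
};

struct Cluster {
  std::shared_ptr<Session> session_{std::make_shared<Session>()};
  ~Cluster() {
    session_->markParentDestroyed(); // set BEFORE dropping the session
    session_.reset();
  }
};
```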