Consul GitHub Issue: API Gateway xDS snapshot generation fails during service catalog churn
Consul Version: 1.22.6
Environment: Kubernetes (EKS), Consul deployed via Helm, using Consul API Gateway with ~70+ upstream services routed via HTTPRoutes.
Problem:
When multiple Consul mesh services deregister and re-register simultaneously (e.g. due to node loss, spot interruptions, or rolling deployments), the Consul server fails to generate xDS snapshots for the API Gateway proxy. The API Gateway's Envoy then loses routing to upstream services and returns 503 errors, including for services that were never disrupted.
Error messages on Consul server:
[ERROR] agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = Unavailable desc = failed to generate all xDS resources from the snapshot: failed to generate xDS resources for "type.googleapis.com/envoy.config.listener.v3.Listener": missing route for routeRef http-route:-route"
[WARN] agent.envoy.xds: could not find discovery chain for gateway upstream: service_id=api-gateway-xxx xdsVersion=v3 upstream=
Envoy stats on the API Gateway pod confirm persistent update failures:
cluster_manager.cds.update_failure: 91
listener_manager.lds.update_failure: 91
http.ingress_upstream_8080.rds.update_failure: 91
cluster_manager.cds.update_rejected: 0
listener_manager.lds.update_rejected: 0
Zero rejections: Envoy never receives a config that it could NACK, because the server-side snapshot generation fails entirely.
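The distinction drawn above (update_failure climbing while update_rejected stays at zero) can be checked mechanically. The following is a minimal sketch, assuming the stats are available as plain "name: value" text as pasted above. parseStats and classify are hypothetical helper names for illustration only, not part of Consul or Envoy.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStats reads "name: value" lines into a map. Lines whose value is not
// an integer are skipped.
func parseStats(text string) map[string]int {
	stats := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		name, value, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(value))
		if err != nil {
			continue
		}
		stats[strings.TrimSpace(name)] = n
	}
	return stats
}

// classify (hypothetical helper) labels one xDS subscription by its counters:
//   "rejected"       - Envoy received a config and NACKed it
//   "stream-failure" - updates failed without any config being rejected
//   "ok"             - neither counter is non-zero
func classify(stats map[string]int, prefix string) string {
	failed := stats[prefix+".update_failure"]
	rejected := stats[prefix+".update_rejected"]
	switch {
	case rejected > 0:
		return "rejected"
	case failed > 0:
		return "stream-failure"
	default:
		return "ok"
	}
}

func main() {
	// Sample values copied from the stats in this issue.
	sample := `cluster_manager.cds.update_failure: 91
listener_manager.lds.update_failure: 91
cluster_manager.cds.update_rejected: 0
listener_manager.lds.update_rejected: 0`

	stats := parseStats(sample)
	for _, p := range []string{"cluster_manager.cds", "listener_manager.lds"} {
		fmt.Printf("%s: %s\n", p, classify(stats, p))
	}
}
```

With the values from this issue, both CDS and LDS classify as "stream-failure", which matches the observation that the failure happens before any config reaches Envoy.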
Impact:
- The API Gateway loses routes to all upstream services, not just the ones being deregistered
- Services with healthy pods and multiple replicas return 503s because the gateway can't route to them
- The failure is transient but can last 10-30+ seconds depending on the volume of catalog changes
- In our environment, a single spot node termination (24 mesh pods) caused 503s across 10+ unrelated services
Reproduction:
This occurs during any significant catalog churn, such as spot interruptions, rolling deployments, and HPA scale-down events. It happens consistently.
Even with Consul server CPU doubled (500m → 1000m) and CONSUL_GRPC_MAX_CONCURRENT_STREAMS increased (256 → 512), the "missing route for routeRef" errors persist during catalog churn. The CPU increase eliminated the errors during steady state but not during bursts.
Expected behaviour:
When the Consul server cannot resolve a discovery chain for one upstream during snapshot generation, it should do one of the following:
- Serve a partial xDS update excluding the affected upstream, rather than failing the entire snapshot
- Retry the snapshot generation after a short delay
- Allow Envoy to NACK and retain its previous working configuration
Currently, a single missing route causes the entire listener/cluster/route xDS generation to fail, taking down routing to all upstreams on the gateway.