Consul GitHub Issue: API Gateway xDS snapshot generation fails during service catalog churn #5258

@pgomes-lgc

Consul Version: 1.22.6

Environment: Kubernetes (EKS), Consul deployed via Helm, using Consul API Gateway with ~70+ upstream services routed via HTTPRoutes.

Problem:

When multiple Consul mesh services deregister and re-register simultaneously (e.g. due to node loss, spot interruptions, or rolling deployments), the Consul server fails to generate xDS snapshots for the API Gateway proxy. The API Gateway's Envoy then loses routing to upstream services and returns 503 errors, including for services that were never disrupted.

Error messages on Consul server:

[ERROR] agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = Unavailable desc = failed to generate all xDS resources from the snapshot: failed to generate xDS resources for "type.googleapis.com/envoy.config.listener.v3.Listener": missing route for routeRef http-route:-route"

[WARN] agent.envoy.xds: could not find discovery chain for gateway upstream: service_id=api-gateway-xxx xdsVersion=v3 upstream=

Envoy stats on the API Gateway pod confirm persistent update failures:

cluster_manager.cds.update_failure: 91
listener_manager.lds.update_failure: 91
http.ingress_upstream_8080.rds.update_failure: 91
cluster_manager.cds.update_rejected: 0
listener_manager.lds.update_rejected: 0

The update_rejected counters stay at zero: Envoy never receives a config that it could NACK. Snapshot generation fails entirely on the server side, before anything is sent to the proxy.
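For anyone triaging the same symptom, the distinction above can be checked mechanically. The following is a minimal sketch in Go, not part of Consul or Envoy: it assumes the stats are available as plain "name: value" lines (the format pasted above), and the helper names parseStats and serverSideFailure are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStats reads "name: value" lines into a map.
// Lines whose value is not an integer (e.g. histograms) are skipped.
func parseStats(text string) map[string]int64 {
	stats := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		name, value, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, err := strconv.ParseInt(strings.TrimSpace(value), 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSpace(name)] = n
	}
	return stats
}

// serverSideFailure reports whether an xDS subscription (hypothetical helper)
// shows failed updates but no rejected ones, i.e. the proxy never got a
// config it could NACK.
func serverSideFailure(stats map[string]int64, prefix string) bool {
	return stats[prefix+".update_failure"] > 0 && stats[prefix+".update_rejected"] == 0
}

func main() {
	// Sample values copied from the stats in this report.
	sample := `cluster_manager.cds.update_failure: 91
listener_manager.lds.update_failure: 91
cluster_manager.cds.update_rejected: 0
listener_manager.lds.update_rejected: 0`

	stats := parseStats(sample)
	for _, prefix := range []string{"cluster_manager.cds", "listener_manager.lds"} {
		fmt.Printf("%s server-side failure: %v\n", prefix, serverSideFailure(stats, prefix))
	}
}
```

A non-zero update_failure with a zero update_rejected is only a hint that the control plane is the failing side; the Consul server logs remain the authoritative evidence.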

Impact:

  • The API Gateway loses routes to all upstream services, not just the ones being deregistered
  • Services with healthy pods and multiple replicas return 503s because the gateway can't route to them
  • The failure is transient but can last 10-30+ seconds depending on the volume of catalog changes
  • In our environment, a single spot node termination (24 mesh pods) caused 503s across 10+ unrelated services

Reproduction:

This occurs consistently during any significant catalog churn: spot interruptions, rolling deployments, HPA scale-down events.

Even with Consul server CPU doubled (500m → 1000m) and CONSUL_GRPC_MAX_CONCURRENT_STREAMS increased (256 → 512), the "missing route for routeRef" errors persist during catalog churn. The CPU increase eliminated the errors during steady state but not during bursts.

Expected behaviour:

When the Consul server cannot resolve a discovery chain for one upstream during snapshot generation, it should either:

  1. Serve a partial xDS update excluding the affected upstream, rather than failing the entire snapshot
  2. Retry the snapshot generation after a short delay
  3. Allow Envoy to NACK and retain its previous working configuration

Currently, a single missing route causes the entire listener/cluster/route xDS generation to fail, taking down routing to all upstreams on the gateway.
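To make option 1 concrete, here is a minimal sketch in Go contrasting all-or-nothing generation with a partial update. It is an illustration only: Upstream, Route, buildRoutesStrict and buildRoutesPartial are hypothetical names and do not correspond to Consul's actual xDS or proxycfg code.

```go
package main

import "fmt"

// Upstream is a hypothetical stand-in for a gateway upstream.
type Upstream struct {
	Name     string
	HasChain bool // whether a discovery chain was resolved for it
}

// Route is a hypothetical stand-in for a generated xDS route.
type Route struct{ Upstream string }

// buildRoutesStrict models the reported behaviour: one unresolved
// upstream aborts generation for every upstream.
func buildRoutesStrict(ups []Upstream) ([]Route, error) {
	routes := make([]Route, 0, len(ups))
	for _, u := range ups {
		if !u.HasChain {
			return nil, fmt.Errorf("missing route for upstream %q", u.Name)
		}
		routes = append(routes, Route{Upstream: u.Name})
	}
	return routes, nil
}

// buildRoutesPartial models the requested behaviour: unresolved
// upstreams are skipped and reported, the rest are still served.
func buildRoutesPartial(ups []Upstream) (routes []Route, skipped []string) {
	for _, u := range ups {
		if !u.HasChain {
			skipped = append(skipped, u.Name)
			continue
		}
		routes = append(routes, Route{Upstream: u.Name})
	}
	return routes, skipped
}

func main() {
	ups := []Upstream{
		{Name: "orders", HasChain: true},
		{Name: "billing", HasChain: false}, // mid re-registration
		{Name: "search", HasChain: true},
	}

	if _, err := buildRoutesStrict(ups); err != nil {
		fmt.Println("strict:", err)
	}

	routes, skipped := buildRoutesPartial(ups)
	fmt.Printf("partial: %d routes served, skipped %v\n", len(routes), skipped)
}
```

The trade-off in the partial variant is that traffic to the skipped upstream fails until its discovery chain resolves, but healthy upstreams keep their routes, which matches the impact described in this report.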
