
[connector/failover] Simplified synchronous failover mode without retry logic #37496

Open · swar8080 opened this issue Jan 26, 2025 · 3 comments
Labels: connector/failover, enhancement (New feature or request), needs triage (New item requiring triage)

@swar8080 (Contributor)
Component(s)

connector/failover

Is your feature request related to a problem? Please describe.

There's an open issue for the loadbalancing exporter where spans are lost when collectors restart or scale. The idea of using the failover connector to handle this was mentioned by the author of the loadbalancing component: #36717 (comment)

We also run into this issue when load balancing spans before they're held in memory for tail sampling. When load balancing fails, it'd be fine to fall back to exporting the spans directly to our observability vendor without sampling, roughly as sketched below.
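For concreteness, here's a rough sketch of the wiring we have in mind (component names like `otlphttp/vendor` and the DNS hostname are placeholders, and the failover connector's exact config keys should be double-checked against its README):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  # preferred path: load-balance by trace ID to the tail-sampling collectors
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: tail-sampling-collectors.example.com
  # fallback path: send unsampled spans straight to the vendor
  otlphttp/vendor:
    endpoint: https://otlp.example-vendor.com

connectors:
  failover:
    priority_levels:
      - [traces/loadbalance]  # first choice
      - [traces/direct]       # fallback

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/loadbalance:
      receivers: [failover]
      exporters: [loadbalancing]
    traces/direct:
      receivers: [failover]
      exporters: [otlphttp/vendor]
```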

It looks like the failover connector could solve the above problem, but I wonder if the complexity of the retry logic is a deterrent to this component's adoption. The component's alpha status and non-trivial implementation make us wary about using it without thorough testing, just in case it causes bugs on the 99.9%+ happy path. Testing also seems tricky because there could be subtle race conditions, like in #36587, that rarely get exercised.

Describe the solution you'd like

An option to enable a simpler failover mode where the connector synchronously tries all exporters in priority order until one succeeds or there are none left to try.

This would solve the "load balance for sampling or export without sampling" use-case and also "export or verbosely log the failed telemetry". Not sure if there are other use-cases, but it might encourage adoption if there's an easier way to get started without tuning the retry logic.
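To illustrate the shape this could take, a hypothetical config (the `synchronous` key below is made up for this sketch and does not exist in the component today):

```yaml
connectors:
  failover:
    # hypothetical option: on failure, fall through to the next priority
    # level within the same consume call instead of scheduling retries
    synchronous: true
    priority_levels:
      - [traces/loadbalance]
      - [traces/direct]
```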

A downside might be encouraging less resilient collector set-ups: an exporter facing an outage would be retried on every single export, in cases where the existing fail-fast retry logic would be the better fit.

Describe alternatives you've considered

Experimenting with the load balancing exporter and this component to find a configuration that works decently. Just documenting this config could be enough.
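For reference, such a configuration would mostly come down to tuning the connector's existing retry settings, along these lines (the values are guesses, and key names and defaults should be verified against the component's README):

```yaml
connectors:
  failover:
    priority_levels:
      - [traces/loadbalance]
      - [traces/direct]
    retry_interval: 5m   # how often to probe failed higher-priority levels
    retry_gap: 30s       # spacing between probes of different priority levels
    max_retries: 5       # stop probing a level after this many failed attempts
```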

Additional context

No response


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@akats7 (Contributor) commented Jan 26, 2025

Hi @swar8080, thanks for this issue. It's an interesting idea, but my concern is that at high throughput it would add back pressure to the export pipelines if each export had to go through multiple failed attempts in a failover scenario.

It's an option, but I think that, similar to the retry logic, there would be certain usage patterns it wouldn't fit too well.

I have been planning to switch up the retry logic to do something similar to what you described, but for only one data point. That is, one data point would be sampled for retry evaluation and would, in parallel with the main export pipeline, synchronously go through every higher-priority pipeline.

@swar8080 (Contributor, Author)

> I have been planning to switch up the retry logic to do something similar to what you described, but for only one data point. That is, one data point would be sampled for retry evaluation and would, in parallel with the main export pipeline, synchronously go through every higher-priority pipeline.

Not sure I'm following how this works, but I'll keep an eye out for changes to this component :)
