
[connector/failover] Simplified synchronous failover mode without retry logic #37496

Open · swar8080 opened this issue Jan 26, 2025 · 3 comments
Labels: connector/failover, enhancement (New feature or request), needs triage (New item requiring triage)

@swar8080 (Contributor)
Component(s)

connector/failover

Is your feature request related to a problem? Please describe.

There's an open issue for the loadbalancing exporter where spans are lost when collectors restart or scale. The idea of using the failover connector to handle this was mentioned by the author of the loadbalancing component: #36717 (comment)

We also run into this issue when load balancing spans before they're held in memory for tail sampling. When load balancing fails, it'd be fine to fall back to exporting the spans directly to our observability vendor without sampling, roughly as sketched below.
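For concreteness, here's a rough sketch of the wiring we have in mind (component names like `otlphttp/vendor` and the DNS hostname are placeholders, and the failover connector's exact config keys should be double-checked against its README):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  # preferred path: load-balance by trace ID to the tail-sampling collectors
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: tail-sampling-collectors.example.com
  # fallback path: send unsampled spans straight to the vendor
  otlphttp/vendor:
    endpoint: https://otlp.example-vendor.com

connectors:
  failover:
    priority_levels:
      - [traces/loadbalance]  # first choice
      - [traces/direct]       # fallback

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/loadbalance:
      receivers: [failover]
      exporters: [loadbalancing]
    traces/direct:
      receivers: [failover]
      exporters: [otlphttp/vendor]
```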

It looks like the failover connector could solve the above problem, but I wonder if the complexity of the retry logic is a deterrent to this component's adoption. The component's alpha status and non-trivial implementation make us wary about using it without thorough testing, just in case it causes bugs on the 99.9%+ happy path. Testing also seems tricky because there could be subtle race conditions, like in #36587, that rarely get exercised.

Describe the solution you'd like

An option to enable a simpler failover mode where the connector synchronously tries all exporters in priority order until one succeeds or there are none left to try.

This would solve the "load balance for sampling or export without sampling" use-case and also "export or verbosely log the failed telemetry". Not sure if there are other use-cases, but it might encourage adoption if there's an easier way to get started without tuning the retry logic.
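To illustrate the shape this could take, a hypothetical config (the `synchronous` key below is made up for this sketch and does not exist in the component today):

```yaml
connectors:
  failover:
    # hypothetical option: on failure, fall through to the next priority
    # level within the same consume call instead of scheduling retries
    synchronous: true
    priority_levels:
      - [traces/loadbalance]
      - [traces/direct]
```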

A downside might be encouraging less resilient collector set-ups: an exporter facing an outage would be retried on every single export, in cases where the existing fail-fast retry logic would be the better fit.

Describe alternatives you've considered

Experimenting with the load balancing exporter and this component to find a configuration that works decently. Just documenting this config could be enough.
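For reference, such a configuration would mostly come down to tuning the connector's existing retry settings, along these lines (the values are guesses, and key names and defaults should be verified against the component's README):

```yaml
connectors:
  failover:
    priority_levels:
      - [traces/loadbalance]
      - [traces/direct]
    retry_interval: 5m   # how often to probe failed higher-priority levels
    retry_gap: 30s       # spacing between probes of different priority levels
    max_retries: 5       # stop probing a level after this many failed attempts
```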

Additional context

No response


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@akats7 (Contributor) commented Jan 26, 2025

Hi @swar8080, thanks for this issue. It's an interesting idea, but my concern is that at high throughput it would add back pressure to the export pipelines if each export had to go through multiple failed attempts in a failover scenario.

It's an option, but I think that, similar to the retry logic, there would be certain usage patterns it wouldn't fit too well.

I have been planning to switch up the retry logic to do something similar to what you described, but for only one data point. That is, one data point would be sampled for retry evaluation and would, in parallel with the main export pipeline, synchronously go through every higher-priority pipeline.

@swar8080 (Contributor, Author)

> I have been planning to switch up the retry logic to do something similar to what you described, but for only one data point. That is, one data point would be sampled for retry evaluation and would, in parallel with the main export pipeline, synchronously go through every higher-priority pipeline.

Not sure I'm following how this works, but I'll keep an eye out for changes to this component :)
