loadbalancing exporter blocks trace consumption when loadbalancer backends change #8843
Comments
Good catch! @bogdandrutu, @codeboten, @tigrannajaryan, do you think calling an exporter's shutdown on a separate goroutine would be in line with the function's semantics?
Side comment: it is quite confusing that an exporter manages other exporters. We need to improve our pipelines so that this can be a processor. As for the specifics of the Shutdown calls/implementation: I believe Shutdown() itself is implemented incorrectly. It sets a flag and exits, whereas the Shutdown contract says:
So, I think Shutdown itself needs to be fixed first of all.
If the above contract about Shutdown's behavior is fulfilled, I think it does not matter whether Shutdown is called on a different goroutine. Shutdown is always called from a different goroutine by the Service, so that is not a unique situation. That the caller of Shutdown does not wait until shutdown is complete is a problem, though: it breaks the pipeline shutdown conceptually. So, in a nutshell: fix Shutdown so it fulfills the contract, and make sure its caller waits for it to complete.
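For illustration, here is a minimal sketch of a Shutdown that behaves the way the contract is described above: it stops accepting new data and then blocks until in-flight exports are drained, or until the caller's deadline expires. This is not the collector's actual code; the exporterStub type and its methods are purely illustrative.

```go
package shutdowndemo

import (
	"context"
	"errors"
	"sync"
)

type exporterStub struct {
	inflight sync.WaitGroup
	stopped  chan struct{}
}

func newExporterStub() *exporterStub {
	return &exporterStub{stopped: make(chan struct{})}
}

// ConsumeTraces rejects data after shutdown has started and registers the
// export as in-flight work before sending. (A real implementation would also
// need to guard against the race between the check and the Add.)
func (e *exporterStub) ConsumeTraces(ctx context.Context) error {
	select {
	case <-e.stopped:
		return errors.New("exporter is shut down")
	default:
	}
	e.inflight.Add(1)
	defer e.inflight.Done()
	// ... send the batch to the backend here ...
	return nil
}

// Shutdown stops accepting new data and blocks until in-flight exports
// finish, or until the caller's context expires.
func (e *exporterStub) Shutdown(ctx context.Context) error {
	close(e.stopped)
	done := make(chan struct{})
	go func() {
		e.inflight.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```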
Agree with all your points, especially about the load balancing being an exporter that has direct access to other components. Once the collector has the pieces in place that allow for a migration, I'll migrate this component.
Thanks, I think this clarifies it. @ralphgj, you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster. I suspect the backend is offline, which is causing the retry to kick in, eventually timing out.
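As a concrete illustration of that suggestion (the values and the resolver hostname below are examples, not taken from this issue), the retry behavior of the inner OTLP exporters can be tightened through the loadbalancing exporter's protocol settings so that an unreachable backend gives up quickly instead of retrying for minutes:

```yaml
exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        retry_on_failure:
          enabled: true
          max_elapsed_time: 30s  # stop retrying much sooner than the default
    resolver:
      dns:
        hostname: sampling-collectors.example.com  # example hostname
```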
Thanks all! @jpkrohling Your suspicion is correct. The problem described above happens when we redeploy the sampling collector (trace data -> load balancing collector -> sampling collector).
The retry logic seems intended for the situation where the sampling collector can't be reached, so tweaking the retry logic may not fit this problem. According to the OTLP exporter comments, the shutdown process drains the queue and goes through the retries, so when the queue is long, shutdown takes even more time.
Could you give us some other suggestions?
Another question: right now the queue seems to live inside the endpoint exporters, and consuming trace data requires selecting an endpoint first, so the consume path is blocked by the endpoint-selection process. Why not put the trace data into a queue during consumption and then drain it from the queue, letting the load balancer send it to the remote endpoints?
The queued retry component does not know about the load balancer, nor the load balancer about the queued retry. Once the load balancer has sent the data to the exporter, it only gets control back upon a failure or success from the underlying exporter, which will only return once the retries have been exhausted. One option would be to have the queued retry used by the load balancer in some way or form, so that people would not use it in the final exporters, but I need to think about the implications of it.
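To make the idea being discussed here concrete, below is a rough sketch of such a decoupling. This is not an existing collector feature; routedExporter, queuedLoadBalancer, and the interface{} batch type are hypothetical stand-ins for the real pdata types and routing logic.

```go
package lbqueuedemo

import (
	"context"
	"errors"
)

// routedExporter stands in for the load balancer's routing + export step,
// which may block on retries against a slow or offline backend.
type routedExporter interface {
	export(ctx context.Context, batch interface{}) error
}

type queuedLoadBalancer struct {
	queue chan interface{}
	lb    routedExporter
}

func newQueuedLoadBalancer(lb routedExporter, size, workers int) *queuedLoadBalancer {
	q := &queuedLoadBalancer{queue: make(chan interface{}, size), lb: lb}
	for i := 0; i < workers; i++ {
		go q.drain()
	}
	return q
}

// ConsumeTraces only enqueues, so endpoint selection and retries can never
// block the caller; when the queue is full the data is rejected instead.
func (q *queuedLoadBalancer) ConsumeTraces(ctx context.Context, batch interface{}) error {
	select {
	case q.queue <- batch:
		return nil
	default:
		return errors.New("queue is full")
	}
}

// drain picks batches off the queue and lets the load balancer choose the
// endpoint and send, off the hot path of ConsumeTraces.
func (q *queuedLoadBalancer) drain() {
	for batch := range q.queue {
		_ = q.lb.export(context.Background(), batch)
	}
}
```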
I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.
@jpkrohling can I take this one?
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners, add a component label (see Adding Labels via Comments), or if you are unsure of which component this issue relates to, please ping the code owners. Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@dgoscn, are you still available to work on this one?
@jpkrohling I will have to leave it for someone else. If no one else takes the assignment, I want to come back and work on this as soon as possible.
Describe the bug
Hi, we found that the loadbalancing exporter sometimes blocks all requests that upload trace data. We added some logs and found that the shutdown of the endpoint exporters that should be removed takes a long time.
In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L121, when the loadbalancer backends change, lb.removeExtraExporters sometimes takes more than 10s, which blocks the consume processing in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/trace_exporter.go#L97. The cause seems to be that onBackendChanges holds the updateLock.
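A deliberately simplified model of this contention (not the actual loadbalancer.go source; the type and method names below are only illustrative): the backend-change handler holds the write lock for the full duration of the shutdowns, and the consume path needs the read lock to pick an endpoint, so a slow Shutdown stalls all incoming traces.

```go
package lbcontention

import (
	"context"
	"sync"
)

// endpointExporter stands in for the per-endpoint OTLP exporters.
type endpointExporter interface {
	Shutdown(ctx context.Context) error
}

type loadBalancerModel struct {
	updateLock sync.RWMutex
	exporters  map[string]endpointExporter
}

// onBackendChanges removes exporters for endpoints that disappeared. It holds
// the write lock while every Shutdown call drains queues and retries, which
// can take many seconds.
func (lb *loadBalancerModel) onBackendChanges(removed []string) {
	lb.updateLock.Lock()
	defer lb.updateLock.Unlock()
	for _, endpoint := range removed {
		if exp, ok := lb.exporters[endpoint]; ok {
			_ = exp.Shutdown(context.Background())
			delete(lb.exporters, endpoint)
		}
	}
}

// exporterFor is what the consume path needs: it blocks on the read lock
// for as long as onBackendChanges holds the write lock above.
func (lb *loadBalancerModel) exporterFor(endpoint string) endpointExporter {
	lb.updateLock.RLock()
	defer lb.updateLock.RUnlock()
	return lb.exporters[endpoint]
}
```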
Can we make the shutdown process async in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L152, for example along these lines?
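Below is a hedged sketch of that idea, assuming a hypothetical shutdownAsync helper rather than the exact change proposed in the original report: the slow drain is handed to its own goroutine with a bounded timeout, so removeExtraExporters can return and release the updateLock immediately, and failures are surfaced through a callback instead of being dropped silently.

```go
package lbasyncdemo

import (
	"context"
	"time"
)

type endpointExporter interface {
	Shutdown(ctx context.Context) error
}

// shutdownAsync shuts an exporter down on a separate goroutine with a bounded
// timeout, so the caller does not block on the drain.
func shutdownAsync(exp endpointExporter, timeout time.Duration, onErr func(error)) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		if err := exp.Shutdown(ctx); err != nil {
			onErr(err)
		}
	}()
}
```

Note that, as discussed in the comments above, this trades away the guarantee that the caller has waited for the removed exporters to finish draining, so the concern about the Shutdown contract still applies.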
What version did you use?
Version: 0.42.0
What config did you use?
Config:
Environment
docker image: otel/opentelemetry-collector-contrib:0.42.0