loadbalancing exporter blocks consuming traces when loadbalancer backends change #8843

Open
ralphgj opened this issue Mar 25, 2022 · 11 comments
Labels
bug (Something isn't working) · exporter/loadbalancing · never stale (Issues marked with this label will never be staled or automatically removed)

Comments

@ralphgj
Contributor

ralphgj commented Mar 25, 2022

Describe the bug
Hi, we found that the loadbalancing exporter sometimes blocks all requests that upload trace data. After adding some logs, we found that shutting down the endpoint exporters that should be removed takes a long time.

In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L121,

	if !newRing.equal(lb.ring) {
		lb.updateLock.Lock()
		defer lb.updateLock.Unlock()

		lb.ring = newRing

		// TODO: set a timeout?
		ctx := context.Background()

		// add the missing exporters first
		lb.addMissingExporters(ctx, resolved)
		lb.removeExtraExporters(ctx, resolved)
	}

When the loadbalancer backends changed, lb.removeExtraExporters sometimes took more than 10s, which blocked the consume processing in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/trace_exporter.go#L97. The cause seems to be that onBackendChanges holds the updateLock for the whole time.
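
For illustration, here is a minimal, self-contained sketch of the contention (not the collector's code, and assuming the consume path only needs a read lock on updateLock, as trace_exporter.go suggests): a goroutine standing in for onBackendChanges holds the write lock while a slow shutdown runs, and a goroutine standing in for ConsumeTraces is blocked even though it only needs the read lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var updateLock sync.RWMutex

	// Stand-in for onBackendChanges: takes the write lock and holds it
	// while a slow exporter Shutdown runs.
	go func() {
		updateLock.Lock()
		defer updateLock.Unlock()
		time.Sleep(2 * time.Second) // slow removeExtraExporters / Shutdown
	}()

	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	// Stand-in for ConsumeTraces: only needs the read lock to pick an
	// endpoint, but is blocked until the writer releases the lock.
	start := time.Now()
	updateLock.RLock()
	updateLock.RUnlock()
	fmt.Printf("consume was blocked for %v\n", time.Since(start))
}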

Can we make the shutdown process async like this?
In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L152,

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
	for existing := range lb.exporters {
		if !endpointFound(existing, endpoints) {
			go lb.exporters[existing].Shutdown(ctx)
			delete(lb.exporters, existing)
		}
	}
}
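
If a fully asynchronous shutdown turns out to be problematic, a hedged middle-ground sketch (not a tested patch, assuming the values in lb.exporters expose Shutdown(ctx) error as in the snippet above, and with the sync package imported) would be to shut the removed exporters down concurrently but still wait for all of them, so the caller keeps its "shutdown completed" guarantee while the wait is bounded by the slowest exporter instead of the sum of all of them:

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
	var wg sync.WaitGroup
	for existing, exp := range lb.exporters {
		if !endpointFound(existing, endpoints) {
			exp := exp // capture the loop variable for the goroutine (pre-Go 1.22 semantics)
			wg.Add(1)
			go func() {
				defer wg.Done()
				_ = exp.Shutdown(ctx)
			}()
			delete(lb.exporters, existing)
		}
	}
	wg.Wait()
}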

What version did you use?
Version: 0.42.0

What config did you use?
Config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4318"

exporters:
  logging:
    logLevel: debug
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 20s
          max_elapsed_time: 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 1000000
    resolver:
      dns:
        hostname: opentelemetry-collector-sampling.otel.svc.cluster.local
        port: 4317

extensions:
  health_check:
    endpoint: "localhost:13134"

service:
  extensions: [ health_check ]
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ loadbalancing ]

Environment
docker image: otel/opentelemetry-collector-contrib:0.42.0

Additional context
Add any other context about the problem here.

@ralphgj ralphgj added the bug label Mar 25, 2022
@jpkrohling
Member

Good catch!

@bogdandrutu, @codeboten, @tigrannajaryan, do you think calling an exporter's shutdown on a separate goroutine would be in line with the function semantics?

@tigrannajaryan
Member

tigrannajaryan commented Mar 25, 2022

Side comment: it is quite confusing that an exporter manages other exporters. We need to improve our pipelines so that this can be a processor.

As for the specifics of the shutdown calls/implementation: I believe Shutdown() itself is implemented incorrectly. It sets a flag and exits, whereas the Shutdown contract says:

If there are any background operations running by the component they must be aborted before
this function returns.
...
If there are any buffers in the
component, they should be cleared and the data sent immediately to the next component.

So, I think the Shutdown itself needs to be fixed first of all.

do you think calling an exporter's shutdown on a separate goroutine would be in line with the function semantics?

If the above contract about Shutdown's behavior is fulfilled, I think it does not matter whether Shutdown is called in a different goroutine. Shutdown is always called from a different goroutine by the Service, so it is not a unique situation. That the caller of Shutdown does not wait until shutdown is complete is a problem, though: it breaks the pipeline shutdown conceptually.

So in a nutshell:

  • Shutdown must block until it is really done and has flushed all buffers.
  • Callers of Shutdown must wait until it completes before returning (or must block their own Shutdown call by whatever means available).
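
For what that contract implies in code, here is a generic, hedged sketch (the names and channels are illustrative, not the collector's queued-retry implementation): Shutdown aborts the background worker, waits until the worker has flushed its buffer, and only then returns, so a caller that waits on Shutdown really observes a fully drained component.

package sketch

import "context"

type bufferedSender struct {
	stopCh chan struct{} // closed by Shutdown to abort background work
	done   chan struct{} // closed by the worker once the buffer is drained
}

func (s *bufferedSender) Shutdown(ctx context.Context) error {
	close(s.stopCh) // abort background operations
	select {
	case <-s.done: // buffered data has been sent to the next component
		return nil
	case <-ctx.Done(): // caller-imposed deadline, if any
		return ctx.Err()
	}
}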

@jpkrohling
Member

Agree with all your points, especially about the load balancing being an exporter that has direct access to other components. Once the collector has the pieces in place that allow for a migration, I'll migrate this component.

Callers of Shutdown must wait until it completes before returning

Thanks, I think this clarifies it.

@ralphgj, you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster. I suspect the backend is offline, which is causing the retry to kick in, eventually timing out.
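
For example (illustrative values only, reusing the fields already present in the config above), a shorter retry window and a smaller queue make a removed backend fail and drain faster during shutdown:

exporters:
  loadbalancing:
    protocol:
      otlp:
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 5s
          max_elapsed_time: 10s   # give up well before the original 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 5000        # a smaller queue drains faster on shutdown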

@ralphgj
Contributor Author

ralphgj commented Mar 29, 2022

Thanks all! @jpkrohling Your suspicion is correct. The problem described above happened when we redeployed the sampling collector (trace data -> load balancing collector -> sampling collector).

you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster.

The retry logic seems intended for the situation where the sampling collector can't be reached, so tweaking the retry logic may not fit this problem.

According to the comments in the OTLP exporter, the shutdown process drains the queue and calls the retry sender. When the queue is long, the shutdown process takes more time.

func (qrs *queuedRetrySender) shutdown() {
	// Cleanup queue metrics reporting
	if qrs.cfg.Enabled {
		_ = globalInstruments.queueSize.UpsertEntry(func() int64 {
			return int64(0)
		}, metricdata.NewLabelValue(qrs.fullName))
	}

	// First Stop the retry goroutines, so that unblocks the queue numWorkers.
	close(qrs.retryStopCh)

	// Stop the queued sender, this will drain the queue and will call the retry (which is stopped) that will only
	// try once every request.
	if qrs.queue != nil {
		qrs.queue.Stop()
	}
}

Could you give us some other suggestions?

@ralphgj
Contributor Author

ralphgj commented Mar 29, 2022

Another question: right now the queue seems to live inside the endpoint exporters, and consuming trace data needs to select an endpoint first, so the consume process can be blocked by the endpoint-selection process. Why not put the trace data into a queue during consumption and then drain it from the queue when the load balancer sends it to the remote?
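
A rough, hedged sketch of what that could look like (illustrative types only, not the collector's API): the consume path only enqueues, and background workers do the endpoint selection and the export, so a slow backend change no longer blocks the caller; the caller only blocks when the queue itself is full.

package sketch

import "context"

type batch struct{} // stand-in for a batch of trace data

type asyncLB struct {
	queue chan batch
}

func newAsyncLB(workers, size int, export func(context.Context, batch) error) *asyncLB {
	lb := &asyncLB{queue: make(chan batch, size)}
	for i := 0; i < workers; i++ {
		go func() {
			for b := range lb.queue {
				// endpoint selection + send happen here, off the hot path
				_ = export(context.Background(), b)
			}
		}()
	}
	return lb
}

// consume is the ConsumeTraces equivalent: it returns as soon as the batch is queued.
func (lb *asyncLB) consume(b batch) error {
	lb.queue <- b
	return nil
}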

@jpkrohling jpkrohling self-assigned this Mar 30, 2022
@jpkrohling
Member

The queued retry component does not know about the load balancer, nor does the load balancer know about the queued retry. Once the load balancer sends the data to the exporter, it only gets control back upon a failure or success from the underlying exporter, which will only return once the retries have been exhausted.

One option would be to have the queued retry used by the load balancer itself in some way, so that people would not use it in the final exporters, but I need to think about the implications of that.

@jpkrohling
Member

I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.

@dgoscn
Contributor

dgoscn commented Sep 26, 2022

@jpkrohling can I take this one?

@jpkrohling jpkrohling assigned dgoscn and unassigned jpkrohling Sep 26, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 28, 2022
@jpkrohling jpkrohling added the never stale label and removed the Stale label Nov 30, 2022
@jpkrohling
Member

@dgoscn, are you still available to work on this one?

@dgoscn
Contributor

dgoscn commented Dec 1, 2022

@jpkrohling I will have to leave this one for someone else. If no one else is assigned, I would like to go back to working on this as soon as possible.
