loadbalancing exporter blocks consuming traces when loadbalancer backends change #8843

Open
ralphgj opened this issue Mar 25, 2022 · 11 comments
Labels
bug (Something isn't working) · exporter/loadbalancing · never stale (Issues marked with this label will never be staled or automatically removed)

Comments

@ralphgj
Contributor

ralphgj commented Mar 25, 2022

Describe the bug
Hi, we found that the loadbalancing exporter sometimes blocks all requests that upload trace data. After adding some logs, we found that shutting down the endpoint exporters that should be removed takes a long time.

In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L121,

	if !newRing.equal(lb.ring) {
		lb.updateLock.Lock()
		defer lb.updateLock.Unlock()

		lb.ring = newRing

		// TODO: set a timeout?
		ctx := context.Background()

		// add the missing exporters first
		lb.addMissingExporters(ctx, resolved)
		lb.removeExtraExporters(ctx, resolved)
	}

When the loadbalancer backends changed, lb.removeExtraExporters sometimes took more than 10s, which blocked the consume processing in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/trace_exporter.go#L97. The cause seems to be that onBackendChanges holds the updateLock for the whole time.
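
For illustration, here is a minimal, self-contained sketch of the contention (not the collector's code, and assuming the consume path only needs a read lock on updateLock, as trace_exporter.go suggests): a goroutine standing in for onBackendChanges holds the write lock while a slow shutdown runs, and a goroutine standing in for ConsumeTraces is blocked even though it only needs the read lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var updateLock sync.RWMutex

	// Stand-in for onBackendChanges: takes the write lock and holds it
	// while a slow exporter Shutdown runs.
	go func() {
		updateLock.Lock()
		defer updateLock.Unlock()
		time.Sleep(2 * time.Second) // slow removeExtraExporters / Shutdown
	}()

	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	// Stand-in for ConsumeTraces: only needs the read lock to pick an
	// endpoint, but is blocked until the writer releases the lock.
	start := time.Now()
	updateLock.RLock()
	updateLock.RUnlock()
	fmt.Printf("consume was blocked for %v\n", time.Since(start))
}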

Can we make the shutdown process async like this?
In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L152,

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
	for existing := range lb.exporters {
		if !endpointFound(existing, endpoints) {
			go lb.exporters[existing].Shutdown(ctx)
			delete(lb.exporters, existing)
		}
	}
}
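
If a fully asynchronous shutdown turns out to be problematic, a hedged middle-ground sketch (not a tested patch, assuming the values in lb.exporters expose Shutdown(ctx) error as in the snippet above, and with the sync package imported) would be to shut the removed exporters down concurrently but still wait for all of them, so the caller keeps its "shutdown completed" guarantee while the wait is bounded by the slowest exporter instead of the sum of all of them:

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
	var wg sync.WaitGroup
	for existing, exp := range lb.exporters {
		if !endpointFound(existing, endpoints) {
			exp := exp // capture the loop variable for the goroutine (pre-Go 1.22 semantics)
			wg.Add(1)
			go func() {
				defer wg.Done()
				_ = exp.Shutdown(ctx)
			}()
			delete(lb.exporters, existing)
		}
	}
	wg.Wait()
}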

What version did you use?
Version: 0.42.0

What config did you use?
Config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4318"

exporters:
  logging:
    logLevel: debug
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 20s
          max_elapsed_time: 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 1000000
    resolver:
      dns:
        hostname: opentelemetry-collector-sampling.otel.svc.cluster.local
        port: 4317

extensions:
  health_check:
    endpoint: "localhost:13134"

service:
  extensions: [ health_check ]
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ loadbalancing ]

Environment
docker image: otel/opentelemetry-collector-contrib:0.42.0

Additional context
Add any other context about the problem here.

@ralphgj ralphgj added the bug label Mar 25, 2022
@jpkrohling
Member

Good catch!

@bogdandrutu, @codeboten, @tigrannajaryan, do you think calling an exporter's shutdown on a separate goroutine would be in line with the function semantics?

@tigrannajaryan
Member

tigrannajaryan commented Mar 25, 2022

Side comment: it is quite confusing that an exporter manages other exporters. We need to improve our pipelines so that this can be a processor.

As for the specifics of the shutdown calls/implementation: I believe Shutdown() itself is implemented incorrectly. It sets a flag and exits, whereas the Shutdown contract says:

If there are any background operations running by the component they must be aborted before
this function returns.
...
If there are any buffers in the
component, they should be cleared and the data sent immediately to the next component.

So, I think the Shutdown itself needs to be fixed first of all.

do you think calling an exporter's shutdown on a separate goroutine would be in line with the function semantics?

If the above contract about Shutdown's behavior is fulfilled, I think it does not matter whether Shutdown is called in a different goroutine. Shutdown is always called from a different goroutine by the Service, so it is not a unique situation. That the caller of Shutdown does not wait until shutdown is complete is a problem, though: it breaks the pipeline shutdown conceptually.

So in a nutshell:

  • Shutdown must block until it is really done and has flushed all buffers.
  • Callers of Shutdown must wait until it completes before returning (or must block their own Shutdown call by whatever means available).
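
For what that contract implies in code, here is a generic, hedged sketch (the names and channels are illustrative, not the collector's queued-retry implementation): Shutdown aborts the background worker, waits until the worker has flushed its buffer, and only then returns, so a caller that waits on Shutdown really observes a fully drained component.

package sketch

import "context"

type bufferedSender struct {
	stopCh chan struct{} // closed by Shutdown to abort background work
	done   chan struct{} // closed by the worker once the buffer is drained
}

func (s *bufferedSender) Shutdown(ctx context.Context) error {
	close(s.stopCh) // abort background operations
	select {
	case <-s.done: // buffered data has been sent to the next component
		return nil
	case <-ctx.Done(): // caller-imposed deadline, if any
		return ctx.Err()
	}
}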

@jpkrohling
Member

Agree with all your points, especially about the load balancing being an exporter that has direct access to other components. Once the collector has the pieces in place that allow for a migration, I'll migrate this component.

Callers of Shutdown must wait until it completes before returning

Thanks, I think this clarifies it.

@ralphgj, you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster. I suspect the backend is offline, which is causing the retry to kick in, eventually timing out.
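
For example (illustrative values only, reusing the fields already present in the config above), a shorter retry window and a smaller queue make a removed backend fail and drain faster during shutdown:

exporters:
  loadbalancing:
    protocol:
      otlp:
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 5s
          max_elapsed_time: 10s   # give up well before the original 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 5000        # a smaller queue drains faster on shutdown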

@ralphgj
Contributor Author

ralphgj commented Mar 29, 2022

Thanks all! @jpkrohling Your suspicion is correct. The problem described above happened when we redeployed the sampling collector (trace data -> load balancing collector -> sampling collector).

you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster.

The retry logic seems intended for the situation where the sampling collector can't be reached, so tweaking the retry logic may not fit this problem.

According to the comments in the OTLP exporter, the shutdown process drains the queue and calls the retry sender. When the queue is long, the shutdown process takes more time.

func (qrs *queuedRetrySender) shutdown() {
	// Cleanup queue metrics reporting
	if qrs.cfg.Enabled {
		_ = globalInstruments.queueSize.UpsertEntry(func() int64 {
			return int64(0)
		}, metricdata.NewLabelValue(qrs.fullName))
	}

	// First Stop the retry goroutines, so that unblocks the queue numWorkers.
	close(qrs.retryStopCh)

	// Stop the queued sender, this will drain the queue and will call the retry (which is stopped) that will only
	// try once every request.
	if qrs.queue != nil {
		qrs.queue.Stop()
	}
}

Could you give us some other suggestions?

@ralphgj
Contributor Author

ralphgj commented Mar 29, 2022

Another question: right now the queue seems to live inside the endpoint exporters, and consuming trace data needs to select an endpoint first, so the consume process can be blocked by the endpoint-selection process. Why not put the trace data into a queue during consumption and then drain it from the queue when the load balancer sends it to the remote?
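
A rough, hedged sketch of what that could look like (illustrative types only, not the collector's API): the consume path only enqueues, and background workers do the endpoint selection and the export, so a slow backend change no longer blocks the caller; the caller only blocks when the queue itself is full.

package sketch

import "context"

type batch struct{} // stand-in for a batch of trace data

type asyncLB struct {
	queue chan batch
}

func newAsyncLB(workers, size int, export func(context.Context, batch) error) *asyncLB {
	lb := &asyncLB{queue: make(chan batch, size)}
	for i := 0; i < workers; i++ {
		go func() {
			for b := range lb.queue {
				// endpoint selection + send happen here, off the hot path
				_ = export(context.Background(), b)
			}
		}()
	}
	return lb
}

// consume is the ConsumeTraces equivalent: it returns as soon as the batch is queued.
func (lb *asyncLB) consume(b batch) error {
	lb.queue <- b
	return nil
}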

@jpkrohling jpkrohling self-assigned this Mar 30, 2022
@jpkrohling
Member

The queued retry component does not know about the load balancer, nor does the load balancer know about the queued retry. Once the load balancer sends the data to the exporter, it only gets control back upon a failure or success from the underlying exporter, which will only return once the retries have been exhausted.

One option would be to have the queued retry used by the load balancer itself in some way, so that people would not use it in the final exporters, but I need to think about the implications of that.

@jpkrohling
Member

I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.

@dgoscn
Contributor

dgoscn commented Sep 26, 2022

@jpkrohling can I take this one?

@jpkrohling jpkrohling assigned dgoscn and unassigned jpkrohling Sep 26, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 28, 2022
@jpkrohling jpkrohling added the never stale label and removed the Stale label Nov 30, 2022
@jpkrohling
Member

@dgoscn, are you still available to work on this one?

@dgoscn
Contributor

dgoscn commented Dec 1, 2022

@jpkrohling I will have to leave this one for someone else. If no one else is assigned, I would like to go back to working on this as soon as possible.
