BYOC: fix orchestrator stream setup when it fails #3849
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@              Coverage Diff               @@
##              master      #3849      +/-  ##
==============================================
+ Coverage   31.93585%  32.25173%  +0.31588%
==============================================
  Files            169        169
  Lines          41217      41235        +18
==============================================
+ Hits           13163      13299       +136
+ Misses         27066      26933       -133
- Partials         988       1003        +15

... and 3 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.
byoc/stream_orchestrator.go (Outdated)

	dataCh.Close()
}
controlPubCh.Close()
eventsCh.Close()
Good change: the publisher channels (pubCh, subCh, dataCh, controlPubCh, and eventsCh) are now closed when the worker returns an error.
Previously, these were only closed when a stream ended successfully via the monitor goroutine. If the worker failed to start the stream, these channels would have leaked.
However, there are a few other conditions in between channel creation and cleanup that could also leak these resources if validation fails.
- https://github.com/muxionlabs/go-livepeer/blob/692ff3f271140eb97724b3c809baa0007ad6d7e6/byoc/stream_orchestrator.go#L139-L144
- https://github.com/muxionlabs/go-livepeer/blob/692ff3f271140eb97724b3c809baa0007ad6d7e6/byoc/stream_orchestrator.go#L146-L152
I would suggest moving channel creation to after the JSON parsing and only sending the request to the worker at the very end; alternatively, a defer block would be a neat way to clean up these resources on failure (see the sketch below).
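A minimal sketch of the defer-based cleanup suggested above; only the channel names come from the diff, while the trickleChannel interface and the helper closures are hypothetical and exist purely to illustrate the pattern.

```go
package byoc

type trickleChannel interface {
	Close() error
}

// startStreamSketch illustrates the suggested ordering: validate first,
// contact the worker last, and close every channel on any error path.
func startStreamSketch(
	parseAndValidate func() error, // JSON parsing / validation steps
	sendToWorker func() error,     // request to the worker, done last
	pubCh, subCh, dataCh, controlPubCh, eventsCh trickleChannel,
) (err error) {
	// If anything below fails, close every channel so the capacity is
	// released immediately instead of waiting for the idle-channel sweep.
	defer func() {
		if err != nil {
			pubCh.Close()
			subCh.Close()
			dataCh.Close()
			controlPubCh.Close()
			eventsCh.Close()
		}
	}()

	if err = parseAndValidate(); err != nil {
		return err
	}
	// Only contact the worker once all validation has passed.
	return sendToWorker()
}
```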
Trickle channels are cleaned up when not used for a period of time (1 minute).
go-livepeer/trickle/trickle_server.go, lines 37 to 38 in 79313ec

// How often to sweep for idle channels (default 1 minute)
SweepInterval time.Duration
This just makes it close faster and releases the capacity.
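For context, a minimal sketch of that idle-sweep behaviour; only the SweepInterval field name and the one-minute default come from trickle_server.go, the bookkeeping maps and run loop are hypothetical.

```go
package trickle

import (
	"sync"
	"time"
)

// idleSweeper is an illustrative sketch of an idle-channel sweep, not the
// actual trickle server implementation.
type idleSweeper struct {
	SweepInterval time.Duration // how often to sweep for idle channels (default 1 minute)

	mu       sync.Mutex
	lastUsed map[string]time.Time // channel name -> last activity
	closers  map[string]func()    // channel name -> cleanup callback
}

func (s *idleSweeper) run() {
	interval := s.SweepInterval
	if interval == 0 {
		interval = time.Minute
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		cutoff := time.Now().Add(-interval)
		s.mu.Lock()
		for name, last := range s.lastUsed {
			if last.Before(cutoff) {
				s.closers[name]() // release the channel and its capacity
				delete(s.lastUsed, name)
				delete(s.closers, name)
			}
		}
		s.mu.Unlock()
	}
}
```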
See 0971f38 for added coverage of other stream start failures
byoc/stream_orchestrator.go (Outdated)

if resp.StatusCode != http.StatusInternalServerError {
	bso.orch.FreeExternalCapabilityCapacity(orchJob.Req.Capability)
}
Why do we only free capacity when the response status code is not 500?
I was thinking we would reserve 500 errors from runners to indicate something is wrong and not free capacity. This would help land on a good Orchestrator faster when there are runners in unrecoverable states.
The runner would need to call /capability/unregister to reset the capacity and then /capability/register to reset availability (sketched below).
We can remove this for now since it may be a little early to add this.
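A hedged sketch of that runner-side reset flow; the /capability/unregister and /capability/register paths come from the comment above, while the orchestrator base URL, payload, and function shape are assumptions.

```go
package runner

import (
	"bytes"
	"fmt"
	"net/http"
)

// resetCapability sketches the flow described above: unregister the
// capability to reset capacity, then register it again to reset
// availability. Only the two endpoint paths come from the comment;
// everything else here is an assumption.
func resetCapability(orchBaseURL string, payload []byte) error {
	for _, path := range []string{"/capability/unregister", "/capability/register"} {
		resp, err := http.Post(orchBaseURL+path, "application/json", bytes.NewReader(payload))
		if err != nil {
			return fmt.Errorf("POST %s: %w", path, err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("POST %s: unexpected status %d", path, resp.StatusCode)
		}
	}
	return nil
}
```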
Sounds reasonable to me; an orchestrator returning a 500 error response is likely having issues. Are there any other mechanisms in place for removing an orchestrator from selection that we could leverage instead of clearing capacity?
This is probably outside the scope of this PR, but I would prefer a probability-scoring metric for successful/failed job requests per capability. Failed start-stream requests would lower the orchestrator's score and reduce its chance of being selected for another job. This could be tracked within the gateway's orch sessions so that the score is reset whenever a new session token is negotiated between gateway and orch, allowing either node to clear the history by being recycled (rough sketch below).
This minimal approach would, in theory, leave room for the cloud metrics team to replace the "probability selection" with gateway-aggregated metrics.
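Purely as an illustration of that idea, a rough sketch of per-capability scoring on the gateway side; all names and the weighting scheme are assumptions, not existing gateway code.

```go
package gateway

import "sync"

// capabilityScore sketches the probability-scoring idea above: each
// orchestrator session tracks successful and failed job starts per
// capability, failures lower its selection weight, and the history is
// dropped when a new session token is negotiated.
type capabilityScore struct {
	mu        sync.Mutex
	successes map[string]float64 // capability -> successful job starts
	failures  map[string]float64 // capability -> failed job starts
}

func newCapabilityScore() *capabilityScore {
	return &capabilityScore{
		successes: make(map[string]float64),
		failures:  make(map[string]float64),
	}
}

func (c *capabilityScore) record(capability string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ok {
		c.successes[capability]++
	} else {
		c.failures[capability]++
	}
}

// weight returns a value in (0, 1) that a weighted selector could use;
// more failures mean a lower chance of being picked for the next job.
func (c *capabilityScore) weight(capability string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, f := c.successes[capability], c.failures[capability]
	return (s + 1) / (s + f + 2) // Laplace-smoothed success ratio
}

// reset clears the history, e.g. when a new session token is negotiated.
func (c *capabilityScore) reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.successes = make(map[string]float64)
	c.failures = make(map[string]float64)
}
```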
I went ahead and removed the special handling of 500 errors; I think we need more usage to determine the correct approach. There are two possible downsides: 1) overpaying for a bad Orchestrator (I will make sure to add a Kafka event/metric to track failures), and 2) a slightly slower stream start, with the possibility of an Orchestrator failing to start the stream. We can look at a more robust system in a separate PR, such as your idea of adding a penalty (e.g. moving to the bottom of the list for a period of time) for Orchestrators that have failed recently.
I think the initial approach would be to use the runner and Docker health checks to signal a runner restart when the runner's /health endpoint reports status: ERROR.
Also added coverage for other stream start failures in 0971f38, with tests.
eliteprox left a comment
lgtm
What does this pull request do? Explain your changes. (required)
Fixes a bug where the Orchestrator did not reset its state to "no stream running" when the worker returned an error code (e.g. 503).
Specific updates (required)
How did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
- make runs successfully
- ./test.sh pass
- byoc package tests ran locally