Conversation

@ad-astra-video ad-astra-video commented Jan 2, 2026

What does this pull request do? Explain your changes. (required)

Fixes a bug where, if the worker returns an error code (e.g. 503), the Orchestrator did not reset its state back to "no stream running".

Specific updates (required)

  • closes trickle channels immediately
  • restores capacity if the error is non-fatal (i.e. not a 500 status code from the runner)

How did you test each of these updates (required)

Does this pull request close any open issues?

Checklist:

  • Read the contribution guide
  • make runs successfully
  • All tests in ./test.sh pass (byoc package tests ran locally)
  • README and other documentation updated
  • Pending changelog updated

@github-actions github-actions bot added the "go" label (Pull requests that update Go code) on Jan 2, 2026

codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.25173%. Comparing base (06d50a9) to head (7c12361).
⚠️ Report is 2 commits behind head on master.

Files with missing lines       Patch %      Lines
byoc/stream_orchestrator.go    83.33333%    3 Missing ⚠️
Additional details and impacted files

@@                 Coverage Diff                 @@
##              master       #3849         +/-   ##
===================================================
+ Coverage   31.93585%   32.25173%   +0.31588%     
===================================================
  Files            169         169                 
  Lines          41217       41235         +18     
===================================================
+ Hits           13163       13299        +136     
+ Misses         27066       26933        -133     
- Partials         988        1003         +15     
Files with missing lines       Coverage Δ
byoc/stream_orchestrator.go    34.34343% <83.33333%> (+34.34343%) ⬆️

... and 3 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

dataCh.Close()
}
controlPubCh.Close()
eventsCh.Close()

@eliteprox eliteprox Jan 6, 2026

This is good to close the publisher channels (pubCh, subCh, dataCh, controlPubCh, and eventsCh) when the worker returns an error.

Previously, these were only closed when a stream ended successfully via the monitor goroutine. If the worker failed to start the stream, these channels would have leaked.

However, there are a few other conditions in between channel creation and cleanup that could also leak these resources if validation fails.

I would suggest moving channel creation to after JSON parsing and only sending the request to the worker at the very end; alternatively, a defer block for cleanup would be a neat way to release these resources.
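
For illustration, a minimal sketch of the defer-based cleanup idea under the flow described above; the channel names and helper functions here are stand-ins, not the actual go-livepeer types or handler code:

```go
package byocsketch

// closer is a stand-in for the trickle channel types used in the handler.
type closer interface{ Close() }

// startStream sketches the suggestion: register every channel for deferred
// cleanup right after creation, then disarm the cleanup only once the
// worker has accepted the stream (the monitor goroutine owns cleanup from
// that point on).
func startStream(newChannel func(name string) closer, sendToWorker func() error) error {
	channels := []closer{
		newChannel("pubCh"),
		newChannel("subCh"),
		newChannel("dataCh"),
		newChannel("controlPubCh"),
		newChannel("eventsCh"),
	}

	started := false
	defer func() {
		if started {
			return
		}
		// Any early return below (JSON parsing, validation, worker error)
		// lands here and releases the channels immediately.
		for _, ch := range channels {
			ch.Close()
		}
	}()

	// ... JSON parsing and validation would happen here ...

	if err := sendToWorker(); err != nil {
		return err // channels are closed by the deferred cleanup
	}

	started = true
	return nil
}
```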

@ad-astra-video ad-astra-video Jan 6, 2026

Trickle channels are cleaned up when not used for a period of time (1 minute).

// How often to sweep for idle channels (default 1 minute)
SweepInterval time.Duration

This change just closes them sooner and releases the capacity.
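
For context, a rough sketch of how a SweepInterval-driven idle sweep typically works; this is not the actual trickle server implementation, just an illustration of the one-minute cleanup referenced above:

```go
package tricklesketch

import (
	"sync"
	"time"
)

// channelState tracks when a channel was last used and how to close it.
type channelState struct {
	lastUsed time.Time
	close    func()
}

// sweeper periodically closes channels that have been idle longer than
// IdleTimeout. Illustrative only.
type sweeper struct {
	mu            sync.Mutex
	channels      map[string]*channelState
	SweepInterval time.Duration // how often to sweep for idle channels
	IdleTimeout   time.Duration // close channels unused for this long
}

func (s *sweeper) run() {
	ticker := time.NewTicker(s.SweepInterval)
	defer ticker.Stop()
	for range ticker.C {
		s.mu.Lock()
		for name, ch := range s.channels {
			if time.Since(ch.lastUsed) > s.IdleTimeout {
				ch.close()
				delete(s.channels, name)
			}
		}
		s.mu.Unlock()
	}
}
```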

@ad-astra-video ad-astra-video (Author)

See 0971f38 for added coverage of other stream start failures

Comment on lines 165 to 167
if resp.StatusCode != http.StatusInternalServerError {
	bso.orch.FreeExternalCapabilityCapacity(orchJob.Req.Capability)
}

@eliteprox eliteprox Jan 6, 2026

Why do we only free capacity for non-500 error status responses?

@ad-astra-video ad-astra-video (Author)

I was thinking we would reserve 500 errors from runners to indicate something is wrong and not free capacity. This would help land on a good Orchestrator faster when there are runners in unrecoverable states.

The runner would need to make a call to /capability/unregister to reset the capacity and then /capability/register to reset availability.
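
For illustration, a rough sketch of that reset flow from the runner's side; the endpoint paths come from the comment above, but the payload shape, return handling, and any authentication are assumptions:

```go
package runnersketch

import (
	"bytes"
	"fmt"
	"net/http"
)

// resetCapability sketches the reset flow described above: unregister the
// capability to clear capacity, then register it again to restore
// availability. The JSON payload is hypothetical.
func resetCapability(orchURL string, payload []byte) error {
	for _, path := range []string{"/capability/unregister", "/capability/register"} {
		resp, err := http.Post(orchURL+path, "application/json", bytes.NewReader(payload))
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("%s returned status %d", path, resp.StatusCode)
		}
	}
	return nil
}
```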

We can remove this for now since it may be a little early to add this.

@eliteprox eliteprox Jan 6, 2026

Sounds reasonable to me; an orchestrator returning a 500 error response is likely having issues. Are there any other mechanisms in place for removing an orchestrator from selection that we could leverage instead of clearing capacity?

This is probably outside the scope of this PR, but I would prefer a probability scoring metric for successful/failed job requests per capability. Failed start-stream requests would lower the orchestrator's score and reduce its chance of being selected for another job. This could be tracked within the gateway's orch sessions so that when a new session token is negotiated between gateway and orch, the score is reset, allowing the G/O to reset the history by recycling either node.

This minimal approach would, in theory, leave room for the cloud metrics team to replace the "probability selection" with gateway-aggregated metrics.
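
For illustration, a minimal sketch of such a per-capability score kept inside a gateway session; discarding the session state when a new token is negotiated would reset the history. None of these types or method names exist in go-livepeer today:

```go
package selectionsketch

import "sync"

// capScore tracks successful and failed job requests for one capability.
type capScore struct {
	success, failure int
}

// orchSession is a sketch of per-orchestrator session state on the gateway.
type orchSession struct {
	mu     sync.Mutex
	scores map[string]*capScore // keyed by capability name
}

// record notes the outcome of a job request (e.g. a start-stream call).
func (s *orchSession) record(capability string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.scores == nil {
		s.scores = make(map[string]*capScore)
	}
	sc := s.scores[capability]
	if sc == nil {
		sc = &capScore{}
		s.scores[capability] = sc
	}
	if ok {
		sc.success++
	} else {
		sc.failure++
	}
}

// Score returns a selection weight in (0, 1]; unknown capabilities start at 1
// so new orchestrators are not penalized.
func (s *orchSession) Score(capability string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	sc, found := s.scores[capability]
	if !found {
		return 1
	}
	return float64(sc.success+1) / float64(sc.success+sc.failure+1)
}
```

Selection could then weight orchestrators by Score, so recent start-stream failures push an orchestrator toward the bottom of the list without removing it from selection entirely.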

@ad-astra-video ad-astra-video (Author)

I went ahead and removed the special handling of 500 errors. I think we need more usage to determine the correct approach. There are two possible downsides: 1) overpaying for a bad Orchestrator (I will make sure to add a kafka event/metric to track failures) and 2) a slightly slower stream start, with the possibility of an Orchestrator failing to start the stream. We can look at a more robust system in a separate PR, like your idea of including a penalty (e.g. moving to the bottom of the list for a period of time) for Orchestrators that have failed recently.

I think the initial approach would be to use the runner and Docker health checks to signal restarting the runner when the runner's /health endpoint reports status: ERROR.
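
For illustration only, a minimal sketch of that kind of health watcher, assuming a hypothetical /health response of the form {"status": "ERROR"} and a caller-supplied restart hook; none of these names come from the actual runner API:

```go
package healthsketch

import (
	"encoding/json"
	"net/http"
	"time"
)

// healthResp mirrors the assumed shape of the runner's /health response.
type healthResp struct {
	Status string `json:"status"`
}

// watchRunner polls the given health URL and invokes restart when the
// runner reports an ERROR status. Purely illustrative.
func watchRunner(url string, interval time.Duration, restart func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		resp, err := http.Get(url)
		if err != nil {
			continue // treat transient network errors as unknown, not fatal
		}
		var h healthResp
		decodeErr := json.NewDecoder(resp.Body).Decode(&h)
		resp.Body.Close()
		if decodeErr == nil && h.Status == "ERROR" {
			restart() // e.g. trigger a docker restart of the runner container
		}
	}
}
```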

Also added coverage for other stream start failures in 0971f38 with tests.

@eliteprox eliteprox left a comment

lgtm

@ad-astra-video ad-astra-video changed the title from "BYOC: fix orchestrator stream setup when worker returns error" to "BYOC: fix orchestrator stream setup when fails" on Jan 7, 2026
@ad-astra-video ad-astra-video enabled auto-merge (squash) January 7, 2026 00:48
@ad-astra-video ad-astra-video merged commit e78f9c7 into livepeer:master Jan 7, 2026
15 checks passed