BYOC: fix orchestrator stream setup when it fails #3849
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@              Coverage Diff               @@
##              master      #3849      +/-  ##
==============================================
+ Coverage   31.93585%  32.25173%  +0.31588%
==============================================
  Files            169        169
  Lines          41217      41235        +18
==============================================
+ Hits           13163      13299       +136
+ Misses         27066      26933       -133
- Partials         988       1003        +15

... and 3 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.
byoc/stream_orchestrator.go (Outdated)

	dataCh.Close()
}
controlPubCh.Close()
eventsCh.Close()
Good change: the publisher channels (pubCh, subCh, dataCh, controlPubCh, and eventsCh) are now closed when the worker returns an error.
Previously, these were only closed when a stream ended successfully via the monitor goroutine. If the worker failed to start the stream, these channels would have leaked.
However, there are a few other conditions in between channel creation and cleanup that could also leak these resources if validation fails.
- https://github.com/muxionlabs/go-livepeer/blob/692ff3f271140eb97724b3c809baa0007ad6d7e6/byoc/stream_orchestrator.go#L139-L144
- https://github.com/muxionlabs/go-livepeer/blob/692ff3f271140eb97724b3c809baa0007ad6d7e6/byoc/stream_orchestrator.go#L146-L152
I would suggest moving channel creation to after the JSON parsing and only sending the request to the worker at the very end; alternatively, a defer block would be a neat way to clean up these resources on failure (see the sketch below).
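A minimal sketch of the defer-based cleanup suggested above; only the channel names come from the diff, while the trickleChannel interface and the helper closures are hypothetical and exist purely to illustrate the pattern.

```go
package byoc

type trickleChannel interface {
	Close() error
}

// startStreamSketch illustrates the suggested ordering: validate first,
// contact the worker last, and close every channel on any error path.
func startStreamSketch(
	parseAndValidate func() error, // JSON parsing / validation steps
	sendToWorker func() error,     // request to the worker, done last
	pubCh, subCh, dataCh, controlPubCh, eventsCh trickleChannel,
) (err error) {
	// If anything below fails, close every channel so the capacity is
	// released immediately instead of waiting for the idle-channel sweep.
	defer func() {
		if err != nil {
			pubCh.Close()
			subCh.Close()
			dataCh.Close()
			controlPubCh.Close()
			eventsCh.Close()
		}
	}()

	if err = parseAndValidate(); err != nil {
		return err
	}
	// Only contact the worker once all validation has passed.
	return sendToWorker()
}
```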
Trickle channels are cleaned up when not used for a period of time (1 minute).
go-livepeer/trickle/trickle_server.go, lines 37 to 38 in 79313ec

// How often to sweep for idle channels (default 1 minute)
SweepInterval time.Duration
This just makes it close faster and releases the capacity.
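For context, a minimal sketch of that idle-sweep behaviour; only the SweepInterval field name and the one-minute default come from trickle_server.go, the bookkeeping maps and run loop are hypothetical.

```go
package trickle

import (
	"sync"
	"time"
)

// idleSweeper is an illustrative sketch of an idle-channel sweep, not the
// actual trickle server implementation.
type idleSweeper struct {
	SweepInterval time.Duration // how often to sweep for idle channels (default 1 minute)

	mu       sync.Mutex
	lastUsed map[string]time.Time // channel name -> last activity
	closers  map[string]func()    // channel name -> cleanup callback
}

func (s *idleSweeper) run() {
	interval := s.SweepInterval
	if interval == 0 {
		interval = time.Minute
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		cutoff := time.Now().Add(-interval)
		s.mu.Lock()
		for name, last := range s.lastUsed {
			if last.Before(cutoff) {
				s.closers[name]() // release the channel and its capacity
				delete(s.lastUsed, name)
				delete(s.closers, name)
			}
		}
		s.mu.Unlock()
	}
}
```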
See 0971f38 for added coverage of other stream start failures
byoc/stream_orchestrator.go (Outdated)

if resp.StatusCode != http.StatusInternalServerError {
	bso.orch.FreeExternalCapabilityCapacity(orchJob.Req.Capability)
}
Why do we only free capacity when the response status code is not 500?
I was thinking we would reserve 500 errors from runners to indicate something is wrong and not free capacity. This would help land on a good Orchestrator faster when there are runners in unrecoverable states.
The runner would need to call /capability/unregister to reset the capacity and then /capability/register to reset availability (sketched below).
We can remove this for now since it may be a little early to add this.
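A hedged sketch of that runner-side reset flow; the /capability/unregister and /capability/register paths come from the comment above, while the orchestrator base URL, payload, and function shape are assumptions.

```go
package runner

import (
	"bytes"
	"fmt"
	"net/http"
)

// resetCapability sketches the flow described above: unregister the
// capability to reset capacity, then register it again to reset
// availability. Only the two endpoint paths come from the comment;
// everything else here is an assumption.
func resetCapability(orchBaseURL string, payload []byte) error {
	for _, path := range []string{"/capability/unregister", "/capability/register"} {
		resp, err := http.Post(orchBaseURL+path, "application/json", bytes.NewReader(payload))
		if err != nil {
			return fmt.Errorf("POST %s: %w", path, err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("POST %s: unexpected status %d", path, resp.StatusCode)
		}
	}
	return nil
}
```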
Sounds reasonable to me; an orchestrator returning a 500 error response is likely having issues. Are there any other mechanisms in place for removing an orchestrator from selection that we could leverage instead of clearing capacity?
This is probably outside the scope of this PR, but I would prefer a probability-scoring metric for successful/failed job requests per capability. Failed start-stream requests would lower the orchestrator's score and reduce its chance of being selected for another job. This could be tracked within the gateway's orch sessions so that the score is reset whenever a new session token is negotiated between gateway and orch, allowing either node to clear the history by being recycled (rough sketch below).
This minimal approach would, in theory, leave room for the cloud metrics team to replace the "probability selection" with gateway-aggregated metrics.
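Purely as an illustration of that idea, a rough sketch of per-capability scoring on the gateway side; all names and the weighting scheme are assumptions, not existing gateway code.

```go
package gateway

import "sync"

// capabilityScore sketches the probability-scoring idea above: each
// orchestrator session tracks successful and failed job starts per
// capability, failures lower its selection weight, and the history is
// dropped when a new session token is negotiated.
type capabilityScore struct {
	mu        sync.Mutex
	successes map[string]float64 // capability -> successful job starts
	failures  map[string]float64 // capability -> failed job starts
}

func newCapabilityScore() *capabilityScore {
	return &capabilityScore{
		successes: make(map[string]float64),
		failures:  make(map[string]float64),
	}
}

func (c *capabilityScore) record(capability string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ok {
		c.successes[capability]++
	} else {
		c.failures[capability]++
	}
}

// weight returns a value in (0, 1) that a weighted selector could use;
// more failures mean a lower chance of being picked for the next job.
func (c *capabilityScore) weight(capability string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, f := c.successes[capability], c.failures[capability]
	return (s + 1) / (s + f + 2) // Laplace-smoothed success ratio
}

// reset clears the history, e.g. when a new session token is negotiated.
func (c *capabilityScore) reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.successes = make(map[string]float64)
	c.failures = make(map[string]float64)
}
```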
I went ahead and removed the special handling of 500 errors; I think we need more usage to determine the correct approach. There are two possible downsides: 1) overpaying for a bad Orchestrator (I will make sure to add a Kafka event/metric to track failures), and 2) a slightly slower stream start, with the possibility of an Orchestrator failing to start the stream. We can look at a more robust system in a separate PR, such as your idea of adding a penalty (e.g. moving to the bottom of the list for a period of time) for Orchestrators that have failed recently.
I think the initial approach would be to use the runner and Docker health checks to signal a runner restart when the runner's /health endpoint reports status: ERROR.
Also added coverage for other stream start failures in 0971f38, with tests.
eliteprox left a comment
lgtm
What does this pull request do? Explain your changes. (required)
Fixes a bug where the Orchestrator did not reset its state to "no stream running" when the worker returned an error code (e.g. 503).
Specific updates (required)
How did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
- make runs successfully
- ./test.sh pass
- byoc package tests ran locally