Fix director's stat feature to no longer "leak" goroutines #2935
Overview
Previously, when the director's "stat" feature encountered a hit to its cache, it would attempt to communicate the result via a channel. Unfortunately, the goroutine sending on the channel was the same goroutine responsible for receiving from it, so it could block on itself whenever the channel's buffer was at capacity. Goroutines might be inexpensive, but pile up enough of them and memory usage becomes noticeable.
Now, we ensure that the channels' buffers have sufficient capacity, so a cache hit can never block the goroutine that will later drain the channel.
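To make the failure mode concrete, here's a minimal, self-contained sketch of the pattern (all names are hypothetical and not taken from the director's code): the receiving goroutine answers cache hits by sending on the very channel it later drains, so an undersized buffer lets it wedge itself, whereas a buffer sized for every possible in-flight result cannot block.

```go
package main

import "fmt"

// statResult and the functions below are hypothetical stand-ins, not the
// director's actual types; they only illustrate the shape of the bug.
type statResult struct{ server string }

// lookup answers from the cache synchronously when it can; otherwise it
// spawns a worker that will send its answer later.
func lookup(obj string, cache map[string]statResult, results chan<- statResult) {
	if res, ok := cache[obj]; ok {
		// Buggy shape: this send runs on the SAME goroutine that will later
		// drain `results`. Once the buffer fills up, the send blocks and the
		// goroutine never reaches its receive loop -- it is stuck on itself.
		results <- res
		return
	}
	go func() {
		// Cache miss: some upstream query happens here.
		results <- statResult{server: "origin-for-" + obj}
	}()
}

func main() {
	cache := map[string]statResult{
		"/foo": {server: "cached-origin"},
		"/bar": {server: "cached-origin"},
	}
	objects := []string{"/foo", "/bar", "/baz"}

	// The fix: size the buffer so that every send that can happen before the
	// receive loop starts (i.e., every cache hit) fits without blocking.
	// With make(chan statResult, 1), the second cache hit above would block
	// this goroutine forever.
	results := make(chan statResult, len(objects))

	for _, obj := range objects {
		lookup(obj, cache, results)
	}
	for range objects {
		fmt.Println(<-results)
	}
}
```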
Note
An alternative solution here is to communicate the result of a stat cache hit in its own goroutine. It's not immediately obvious to me what incurs a greater performance penalty: allocating larger channel buffers, or spawning more goroutines. That said, these hypothetical stat cache hit goroutines would do so little work that it seems silly to make them synchronize on a channel send, which means we would want the channels to be buffered, anyway. And that means we'd need to decide on how large to make the buffers…
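For comparison, that alternative would look roughly like this (again hypothetical, reusing the names from the sketch above). The cache-hit result is delivered from its own goroutine, so the caller can never block on itself, but the channel still wants a buffer or these tiny goroutines will themselves pile up waiting to send:

```go
// Hypothetical variation on lookup from the sketch above: the cache-hit
// send happens on a fresh goroutine instead of the caller's goroutine.
func lookupAlt(obj string, cache map[string]statResult, results chan<- statResult) {
	if res, ok := cache[obj]; ok {
		// The caller can no longer deadlock on itself, but this goroutine
		// will linger if the channel's buffer is full and nobody is reading.
		go func() { results <- res }()
		return
	}
	go func() {
		// Cache miss: some upstream query happens here.
		results <- statResult{server: "origin-for-" + obj}
	}()
}
```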
I'm also taking this opportunity to make log messages more consistent between the various outcomes, moving said log messages to "trace", and fixing what appears to be an issue with double counting in Prometheus metrics.
My Testing
Behold, my new favorite toy: Docker Compose plus a bunch of scripts and config files for starting up a data federation (this isn't substantively different from what I posted on #2928):
Out of the box, this is what I see in Grafana:
We can see that goroutines are no longer exploding in number.
Testing Advice
You can tweak the above framework to force an unusually large number of hits to the stat cache, few or no hits at all, and so on.
By editing `environment.cfg`, you can quickly toggle between an official release of your choice and containers built from this PR, which is useful for validating that you have a scenario that triggers the bug (or not), and that the PR made a difference (if necessary).

It is also very much worth pulling up the director's goroutine profiles (both the aggregated view and the full dump) and searching the output for `stat.go`. The former groups and counts goroutines by their stacks; the latter lists each one separately, along with how long it has been waiting on something. In both cases, you're keeping an eye out for goroutines that might be piling up in `stat.go` but at a rate that's too low to be seen clearly in Grafana (I encountered this scenario while developing this PR…).
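If you want to automate that check, something along these lines works. This is only a rough sketch: the URL is an assumption about where your director exposes Go's standard net/http/pprof handlers, so adjust the host, port, and path to match your deployment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// NOTE: this URL is an assumption -- point it at wherever your director
	// exposes Go's net/http/pprof handlers.
	// debug=2 lists every goroutine separately with its wait time;
	// debug=1 groups and counts goroutines by identical stacks.
	resp, err := http.Get("http://localhost:8444/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Goroutine dumps are separated by blank lines; print only the stacks
	// that pass through stat.go.
	for _, stack := range strings.Split(string(body), "\n\n") {
		if strings.Contains(stack, "stat.go") {
			fmt.Println(stack)
			fmt.Println()
		}
	}
}
```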