
Conversation


@brianaydemir brianaydemir commented Jan 1, 2026

Overview

Previously, when the director's "stat" feature encountered a hit to its cache, it would attempt to communicate the result via a channel. Unfortunately, the goroutine sending on the channel was the same goroutine responsible for receiving from it, so it could block on itself whenever the channel's buffer was at capacity. Goroutines might be inexpensive, but pile up enough of them and memory usage becomes noticeable.

Now, we'll ensure that channels' buffers have sufficient capacity.
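For illustration, here is a minimal, self-contained sketch of the failure mode and the fix. The names (`queryCaches`, `statResult`) are hypothetical and deliberately simplified; the real director code is more involved.

```go
// Hypothetical sketch: one goroutine both sends cache-hit results on a
// channel and, only later, drains that same channel.
package main

import "fmt"

type statResult struct{ host string }

func queryCaches(hosts []string) []statResult {
	// BUG (conceptually): a buffer smaller than the number of sends below
	// means this goroutine blocks on its own channel before it ever reaches
	// the receive loop, and the goroutine leaks.
	// resultCh := make(chan statResult, 1)

	// FIX: give the buffer enough capacity for every send to complete even
	// though the receives happen later in the same goroutine.
	resultCh := make(chan statResult, len(hosts))

	for _, h := range hosts {
		resultCh <- statResult{host: h} // e.g., a cache hit reported inline
	}

	results := make([]statResult, 0, len(hosts))
	for range hosts {
		results = append(results, <-resultCh)
	}
	return results
}

func main() {
	fmt.Println(queryCaches([]string{"cache-1", "cache-2", "cache-3"}))
}
```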

Note

An alternative solution here is to communicate the result of a stat cache hit in its own goroutine. It's not immediately obvious to me what incurs a greater performance penalty: allocating larger channel buffers, or spawning more goroutines. That said, these hypothetical stat cache hit goroutines would do so little work that it seems silly to make them synchronize on a channel send, which means we would want the channels to be buffered, anyway. And that means we'd need to decide on how large to make the buffers…
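For comparison, here is a hypothetical sketch of that alternative (again with made-up names), where each cache hit is reported from its own short-lived goroutine and the sends synchronize on an unbuffered channel; this is the synchronization overhead described above that would push us back toward buffered channels anyway.

```go
// Hypothetical sketch of the alternative: report each cache hit from its own
// goroutine so the sender can never block the goroutine that drains the channel.
package main

import (
	"fmt"
	"sync"
)

type statResult struct{ host string }

func queryCaches(hosts []string) []statResult {
	resultCh := make(chan statResult) // unbuffered: every send synchronizes
	var wg sync.WaitGroup

	for _, h := range hosts {
		wg.Add(1)
		go func(h string) { // one tiny goroutine per cache hit
			defer wg.Done()
			resultCh <- statResult{host: h}
		}(h)
	}

	go func() { // close the channel once all senders have finished
		wg.Wait()
		close(resultCh)
	}()

	results := make([]statResult, 0, len(hosts))
	for r := range resultCh {
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(queryCaches([]string{"cache-1", "cache-2", "cache-3"}))
}
```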

I'm also taking this opportunity to make the log messages more consistent across the various outcomes, to move them to the "trace" level, and to fix what appears to be a double-counting issue in the Prometheus metrics.

My Testing

Behold, my new favorite toy, Docker Compose and a bunch of scripts and config files for starting up a data federation (this isn't substantively different from what I posted on #2928):

Out of the box, this is what I see in Grafana:

[Screenshot of the Grafana dashboard, 2026-01-02 09:57:02]

We can see that goroutines are no longer exploding in number.

Testing Advice

You can tweak the above framework to force an unusually large number of hits to the stat cache, few or no hits to it, and so on.

By editing environment.cfg, you can quickly toggle between an official release of your choice and containers built from this PR, which is useful for validating that you have a scenario that triggers the bug (or not), and that the PR made a difference (if necessary).

It is very much worth pulling up the two goroutine listings and searching the output for stat.go. The former groups and counts goroutines by their stacks; the latter lists each one separately, along with how long it's been waiting on something. In both cases, you're keeping an eye out for goroutines that might be piling up in stat.go at a rate that's too low to be seen clearly in Grafana (I encountered this scenario while developing this PR…).
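If the listings in question are the standard Go pprof goroutine endpoints (their described behavior matches debug=1 versus debug=2), a quick way to keep count is something like the sketch below. The URL is an assumption; point it at wherever your deployment exposes net/http/pprof, if it does.

```go
// Fetch a Go pprof goroutine dump over HTTP and count the lines whose
// stacks mention stat.go. Endpoint and port are assumptions.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// debug=2 lists every goroutine separately, including how long each has
	// been blocked; debug=1 groups identical stacks and counts them.
	url := "http://localhost:6060/debug/pprof/goroutine?debug=2" // assumed endpoint

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching goroutine profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // stack lines can be long
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "stat.go") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading profile:", err)
		os.Exit(1)
	}
	fmt.Printf("lines mentioning stat.go: %d\n", count)
}
```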

@brianaydemir brianaydemir added this to the v7.23 milestone Jan 1, 2026
@brianaydemir brianaydemir added the bug (Something isn't working), critical (High priority for next release), and director (Issue relating to the director component) labels Jan 1, 2026
@brianaydemir brianaydemir linked an issue Jan 1, 2026 that may be closed by this pull request: Director's stat feature leaks goroutines
@brianaydemir brianaydemir marked this pull request as draft January 2, 2026 14:11
@brianaydemir

I revised my framework to make caches usable, which means you can test with more "realistic" configurations such as 2 origins and 6 caches:

It turns out that I was generating certificates badly and that I wasn't giving the caches enough time to actually pull the correct auth info for the namespaces.

I also added token creation, for testing transfers that require authorization.
