Fix director's stat feature to no longer "leak" goroutines #2935
Overview
Previously, when the director's "stat" feature encountered a hit to its cache, it would attempt to communicate the result via a channel. Unfortunately, the goroutine sending on the channel was the same goroutine responsible for receiving from it, so it could block on itself whenever the channel's buffer was at capacity. Goroutines might be inexpensive, but pile up enough of them and memory usage becomes noticeable.
Now, we ensure that the channels' buffers have sufficient capacity, so a cache hit can never block the goroutine that will later drain the channel.
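To make the failure mode concrete, here's a minimal, self-contained sketch of the pattern (all names are hypothetical and not taken from the director's code): the receiving goroutine answers cache hits by sending on the very channel it later drains, so an undersized buffer lets it wedge itself, whereas a buffer sized for every possible in-flight result cannot block.

```go
package main

import "fmt"

// statResult and the functions below are hypothetical stand-ins, not the
// director's actual types; they only illustrate the shape of the bug.
type statResult struct{ server string }

// lookup answers from the cache synchronously when it can; otherwise it
// spawns a worker that will send its answer later.
func lookup(obj string, cache map[string]statResult, results chan<- statResult) {
	if res, ok := cache[obj]; ok {
		// Buggy shape: this send runs on the SAME goroutine that will later
		// drain `results`. Once the buffer fills up, the send blocks and the
		// goroutine never reaches its receive loop -- it is stuck on itself.
		results <- res
		return
	}
	go func() {
		// Cache miss: some upstream query happens here.
		results <- statResult{server: "origin-for-" + obj}
	}()
}

func main() {
	cache := map[string]statResult{
		"/foo": {server: "cached-origin"},
		"/bar": {server: "cached-origin"},
	}
	objects := []string{"/foo", "/bar", "/baz"}

	// The fix: size the buffer so that every send that can happen before the
	// receive loop starts (i.e., every cache hit) fits without blocking.
	// With make(chan statResult, 1), the second cache hit above would block
	// this goroutine forever.
	results := make(chan statResult, len(objects))

	for _, obj := range objects {
		lookup(obj, cache, results)
	}
	for range objects {
		fmt.Println(<-results)
	}
}
```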
Note
An alternative solution here is to communicate the result of a stat cache hit in its own goroutine. It's not immediately obvious to me what incurs a greater performance penalty: allocating larger channel buffers, or spawning more goroutines. That said, these hypothetical stat cache hit goroutines would do so little work that it seems silly to make them synchronize on a channel send, which means we would want the channels to be buffered, anyway. And that means we'd need to decide on how large to make the buffers…
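For comparison, that alternative would look roughly like this (again hypothetical, reusing the names from the sketch above). The cache-hit result is delivered from its own goroutine, so the caller can never block on itself, but the channel still wants a buffer or these tiny goroutines will themselves pile up waiting to send:

```go
// Hypothetical variation on lookup from the sketch above: the cache-hit
// send happens on a fresh goroutine instead of the caller's goroutine.
func lookupAlt(obj string, cache map[string]statResult, results chan<- statResult) {
	if res, ok := cache[obj]; ok {
		// The caller can no longer deadlock on itself, but this goroutine
		// will linger if the channel's buffer is full and nobody is reading.
		go func() { results <- res }()
		return
	}
	go func() {
		// Cache miss: some upstream query happens here.
		results <- statResult{server: "origin-for-" + obj}
	}()
}
```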
I'm also taking this opportunity to make log messages more consistent between the various outcomes, moving said log messages to "trace", and fixing what appears to be an issue with double counting in Prometheus metrics.
My Testing
Behold, my new favorite toy: Docker Compose plus a bunch of scripts and config files for starting up a data federation (this isn't substantively different from what I posted on #2928):
Out of the box, this is what I see in Grafana:
We can see that goroutines are no longer exploding in number.
Testing Advice
You can tweak the above framework to force an unusually large number of hits to the stat cache, few or no hits at all, and so on.
By editing `environment.cfg`, you can quickly toggle between an official release of your choice and containers built from this PR, which is useful for validating that you have a scenario that triggers the bug (or not), and that the PR made a difference (if necessary).

It is also very much worth pulling up the director's goroutine profiles (both the aggregated view and the full dump) and searching the output for `stat.go`. The former groups and counts goroutines by their stacks; the latter lists each one separately, along with how long it has been waiting on something. In both cases, you're keeping an eye out for goroutines that might be piling up in `stat.go` but at a rate that's too low to be seen clearly in Grafana (I encountered this scenario while developing this PR…).
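If you want to automate that check, something along these lines works. This is only a rough sketch: the URL is an assumption about where your director exposes Go's standard net/http/pprof handlers, so adjust the host, port, and path to match your deployment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// NOTE: this URL is an assumption -- point it at wherever your director
	// exposes Go's net/http/pprof handlers.
	// debug=2 lists every goroutine separately with its wait time;
	// debug=1 groups and counts goroutines by identical stacks.
	resp, err := http.Get("http://localhost:8444/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Goroutine dumps are separated by blank lines; print only the stacks
	// that pass through stat.go.
	for _, stack := range strings.Split(string(body), "\n\n") {
		if strings.Contains(stack, "stat.go") {
			fmt.Println(stack)
			fmt.Println()
		}
	}
}
```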