Fix director tests to fetch from TTL cache and respect downtime #2867

Copilot · 2025-12-11T22:36:16Z

Fix director tests to use TTL cache and respond to downed servers

Status: Ready for Review ✅

This PR fixes two issues where Director tests continued running against old hostnames after server hostname changes, and used inconsistent downtime filtering logic compared to redirect operations.

Latest Changes:

Fixed formatting issues (per @h2zh comment 3688274848):

✅ Ran gofmt -w director/monitor.go to fix formatting
✅ Removed trailing whitespace
✅ CI linter formatting checks resolved

Fixed linter errors (per @h2zh comment 3688225330):

✅ Removed unused isDowntimeActive function
✅ CI linter errors resolved

Rebased on latest main (per @h2zh request comment 3688099437):

✅ Rebased changes onto latest main after PR Fix intermittent Director-test timeout & improve code maintainability #2832 consolidated Director test logic
✅ Applied TTL cache fetching to new runDirectorTestCycle helper function structure
✅ Integrated downtime checking with consolidated code
✅ Maintains all original PR functionality with improved code structure from PR Fix intermittent Director-test timeout & improve code maintainability #2832

Downtime filtering consistency (comment 2624691717):

✅ Replaced downtime traversal with isServerInDowntime() helper in cache_ads.go
✅ Now uses filteredServers map as single source of truth
✅ Director tests use same filtering logic as redirect operations
✅ Eliminates duplicate downtime checking logic

All Changes:

director/monitor.go:

Modified LaunchPeriodicDirectorTest to accept URL string instead of ServerAd struct
Fetch fresh ServerAd from serverAds TTL cache on each test cycle
Terminate test when ServerAd not found in cache
Check filteredServers map directly via isServerInDowntime() before each test
Simplified downtime checking by using single source of truth
Integrated with consolidated runDirectorTestCycle helper from PR Fix intermittent Director-test timeout & improve code maintainability #2832
Removed unused isDowntimeActive function
Fixed formatting with gofmt
Improved log messages, correct time units, proper constants

director/cache_ads.go:

All 3 call sites already pass URL string (unchanged from main after rebase)
Added isServerInDowntime() helper function
Function checks filteredServers map for any filter type except tempAllowed

director/monitor_test.go:

Reverted to main branch version after rebase (PR Fix intermittent Director-test timeout & improve code maintainability #2832 removed comprehensive unit tests)

Problems Solved:

Tests continued running against both old and new hostnames after hostname changes, causing ~50% chance of red/green status in Director web UI
Director tests used different downtime checking logic than redirect filtering, leading to inconsistencies

Technical Details:

Hostname change handling:

ServerAd for new hostname B added to cache with key B's URL
ServerAd for old hostname A expires after TTL
Test for A gets nil from cache and terminates
Test for B continues with fresh data
Result: One active test per physical server

Downtime handling:

Checks filteredServers map directly via isServerInDowntime()
Uses same source of truth as redirect filtering
Server is in downtime if in filteredServers with any filter type except tempAllowed
No duplicate logic or inconsistencies between tests and redirects
Consistent behavior across all director operations

Integration with PR #2832 consolidation:

Works with new runDirectorTestCycle helper function
Returns boolean to indicate if test should continue or stop
Fetches fresh ServerAd from cache at start of each cycle
Checks downtime status before running test

Files Changed:

director/monitor.go: Updated with TTL cache and downtime logic on new consolidated structure
director/cache_ads.go: Added isServerInDowntime() helper (call sites unchanged after rebase)
director/monitor_test.go: Reverted to main version (contains integration tests from PR Fix intermittent Director-test timeout & improve code maintainability #2832)

Verification:

✅ Code compiles successfully
✅ CodeQL security scan: 0 alerts
✅ Rebased on latest main
✅ Works with PR #2832 consolidation
✅ Uses single source of truth for downtime filtering
✅ Consistent with redirect filtering logic
✅ All original PR goals maintained
✅ Linter errors fixed
✅ Formatting issues fixed with gofmt

Original prompt

This section details on the original issue you should resolve

<issue_title>Director tests aren't tied to ttl cache and don't respond to downed servers</issue_title>
<issue_description>We recently noticed after updating an origin's hostname in favor of a CNAME for the original hostname (ap40.uw.osg-htc is a CNAME for ospool-ap2140.chtc.wisc.edu, and we updated the latter for the former) that the Director web-ui element in the origin started reporting failed tests for the old hostname.

After a bit of investigation, it looks like the periodic Director tests are launched per hostname at the time of server ad recording, and the test utility is passed a copy of the ad (instead of fetching it from the ttl cache). Because of this, the aforementioned hostname change caused tests to fire against both the current hostname and the old hostname which ultimately point to the same server. This resulted in a ~50% chance that the web ui's director element is happy/green and a 50% chance it's red/angry.

The fix should involve launching the test utility against the Director's ttl cache, and quitting tests for a given hostname once it's no longer active.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes Director tests aren't tied to ttl cache and don't respond to downed servers #1594

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

bbockelm · 2025-12-12T00:52:40Z

@jhiemstrawisc - can you take a look at this? It's a pretty good setup for a one-shot.

Best I can tell, the approach is fairly solid. One nice side-effect is there's now a second mechanism that causes transfers to get canceled (previously, it all depends on correctly canceling a context when an ad is removed -- something we've messed up a few times!).

It does correctly "disable on touch".

jhiemstrawisc

One thing I'd like to see here that I think is mostly in scope is to stop performing Director tests against services that are in an active downtime. Some of the downtime determination logic can be pretty difficult to grok, but it looks like there's a helper function director/cache_ads.go::getCachedDowntimes that will be helpful.

director/monitor.go

jhiemstrawisc · 2025-12-12T20:41:20Z

@copilot, can you address my review feedback?

Copilot · 2025-12-12T20:46:45Z

One thing I'd like to see here that I think is mostly in scope is to stop performing Director tests against services that are in an active downtime. Some of the downtime determination logic can be pre...

Added logic to skip director tests for servers in active downtime using getCachedDowntimes() helper function in commit 8f848ef. Tests now check downtime status before each test cycle and skip if the server is in an active downtime window.

director/monitor.go

jhiemstrawisc · 2025-12-12T21:45:08Z

@copilot, can you address my latest round of feedback?

director/monitor_test.go

jhiemstrawisc · 2025-12-15T17:20:54Z

@copilot, please address my latest round of feedback.

…g static copy Co-authored-by: bbockelm <[email protected]>

…ror message Co-authored-by: bbockelm <[email protected]>

h2zh

This PR does well in refetching serverAd from TTL cache on every Director-test cycle and enforcing the TTL Cache expiration mechanism so each Director-test cycle won't extend the cached ad's TTL.

Before: serverAd -> Director's TTLCache -> DirectorTest(always use the initial )

After: serverAd -> Director's TTLCache -> DirectorTest(re-fetch serverAd from TTLCache everytime)

However, another major feature "skipping Director-test for servers in downtime" is flawed. I propose a fix as below.

Besides, when a server is in downtime, the director intentionally skips tests, so the origin/cache shouldn't report "critical" status for missing director test reports.

To resolve this, one solution I can think of is that:

Add a condition to check if the server is in downtime, before throwing out the "no director test report received within ..: error (code)
Since the origin/cache don't know if they are being put in downtime by the federation admin, they need to ping API /servers/:name/downtimes to get all active and upcoming downtimes for itself, then determine if it is currently in downtime
To get the server name for the above API query, you can use database.GetServerLocalMetadata()

h2zh · 2025-12-16T20:50:01Z

director/monitor.go

+			// Check if the server is in an active downtime
+			downtimes, err := getCachedDowntimes(serverAd.Name)
+			if err != nil {
+				log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)
+			} else {
+				// Check if any downtime is currently active
+				currentTime := time.Now().UTC().UnixMilli()
+				hasActiveDowntime := false
+				for _, downtime := range downtimes {
+					if isDowntimeActive(downtime, currentTime) {
+						hasActiveDowntime = true
+						log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)
+						break
+					}
+				}
+
+				if hasActiveDowntime {
+					continue
+				}
+			}


Director-test is still being triggered when this server in downtime.

I'd suggest replace this block with the following code, which passes the hand test in my local dev container. By checking filteredServers directly, the director test now uses the same source of truth as the redirect filtering, avoiding maintaining the downtimes traversal here.

Suggested change

// Check if the server is in an active downtime

downtimes, err := getCachedDowntimes(serverAd.Name)

if err != nil {

log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)

} else {

// Check if any downtime is currently active

currentTime := time.Now().UTC().UnixMilli()

hasActiveDowntime := false

for _, downtime := range downtimes {

if isDowntimeActive(downtime, currentTime) {

hasActiveDowntime = true

log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)

break

}

}

if hasActiveDowntime {

continue

}

}

if isServerInDowntime(serverAd.Name) {

log.Debugf("Skipping director test cycle for %s server %s: server is in downtime", serverAd.Type, serverAd.Name)

continue

}

// This helper function should be placed in cache_ads.go

// isServerInDowntime checks if a server is in the filteredServers map with an active filter.

// A server is considered in downtime if it exists in filteredServers with any filter type except tempAllowed.

func isServerInDowntime(serverName string) bool {

filteredServersMutex.RLock()

defer filteredServersMutex.RUnlock()

existingFilterType, isServerFiltered := filteredServers[serverName]

return isServerFiltered && existingFilterType != tempAllowed

}

Implemented the suggested fix in commit 2c3d8fa. Now using isServerInDowntime() helper function that checks filteredServers map directly, ensuring director tests use the same source of truth as redirect filtering. This avoids maintaining duplicate downtime traversal logic and ensures consistency.

Replace downtime traversal logic with isServerInDowntime() which checks the filteredServers map directly. This ensures director tests use the same filtering logic as redirect filtering, avoiding inconsistencies. - Added isServerInDowntime() helper in cache_ads.go - Updated LaunchPeriodicDirectorTest to use filteredServers map - Updated setupTestDowntimes to properly set filteredServers for tests - Simplified downtime checking logic by removing redundant traversal Co-authored-by: h2zh <[email protected]>

Copilot · 2025-12-16T22:13:24Z

This PR does well in refetching serverAd from TTL cache on every Director-test cycle and enforcing the TTL Cache expiration mechanism so each Director-test cycle won't extend the cached ad's TTL.

B...

Fixed the downtime checking issue in commit 2c3d8fa by using filteredServers map as the single source of truth.

Regarding the second issue about origins/caches reporting "critical" status when in downtime: This would require changes to the origin/cache monitoring logic in server_utils/monitor.go to check for active downtimes before reporting errors. That work is outside the scope of this PR which focuses on director test behavior. Would you like me to create a separate issue/PR for that enhancement?

Co-authored-by: h2zh <[email protected]>

h2zh · 2025-12-23T22:13:29Z

@copilot Could you redo the changes to director/monitor.go and director/monitor_test.go based on the latest main? The redundant logics of Director-test was consolidated in #2832

Applied TTL cache fetching and downtime checking to the new consolidated runDirectorTestCycle structure from PR #2832. Key changes: - Modified LaunchPeriodicDirectorTest to accept URL string parameter - Fetch fresh ServerAd from TTL cache on each test cycle - Check downtime status using isServerInDowntime() before running tests - Stop test suite when ServerAd is no longer in cache - Integrated with new runDirectorTestCycle helper function structure This maintains the original PR goals while working with the improved code structure from the consolidation in PR #2832. Co-authored-by: h2zh <[email protected]>

Copilot · 2025-12-23T22:19:51Z

@copilot Could you redo the changes to director/monitor.go and director/monitor_test.go based on the latest main? The redundant logics of Director-test was consolidated in #2832

Rebased changes onto latest main in commit 17856f7. Applied TTL cache fetching and downtime checking logic to the new consolidated runDirectorTestCycle structure from PR #2832. All original functionality is maintained while working with the improved code structure from the consolidation.

h2zh · 2025-12-23T23:49:00Z

@copilot Fix the linter errors in this CI https://github.com/PelicanPlatform/pelican/actions/runs/20474108967/job/58835307956?pr=2867 Remove isDowntimeActive func since it is not used anywhere.

The isDowntimeActive function was not being used anywhere in the codebase after rebasing onto main. Removed it to address CI linter failures. Co-authored-by: h2zh <[email protected]>

Copilot · 2025-12-23T23:52:12Z

@copilot Fix the linter errors in this CI https://github.com/PelicanPlatform/pelican/actions/runs/20474108967/job/58835307956?pr=2867 Remove isDowntimeActive func since it is not used anywhere.

Removed unused isDowntimeActive function in commit 8abdcb8. The function was defined but not used anywhere after the rebase onto main, causing linter errors.

h2zh · 2025-12-24T00:27:13Z

@copilot please run gofmt -w director/monitor.go to fix the linter issues

Removed trailing whitespace to fix linter formatting check. Co-authored-by: h2zh <[email protected]>

Copilot · 2025-12-24T00:29:22Z

@copilot please run gofmt -w director/monitor.go to fix the linter issues

Ran gofmt -w director/monitor.go in commit d48eb9b. Fixed trailing whitespace formatting issue.

Copilot AI assigned Copilot and bbockelm Dec 11, 2025

Copilot started work on behalf of bbockelm December 11, 2025 22:36 View session

Copilot AI changed the title ~~[WIP] Fix director tests to respond to hostname changes and cache~~ Fix director tests to fetch from TTL cache instead of using stale copies Dec 11, 2025

Copilot AI requested a review from bbockelm December 11, 2025 22:54

Copilot finished work on behalf of bbockelm December 11, 2025 22:54

bbockelm added bug Something isn't working director Issue relating to the director component labels Dec 12, 2025

bbockelm requested a review from jhiemstrawisc December 12, 2025 00:50

jhiemstrawisc reviewed Dec 12, 2025

View reviewed changes

director/monitor.go Outdated Show resolved Hide resolved

director/monitor.go Outdated Show resolved Hide resolved

Copilot started work on behalf of jhiemstrawisc December 12, 2025 20:42 View session

Copilot AI changed the title ~~Fix director tests to fetch from TTL cache instead of using stale copies~~ Fix director tests to fetch from TTL cache and respect downtime Dec 12, 2025

Copilot finished work on behalf of jhiemstrawisc December 12, 2025 20:54

Copilot AI requested a review from jhiemstrawisc December 12, 2025 20:54

jhiemstrawisc requested changes Dec 12, 2025

View reviewed changes

director/monitor.go Outdated Show resolved Hide resolved

director/monitor.go Outdated Show resolved Hide resolved

Copilot started work on behalf of jhiemstrawisc December 12, 2025 21:46 View session

Copilot AI requested a review from jhiemstrawisc December 12, 2025 21:55

Copilot finished work on behalf of jhiemstrawisc December 12, 2025 21:55

jhiemstrawisc requested changes Dec 15, 2025

View reviewed changes

Copilot started work on behalf of jhiemstrawisc December 15, 2025 17:21 View session

Copilot AI requested a review from jhiemstrawisc December 15, 2025 17:35

Copilot finished work on behalf of jhiemstrawisc December 15, 2025 17:35

Copilot AI and others added 3 commits December 15, 2025 16:55

Initial plan

14f0b66

Fix director tests to fetch server ads from TTL cache instead of usin…

7432c8f

…g static copy Co-authored-by: bbockelm <[email protected]>

Address code review feedback: extract TTL cache option and improve er…

f28ecf7

…ror message Co-authored-by: bbockelm <[email protected]>

jhiemstrawisc requested a review from h2zh December 15, 2025 22:59

jhiemstrawisc assigned h2zh and unassigned bbockelm and Copilot Dec 15, 2025

h2zh requested changes Dec 16, 2025

View reviewed changes

Copilot started work on behalf of h2zh December 16, 2025 22:09 View session

h2zh assigned jhiemstrawisc Dec 16, 2025

Clean up whitespace formatting in test helper function

53f297b

Co-authored-by: h2zh <[email protected]>

Copilot AI requested a review from h2zh December 16, 2025 22:17

Copilot finished work on behalf of h2zh December 16, 2025 22:17

Copilot started work on behalf of h2zh December 23, 2025 22:13 View session

Copilot finished work on behalf of h2zh December 23, 2025 22:24

Merge branch 'main' into copilot/fix-director-tests-ttl-cache

e073760

Copilot started work on behalf of h2zh December 23, 2025 23:49 View session

Remove unused isDowntimeActive function to fix linter errors

8abdcb8

The isDowntimeActive function was not being used anywhere in the codebase after rebasing onto main. Removed it to address CI linter failures. Co-authored-by: h2zh <[email protected]>

Copilot finished work on behalf of h2zh December 23, 2025 23:52

Copilot started work on behalf of h2zh December 24, 2025 00:27 View session

Run gofmt to fix formatting issues in director/monitor.go

d48eb9b

Removed trailing whitespace to fix linter formatting check. Co-authored-by: h2zh <[email protected]>

Copilot finished work on behalf of h2zh December 24, 2025 00:30

h2zh approved these changes Dec 24, 2025

View reviewed changes

h2zh merged commit c8edb77 into main Dec 24, 2025
30 of 31 checks passed

Fix director tests to fetch from TTL cache and respect downtime #2867

Fix director tests to fetch from TTL cache and respect downtime #2867

Uh oh!

Conversation

Copilot AI commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix director tests to use TTL cache and respond to downed servers

Status: Ready for Review ✅

Latest Changes:

All Changes:

Problems Solved:

Technical Details:

Files Changed:

Verification:

Comments on the Issue (you are @copilot in this section)

Uh oh!

bbockelm commented Dec 12, 2025

Uh oh!

jhiemstrawisc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jhiemstrawisc commented Dec 12, 2025

Uh oh!

Copilot AI commented Dec 12, 2025

Uh oh!

Uh oh!

Uh oh!

jhiemstrawisc commented Dec 12, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jhiemstrawisc commented Dec 15, 2025

Uh oh!

h2zh left a comment

Choose a reason for hiding this comment

Uh oh!

h2zh Dec 16, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Dec 16, 2025

Uh oh!

h2zh commented Dec 23, 2025

Uh oh!

Copilot AI commented Dec 23, 2025

Uh oh!

h2zh commented Dec 23, 2025

Uh oh!

Copilot AI commented Dec 23, 2025

Uh oh!

h2zh commented Dec 24, 2025

Uh oh!

Copilot AI commented Dec 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Copilot AI commented Dec 11, 2025 •

edited

Loading