Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Dec 11, 2025

Fix director tests to use TTL cache and respond to downed servers

Status: Ready for Review ✅

This PR fixes two issues where Director tests continued running against old hostnames after server hostname changes, and used inconsistent downtime filtering logic compared to redirect operations.

Latest Changes:

Fixed formatting issues (per @h2zh comment 3688274848):

  • ✅ Ran gofmt -w director/monitor.go to fix formatting
  • ✅ Removed trailing whitespace
  • ✅ CI linter formatting checks resolved

Fixed linter errors (per @h2zh comment 3688225330):

  • ✅ Removed unused isDowntimeActive function
  • ✅ CI linter errors resolved

Rebased on latest main (per @h2zh request comment 3688099437):

Downtime filtering consistency (comment 2624691717):

  1. ✅ Replaced downtime traversal with isServerInDowntime() helper in cache_ads.go
  2. ✅ Now uses filteredServers map as single source of truth
  3. ✅ Director tests use same filtering logic as redirect operations
  4. ✅ Eliminates duplicate downtime checking logic

All Changes:

director/monitor.go:

  • Modified LaunchPeriodicDirectorTest to accept URL string instead of ServerAd struct
  • Fetch fresh ServerAd from serverAds TTL cache on each test cycle
  • Terminate test when ServerAd not found in cache
  • Check filteredServers map directly via isServerInDowntime() before each test
  • Simplified downtime checking by using single source of truth
  • Integrated with consolidated runDirectorTestCycle helper from PR Fix intermittent Director-test timeout & improve code maintainability #2832
  • Removed unused isDowntimeActive function
  • Fixed formatting with gofmt
  • Improved log messages, correct time units, proper constants

director/cache_ads.go:

  • All 3 call sites already pass URL string (unchanged from main after rebase)
  • Added isServerInDowntime() helper function
  • Function checks filteredServers map for any filter type except tempAllowed

director/monitor_test.go:

Problems Solved:

  1. Tests continued running against both old and new hostnames after hostname changes, causing ~50% chance of red/green status in Director web UI
  2. Director tests used different downtime checking logic than redirect filtering, leading to inconsistencies

Technical Details:

Hostname change handling:

  1. ServerAd for new hostname B added to cache with key B's URL
  2. ServerAd for old hostname A expires after TTL
  3. Test for A gets nil from cache and terminates
  4. Test for B continues with fresh data
  5. Result: One active test per physical server

Downtime handling:

  • Checks filteredServers map directly via isServerInDowntime()
  • Uses same source of truth as redirect filtering
  • Server is in downtime if in filteredServers with any filter type except tempAllowed
  • No duplicate logic or inconsistencies between tests and redirects
  • Consistent behavior across all director operations

Integration with PR #2832 consolidation:

  • Works with new runDirectorTestCycle helper function
  • Returns boolean to indicate if test should continue or stop
  • Fetches fresh ServerAd from cache at start of each cycle
  • Checks downtime status before running test

Files Changed:

Verification:

✅ Code compiles successfully
✅ CodeQL security scan: 0 alerts
✅ Rebased on latest main
✅ Works with PR #2832 consolidation
✅ Uses single source of truth for downtime filtering
✅ Consistent with redirect filtering logic
✅ All original PR goals maintained
✅ Linter errors fixed
✅ Formatting issues fixed with gofmt

Original prompt

This section details on the original issue you should resolve

<issue_title>Director tests aren't tied to ttl cache and don't respond to downed servers</issue_title>
<issue_description>We recently noticed after updating an origin's hostname in favor of a CNAME for the original hostname (ap40.uw.osg-htc is a CNAME for ospool-ap2140.chtc.wisc.edu, and we updated the latter for the former) that the Director web-ui element in the origin started reporting failed tests for the old hostname.

After a bit of investigation, it looks like the periodic Director tests are launched per hostname at the time of server ad recording, and the test utility is passed a copy of the ad (instead of fetching it from the ttl cache). Because of this, the aforementioned hostname change caused tests to fire against both the current hostname and the old hostname which ultimately point to the same server. This resulted in a ~50% chance that the web ui's director element is happy/green and a 50% chance it's red/angry.

The fix should involve launching the test utility against the Director's ttl cache, and quitting tests for a given hostname once it's no longer active.</issue_description>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI changed the title [WIP] Fix director tests to respond to hostname changes and cache Fix director tests to fetch from TTL cache instead of using stale copies Dec 11, 2025
Copilot AI requested a review from bbockelm December 11, 2025 22:54
@bbockelm bbockelm added bug Something isn't working director Issue relating to the director component labels Dec 12, 2025
@bbockelm
Copy link
Collaborator

@jhiemstrawisc - can you take a look at this? It's a pretty good setup for a one-shot.

Best I can tell, the approach is fairly solid. One nice side-effect is there's now a second mechanism that causes transfers to get canceled (previously, it all depends on correctly canceling a context when an ad is removed -- something we've messed up a few times!).

It does correctly "disable on touch".

Copy link
Member

@jhiemstrawisc jhiemstrawisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd like to see here that I think is mostly in scope is to stop performing Director tests against services that are in an active downtime. Some of the downtime determination logic can be pretty difficult to grok, but it looks like there's a helper function director/cache_ads.go::getCachedDowntimes that will be helpful.

@jhiemstrawisc
Copy link
Member

@copilot, can you address my review feedback?

Copy link
Contributor Author

Copilot AI commented Dec 12, 2025

One thing I'd like to see here that I think is mostly in scope is to stop performing Director tests against services that are in an active downtime. Some of the downtime determination logic can be pre...

Added logic to skip director tests for servers in active downtime using getCachedDowntimes() helper function in commit 8f848ef. Tests now check downtime status before each test cycle and skip if the server is in an active downtime window.

Copilot AI changed the title Fix director tests to fetch from TTL cache instead of using stale copies Fix director tests to fetch from TTL cache and respect downtime Dec 12, 2025
Copilot AI requested a review from jhiemstrawisc December 12, 2025 20:54
@jhiemstrawisc
Copy link
Member

@copilot, can you address my latest round of feedback?

@jhiemstrawisc
Copy link
Member

@copilot, please address my latest round of feedback.

@jhiemstrawisc jhiemstrawisc requested a review from h2zh December 15, 2025 22:59
@jhiemstrawisc jhiemstrawisc assigned h2zh and unassigned bbockelm and Copilot Dec 15, 2025
Copy link
Contributor

@h2zh h2zh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does well in refetching serverAd from TTL cache on every Director-test cycle and enforcing the TTL Cache expiration mechanism so each Director-test cycle won't extend the cached ad's TTL.

Before: serverAd -> Director's TTLCache -> DirectorTest(always use the initial )

After: serverAd -> Director's TTLCache -> DirectorTest(re-fetch serverAd from TTLCache everytime)

However, another major feature "skipping Director-test for servers in downtime" is flawed. I propose a fix as below.

Besides, when a server is in downtime, the director intentionally skips tests, so the origin/cache shouldn't report "critical" status for missing director test reports.
image

To resolve this, one solution I can think of is that:

  • Add a condition to check if the server is in downtime, before throwing out the "no director test report received within ..: error (code)
  • Since the origin/cache don't know if they are being put in downtime by the federation admin, they need to ping API /servers/:name/downtimes to get all active and upcoming downtimes for itself, then determine if it is currently in downtime
  • To get the server name for the above API query, you can use database.GetServerLocalMetadata()

Comment on lines 205 to 224
// Check if the server is in an active downtime
downtimes, err := getCachedDowntimes(serverAd.Name)
if err != nil {
log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)
} else {
// Check if any downtime is currently active
currentTime := time.Now().UTC().UnixMilli()
hasActiveDowntime := false
for _, downtime := range downtimes {
if isDowntimeActive(downtime, currentTime) {
hasActiveDowntime = true
log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)
break
}
}

if hasActiveDowntime {
continue
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Director-test is still being triggered when this server in downtime.

I'd suggest replace this block with the following code, which passes the hand test in my local dev container. By checking filteredServers directly, the director test now uses the same source of truth as the redirect filtering, avoiding maintaining the downtimes traversal here.

Suggested change
// Check if the server is in an active downtime
downtimes, err := getCachedDowntimes(serverAd.Name)
if err != nil {
log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)
} else {
// Check if any downtime is currently active
currentTime := time.Now().UTC().UnixMilli()
hasActiveDowntime := false
for _, downtime := range downtimes {
if isDowntimeActive(downtime, currentTime) {
hasActiveDowntime = true
log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)
break
}
}
if hasActiveDowntime {
continue
}
}
if isServerInDowntime(serverAd.Name) {
log.Debugf("Skipping director test cycle for %s server %s: server is in downtime", serverAd.Type, serverAd.Name)
continue
}
// This helper function should be placed in cache_ads.go
// isServerInDowntime checks if a server is in the filteredServers map with an active filter.
// A server is considered in downtime if it exists in filteredServers with any filter type except tempAllowed.
func isServerInDowntime(serverName string) bool {
filteredServersMutex.RLock()
defer filteredServersMutex.RUnlock()
existingFilterType, isServerFiltered := filteredServers[serverName]
return isServerFiltered && existingFilterType != tempAllowed
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the suggested fix in commit 2c3d8fa. Now using isServerInDowntime() helper function that checks filteredServers map directly, ensuring director tests use the same source of truth as redirect filtering. This avoids maintaining duplicate downtime traversal logic and ensures consistency.

Replace downtime traversal logic with isServerInDowntime() which checks
the filteredServers map directly. This ensures director tests use the
same filtering logic as redirect filtering, avoiding inconsistencies.

- Added isServerInDowntime() helper in cache_ads.go
- Updated LaunchPeriodicDirectorTest to use filteredServers map
- Updated setupTestDowntimes to properly set filteredServers for tests
- Simplified downtime checking logic by removing redundant traversal

Co-authored-by: h2zh <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 16, 2025

This PR does well in refetching serverAd from TTL cache on every Director-test cycle and enforcing the TTL Cache expiration mechanism so each Director-test cycle won't extend the cached ad's TTL.

B...

Fixed the downtime checking issue in commit 2c3d8fa by using filteredServers map as the single source of truth.

Regarding the second issue about origins/caches reporting "critical" status when in downtime: This would require changes to the origin/cache monitoring logic in server_utils/monitor.go to check for active downtimes before reporting errors. That work is outside the scope of this PR which focuses on director test behavior. Would you like me to create a separate issue/PR for that enhancement?

Copilot AI requested a review from h2zh December 16, 2025 22:17
@h2zh
Copy link
Contributor

h2zh commented Dec 23, 2025

@copilot Could you redo the changes to director/monitor.go and director/monitor_test.go based on the latest main? The redundant logics of Director-test was consolidated in #2832

Applied TTL cache fetching and downtime checking to the new consolidated
runDirectorTestCycle structure from PR #2832. Key changes:

- Modified LaunchPeriodicDirectorTest to accept URL string parameter
- Fetch fresh ServerAd from TTL cache on each test cycle
- Check downtime status using isServerInDowntime() before running tests
- Stop test suite when ServerAd is no longer in cache
- Integrated with new runDirectorTestCycle helper function structure

This maintains the original PR goals while working with the improved
code structure from the consolidation in PR #2832.

Co-authored-by: h2zh <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 23, 2025

@copilot Could you redo the changes to director/monitor.go and director/monitor_test.go based on the latest main? The redundant logics of Director-test was consolidated in #2832

Rebased changes onto latest main in commit 17856f7. Applied TTL cache fetching and downtime checking logic to the new consolidated runDirectorTestCycle structure from PR #2832. All original functionality is maintained while working with the improved code structure from the consolidation.

@h2zh
Copy link
Contributor

h2zh commented Dec 23, 2025

@copilot Fix the linter errors in this CI https://github.com/PelicanPlatform/pelican/actions/runs/20474108967/job/58835307956?pr=2867 Remove isDowntimeActive func since it is not used anywhere.

The isDowntimeActive function was not being used anywhere in the codebase
after rebasing onto main. Removed it to address CI linter failures.

Co-authored-by: h2zh <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 23, 2025

@copilot Fix the linter errors in this CI https://github.com/PelicanPlatform/pelican/actions/runs/20474108967/job/58835307956?pr=2867 Remove isDowntimeActive func since it is not used anywhere.

Removed unused isDowntimeActive function in commit 8abdcb8. The function was defined but not used anywhere after the rebase onto main, causing linter errors.

@h2zh
Copy link
Contributor

h2zh commented Dec 24, 2025

@copilot please run gofmt -w director/monitor.go to fix the linter issues

Removed trailing whitespace to fix linter formatting check.

Co-authored-by: h2zh <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 24, 2025

@copilot please run gofmt -w director/monitor.go to fix the linter issues

Ran gofmt -w director/monitor.go in commit d48eb9b. Fixed trailing whitespace formatting issue.

@h2zh h2zh merged commit c8edb77 into main Dec 24, 2025
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working director Issue relating to the director component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Director tests aren't tied to ttl cache and don't respond to downed servers

4 participants