Fix director tests to fetch from TTL cache and respect downtime #2867
@jhiemstrawisc - can you take a look at this? It's a pretty good setup for a one-shot. Best I can tell, the approach is fairly solid. One nice side-effect is there's now a second mechanism that causes transfers to get canceled (previously, it all depended on correctly canceling a context when an ad was removed -- something we've messed up a few times!). It does correctly "disable on touch".
jhiemstrawisc
left a comment
One thing I'd like to see here that I think is mostly in scope is to stop performing Director tests against services that are in an active downtime. Some of the downtime determination logic can be pretty difficult to grok, but it looks like there's a helper function director/cache_ads.go::getCachedDowntimes that will be helpful.
@copilot, can you address my review feedback?
Added logic to skip director tests for servers in active downtime using |
@copilot, can you address my latest round of feedback?
@copilot, please address my latest round of feedback.
…g static copy Co-authored-by: bbockelm <[email protected]>
…ror message Co-authored-by: bbockelm <[email protected]>
h2zh
left a comment
This PR does well at re-fetching serverAd from the TTL cache on every Director-test cycle and at enforcing the TTL cache's expiration mechanism, so a Director-test cycle no longer extends the cached ad's TTL.
Before: serverAd -> Director's TTLCache -> DirectorTest (always uses the initial serverAd)
After: serverAd -> Director's TTLCache -> DirectorTest (re-fetches serverAd from TTLCache every time)
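The before/after difference can be sketched with a toy cache. This is only an illustration under stated assumptions: `miniTTLCache`, its methods, the 15-minute TTL, and the 10-minute test interval are all hypothetical and are not the Director's actual cache or settings. It shows why reads that refresh the expiry keep a stale ad alive indefinitely, while reads that leave the expiry alone let the ad lapse once the server stops advertising.

```go
package main

import (
	"fmt"
	"time"
)

// entry is one cached value plus its absolute expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// miniTTLCache is a hypothetical stand-in for a TTL cache of server ads.
type miniTTLCache struct {
	ttl   time.Duration
	items map[string]entry
}

func newMiniTTLCache(ttl time.Duration) *miniTTLCache {
	return &miniTTLCache{ttl: ttl, items: make(map[string]entry)}
}

// set stores value under key, expiring ttl after now.
func (c *miniTTLCache) set(key, value string, now time.Time) {
	c.items[key] = entry{value: value, expiresAt: now.Add(c.ttl)}
}

// get returns the live value for key. When touch is true, a hit pushes
// the expiry forward; when false, the original expiry is preserved.
func (c *miniTTLCache) get(key string, now time.Time, touch bool) (string, bool) {
	e, ok := c.items[key]
	if !ok || !now.Before(e.expiresAt) {
		delete(c.items, key)
		return "", false
	}
	if touch {
		e.expiresAt = now.Add(c.ttl)
		c.items[key] = e
	}
	return e.value, true
}

// cyclesUntilExpiry counts how many test cycles (one every 10 minutes)
// still find an ad that was advertised exactly once, up to maxCycles.
func cyclesUntilExpiry(touch bool, maxCycles int) int {
	const url = "https://origin.example.com" // hypothetical server URL
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	cache := newMiniTTLCache(15 * time.Minute)
	cache.set(url, "ad-v1", start)

	seen := 0
	for i := 1; i <= maxCycles; i++ {
		now := start.Add(time.Duration(i) * 10 * time.Minute)
		if _, ok := cache.get(url, now, touch); !ok {
			break
		}
		seen++
	}
	return seen
}

func main() {
	fmt.Println("cycles seen with touch-on-hit:   ", cyclesUntilExpiry(true, 6))
	fmt.Println("cycles seen without touch-on-hit:", cyclesUntilExpiry(false, 6))
}
```

With touch-on-hit the ad survives every cycle even though it was never re-advertised; without it the ad disappears after its original TTL, which is the behavior the test loop relies on to notice a removed or renamed server.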
However, another major feature, "skipping Director-test for servers in downtime", is flawed. I propose a fix below.
Besides, when a server is in downtime, the director intentionally skips tests, so the origin/cache shouldn't report "critical" status for missing director test reports.

To resolve this, one solution I can think of is:
- Add a condition to check whether the server is in downtime before raising the "no director test report received within .." error (code).
- Since the origin/cache doesn't know whether it has been put in downtime by the federation admin, it needs to query the API /servers/:name/downtimes to get all active and upcoming downtimes for itself, then determine whether it is currently in downtime.
- To get the server name for the above API query, you can use database.GetServerLocalMetadata().
director/monitor.go
Outdated
// Check if the server is in an active downtime
downtimes, err := getCachedDowntimes(serverAd.Name)
if err != nil {
    log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)
} else {
    // Check if any downtime is currently active
    currentTime := time.Now().UTC().UnixMilli()
    hasActiveDowntime := false
    for _, downtime := range downtimes {
        if isDowntimeActive(downtime, currentTime) {
            hasActiveDowntime = true
            log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)
            break
        }
    }

    if hasActiveDowntime {
        continue
    }
}
Director-test is still being triggered when this server is in downtime.
I'd suggest replacing this block with the following code, which passes a hand test in my local dev container. By checking filteredServers directly, the director test uses the same source of truth as the redirect filtering, so the downtime traversal no longer has to be maintained here.
- // Check if the server is in an active downtime
- downtimes, err := getCachedDowntimes(serverAd.Name)
- if err != nil {
-     log.Warningf("Failed to get cached downtimes for server %s: %v. Proceeding with director test.", serverAd.Name, err)
- } else {
-     // Check if any downtime is currently active
-     currentTime := time.Now().UTC().UnixMilli()
-     hasActiveDowntime := false
-     for _, downtime := range downtimes {
-         if isDowntimeActive(downtime, currentTime) {
-             hasActiveDowntime = true
-             log.Debugf("Skipping director test cycle for %s server %s: server is in active downtime", serverAd.Type, serverAd.Name)
-             break
-         }
-     }
-     if hasActiveDowntime {
-         continue
-     }
- }
+ if isServerInDowntime(serverAd.Name) {
+     log.Debugf("Skipping director test cycle for %s server %s: server is in downtime", serverAd.Type, serverAd.Name)
+     continue
+ }
+ // This helper function should be placed in cache_ads.go
+ // isServerInDowntime checks if a server is in the filteredServers map with an active filter.
+ // A server is considered in downtime if it exists in filteredServers with any filter type except tempAllowed.
+ func isServerInDowntime(serverName string) bool {
+     filteredServersMutex.RLock()
+     defer filteredServersMutex.RUnlock()
+     existingFilterType, isServerFiltered := filteredServers[serverName]
+     return isServerFiltered && existingFilterType != tempAllowed
+ }
Implemented the suggested fix in commit 2c3d8fa. Now using isServerInDowntime() helper function that checks filteredServers map directly, ensuring director tests use the same source of truth as redirect filtering. This avoids maintaining duplicate downtime traversal logic and ensures consistency.
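For readers without the repository at hand, here is a self-contained sketch of how such a helper behaves. The function body follows the suggestion above; the surrounding declarations are assumptions made so the example compiles on its own: `filterType` being a string type, the `tempFiltered` value, and the server names are hypothetical, and only `tempAllowed`, `filteredServers`, and `filteredServersMutex` are names taken from the thread.

```go
package main

import (
	"fmt"
	"sync"
)

// filterType is ASSUMED to be a string-based type for this sketch.
type filterType string

const (
	tempAllowed  filterType = "tempAllowed"  // name taken from the review suggestion
	tempFiltered filterType = "tempFiltered" // hypothetical "server is filtered" value
)

var (
	// filteredServersMutex guards filteredServers.
	filteredServersMutex sync.RWMutex
	// filteredServers maps a server name to the filter applied to it.
	filteredServers = map[string]filterType{}
)

// isServerInDowntime reports whether the server has an entry in
// filteredServers with any filter type other than tempAllowed.
func isServerInDowntime(serverName string) bool {
	filteredServersMutex.RLock()
	defer filteredServersMutex.RUnlock()
	existingFilterType, isServerFiltered := filteredServers[serverName]
	return isServerFiltered && existingFilterType != tempAllowed
}

func main() {
	filteredServersMutex.Lock()
	filteredServers["origin-a"] = tempFiltered
	filteredServers["origin-b"] = tempAllowed
	filteredServersMutex.Unlock()

	for _, name := range []string{"origin-a", "origin-b", "origin-c"} {
		fmt.Printf("%s in downtime: %v\n", name, isServerInDowntime(name))
	}
}
```

Only `origin-a` counts as being in downtime: `origin-b` is explicitly allowed and `origin-c` has no entry at all.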
Replace downtime traversal logic with isServerInDowntime() which checks the filteredServers map directly. This ensures director tests use the same filtering logic as redirect filtering, avoiding inconsistencies.
- Added isServerInDowntime() helper in cache_ads.go
- Updated LaunchPeriodicDirectorTest to use filteredServers map
- Updated setupTestDowntimes to properly set filteredServers for tests
- Simplified downtime checking logic by removing redundant traversal
Co-authored-by: h2zh <[email protected]>
Fixed the downtime checking issue in commit 2c3d8fa by using Regarding the second issue about origins/caches reporting "critical" status when in downtime: This would require changes to the origin/cache monitoring logic in |
Co-authored-by: h2zh <[email protected]>
Applied TTL cache fetching and downtime checking to the new consolidated runDirectorTestCycle structure from PR #2832. Key changes:
- Modified LaunchPeriodicDirectorTest to accept URL string parameter
- Fetch fresh ServerAd from TTL cache on each test cycle
- Check downtime status using isServerInDowntime() before running tests
- Stop test suite when ServerAd is no longer in cache
- Integrated with new runDirectorTestCycle helper function structure
This maintains the original PR goals while working with the improved code structure from the consolidation in PR #2832.
Co-authored-by: h2zh <[email protected]>
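The per-cycle flow listed in this commit message can be sketched as a single decision function. This is not the code from director/monitor.go: `serverAd`, `adLookup`, `cycleOutcome`, and `runOneCycle` are hypothetical names, and in the real code a ticker would drive the cycles. The sketch only shows the order of checks: look the ad up by URL, stop if it is gone, skip if the server is in downtime, otherwise test.

```go
package main

import "fmt"

// serverAd is a hypothetical, trimmed-down server advertisement.
type serverAd struct {
	Name string
	URL  string
}

// adLookup returns the current ad for a URL, or nil once it has
// expired or been removed from the cache.
type adLookup func(url string) *serverAd

// cycleOutcome describes what one test cycle did.
type cycleOutcome string

const (
	outcomeTested  cycleOutcome = "tested"
	outcomeSkipped cycleOutcome = "skipped (downtime)"
	outcomeStopped cycleOutcome = "stopped (ad gone)"
)

// runOneCycle re-fetches the ad by URL, then decides whether to stop
// the test suite, skip this cycle, or run the test against the fresh ad.
func runOneCycle(url string, lookup adLookup, inDowntime func(name string) bool, test func(ad *serverAd)) cycleOutcome {
	ad := lookup(url)
	if ad == nil {
		return outcomeStopped
	}
	if inDowntime(ad.Name) {
		return outcomeSkipped
	}
	test(ad)
	return outcomeTested
}

func main() {
	const url = "https://origin.example.com" // hypothetical server URL
	ads := map[string]*serverAd{url: {Name: "origin-1", URL: url}}
	down := map[string]bool{}

	lookup := func(u string) *serverAd { return ads[u] }
	inDowntime := func(name string) bool { return down[name] }
	test := func(ad *serverAd) { fmt.Println("testing", ad.URL) }

	fmt.Println(runOneCycle(url, lookup, inDowntime, test)) // ad present, no downtime
	down["origin-1"] = true
	fmt.Println(runOneCycle(url, lookup, inDowntime, test)) // server put in downtime
	delete(ads, url)
	fmt.Println(runOneCycle(url, lookup, inDowntime, test)) // ad expired or hostname changed
}
```

Keying the lookup on the URL rather than holding on to the original ad is what lets the loop end on its own after a hostname change: the old URL simply stops resolving to an ad.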
Rebased changes onto latest main in commit 17856f7. Applied TTL cache fetching and downtime checking logic to the new consolidated |
@copilot Fix the linter errors in this CI https://github.com/PelicanPlatform/pelican/actions/runs/20474108967/job/58835307956?pr=2867 Remove |
The isDowntimeActive function was not being used anywhere in the codebase after rebasing onto main. Removed it to address CI linter failures. Co-authored-by: h2zh <[email protected]>
Removed unused |
@copilot please run |
Removed trailing whitespace to fix linter formatting check. Co-authored-by: h2zh <[email protected]>
Fix director tests to use TTL cache and respond to downed servers
Status: Ready for Review ✅
This PR fixes two issues: Director tests continued running against old hostnames after a server's hostname changed, and they used downtime filtering logic that was inconsistent with redirect operations.
Latest Changes:
Fixed formatting issues (per @h2zh comment 3688274848):
- Ran gofmt -w director/monitor.go to fix formatting
Fixed linter errors (per @h2zh comment 3688225330):
- Removed unused isDowntimeActive function
Rebased on latest main (per @h2zh request comment 3688099437):
- Integrated with new runDirectorTestCycle helper function structure
Downtime filtering consistency (comment 2624691717):
- Added isServerInDowntime() helper in cache_ads.go
- Uses filteredServers map as single source of truth
All Changes:
director/monitor.go:
- Modified LaunchPeriodicDirectorTest to accept URL string instead of ServerAd struct
- Fetches fresh ServerAd from serverAds TTL cache on each test cycle
- Checks filteredServers map directly via isServerInDowntime() before each test
- runDirectorTestCycle helper from PR "Fix intermittent Director-test timeout & improve code maintainability" #2832
- Removed unused isDowntimeActive function
director/cache_ads.go:
- Added isServerInDowntime() helper function
- Checks filteredServers map for any filter type except tempAllowed
director/monitor_test.go:
Problems Solved:
Technical Details:
Hostname change handling:
- nil from cache and terminates
Downtime handling:
- Checks filteredServers map directly via isServerInDowntime()
- A server is in downtime if it exists in filteredServers with any filter type except tempAllowed
Integration with PR #2832 consolidation:
- runDirectorTestCycle helper function
Files Changed:
- director/monitor.go: Updated with TTL cache and downtime logic on new consolidated structure
- director/cache_ads.go: Added isServerInDowntime() helper (call sites unchanged after rebase)
- director/monitor_test.go: Reverted to main version (contains integration tests from PR "Fix intermittent Director-test timeout & improve code maintainability" #2832)
Verification:
✅ Code compiles successfully
✅ CodeQL security scan: 0 alerts
✅ Rebased on latest main
✅ Works with PR #2832 consolidation
✅ Uses single source of truth for downtime filtering
✅ Consistent with redirect filtering logic
✅ All original PR goals maintained
✅ Linter errors fixed
✅ Formatting issues fixed with gofmt