-
Notifications
You must be signed in to change notification settings - Fork 683
feat(history server):re-push prev-logs on container restart #4321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(history server):re-push prev-logs on container restart #4321
Conversation
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
| } | ||
| } | ||
|
|
||
| // isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoid duplicate uploads.
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for this PR!
do you mind provide a reproduction script like this for me to test on my env more easily?
|
Hi @my-vegetable-has-exploded, thanks for your contribution. Would you mind modifying the PR title and description to clarify that this feature is actually for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts since we use an If I have misunderstood anything, please correct me. Thanks! |
| r.prevLogsDir = "/tmp/ray/prev-logs" | ||
| r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move these to constant in historyserver/pkg/utils/utils.go similar to KunWuLuan@9b2ca52?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @justinyeh1995,
Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem. I wasn't aware of that when I wrote the comment. Is the pr related to any issue though? I think I need to catch up quite a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can refer to this issue. Thanks a lot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for heads up!
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plz also add e2e test in this PR.
Get! I would add it tonight. |
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @lorriexingfang to review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider using filepath functions to handle file path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review. There are some same issues in this file. Maybe we can file a new ticket to handle it?
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cursor review
|
bugbot run |
|
cursor review |
historyserver/pkg/collector/logcollector/runtime/logcollector/collector.go
Outdated
Show resolved
Hide resolved
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
| persistDir, | ||
| ) | ||
| _, stderr := ExecPodCmd(test, headPod, "ray-head", []string{"sh", "-c", injectCmd}) | ||
| g.Expect(stderr.String()).To(BeEmpty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E2E test race condition between kill and file injection
Medium Severity
The testCollectorResumesUploadsOnRestart test has a race condition. After killing the collector (line 503), the test immediately injects files (lines 508-525) without waiting for the collector to actually stop. Since Kubernetes restarts the collector asynchronously, the collector could restart, scan prev-logs, find the partially-created directory structure, and call os.RemoveAll on the node directory before the inject command completes. This would cause shell commands like echo to fail when writing to a deleted directory, leading to stderr output and test failure at line 526. The comment says "while collector is down" but there's no synchronization to ensure this.
Signed-off-by: my-vegetable-has-exploded <[email protected]>
|
cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you! |
Signed-off-by: my-vegetable-has-exploded <[email protected]>
| // 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down. | ||
| // Note: We only inject logs, not node_events, because the collector's prev-logs processing | ||
| // currently only handles the logs directory. node_events are handled by the EventServer separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious. How can we ensure that the injection is completed during the collector downtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because Kubernetes restarts the collector immediately after kill 1, I can’t absolutely guarantee the process stays down until injection finishes. But the injection is short-lived, so current approach works well in my local environment.
If you have a better way to handle this, I’d be glad to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
historyserver/pkg/collector/logcollector/runtime/logcollector/collector.go
Outdated
Show resolved
Hide resolved
historyserver/pkg/collector/logcollector/runtime/logcollector/collector.go
Outdated
Show resolved
Hide resolved
historyserver/pkg/collector/logcollector/runtime/logcollector/collector.go
Outdated
Show resolved
Hide resolved
| return | ||
| } | ||
|
|
||
| // Check if this directory has already been processed by checking in persist-complete-logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changes are needed here. This note is just to clarify that we still need to check for leftover log files since the presence of log directories under persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! If the Collector crashes halfway through a node directory, persist-complete-logs already exists but some logs remain in prev-logs. Keeping this check would cause the Collector to skip the entire directory upon restart, leading to data loss. The current file-level check isFileAlreadyPersisted will handles this "partial success" scenario correctly without duplicate uploads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing it with following codes then you can trigger this case using cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess.@JiangJiaWei1103
completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
if _, err := os.Stat(completeDir); err == nil {
logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
return
}
And e2e test also covers this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note for maintainers to get why we removed the code snippet here. All tests pass in my local env. Thanks!
| } | ||
|
|
||
| // Walk through the logs directory and process all files | ||
| err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When collector restart, WatchPrevLogsLoops will call processPrevLogsDir, so we can process leftover log files in prev-logs/{sessionID}/{nodeID}/logs/ directories.
Co-authored-by: 江家瑋 <[email protected]> Signed-off-by: yi wang <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.
In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup before entering the watch loop. This ensures that any session or node folders left from previous runs (e.g., after a container restart) are correctly processed and uploaded.
Why are these changes needed?
Related issue number
Closes #4281
Checks