
Conversation

@my-vegetable-has-exploded
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Dec 30, 2025

Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.

In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup before entering the watch loop. This ensures that any session or node folders left from previous runs (e.g., after a container restart) are correctly processed and uploaded.
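As a rough illustration of the startup behavior described above, the flow is: scan prev-logs once at startup, then keep watching for new entries. The sketch below is only an approximation; the fsnotify-based loop, struct fields, and helper names are assumptions for illustration, not the PR's exact code.

package collector

import (
	"context"

	"github.com/fsnotify/fsnotify"
	"github.com/sirupsen/logrus"
)

type collector struct {
	prevLogsDir string
}

// processPrevLogsDir stands in for the real scan that uploads leftover files
// and records them under persist-complete-logs.
func (c *collector) processPrevLogsDir() {}

func (c *collector) WatchPrevLogsLoops(ctx context.Context) {
	// Initial scan: pick up whatever a previous run left behind in prev-logs.
	c.processPrevLogsDir()

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		logrus.Errorf("create watcher: %v", err)
		return
	}
	defer watcher.Close()
	if err := watcher.Add(c.prevLogsDir); err != nil {
		logrus.Errorf("watch %s: %v", c.prevLogsDir, err)
		return
	}

	// Watch loop: re-scan whenever new session/node folders are created.
	for {
		select {
		case <-ctx.Done():
			return
		case event := <-watcher.Events:
			if event.Op&fsnotify.Create != 0 {
				c.processPrevLogsDir()
			}
		}
	}
}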

Why are these changes needed?

Related issue number

Closes #4281

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: my-vegetable-has-exploded <[email protected]>
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as ready for review December 30, 2025 11:30
}
}

// isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs
Contributor Author


isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoids duplicate uploads.
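For context, a minimal sketch of what such a check can look like; the mirrored-path layout under persist-complete-logs is an assumption based on this thread, not the PR's exact code.

package collector

import (
	"os"
	"path/filepath"
)

// isFileAlreadyPersisted reports whether the same relative path already exists
// under persistCompleteLogsDir; if it does, the file was uploaded before the
// crash and can be skipped to avoid a duplicate upload.
func isFileAlreadyPersisted(persistCompleteLogsDir, relPath string) bool {
	_, err := os.Stat(filepath.Join(persistCompleteLogsDir, relPath))
	return err == nil
}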

Member

@Future-Outlier Future-Outlier left a comment


Hi, thank you for this PR!
Do you mind providing a reproduction script like this one so I can test it in my env more easily?

https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#test-the-collector-on-the-kind-cluster

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Jan 1, 2026

Hi @my-vegetable-has-exploded, thanks for your contribution.

Would you mind modifying the PR title and description to clarify that this feature is actually for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts since we use an emptyDir volume.

If I have misunderstood anything, please correct me. Thanks!

Comment on lines 52 to +53
r.prevLogsDir = "/tmp/ray/prev-logs"
r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
Contributor


Could we move these to constants in historyserver/pkg/utils/utils.go, similar to KunWuLuan@9b2ca52?
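For illustration, the suggestion amounts to something like the following in historyserver/pkg/utils/utils.go (constant names are hypothetical):

package utils

// Hypothetical constants for the collector's working directories.
const (
	PrevLogsDir            = "/tmp/ray/prev-logs"
	PersistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
)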

Contributor


Hi @justinyeh1995,

Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.

Contributor

@justinyeh1995 justinyeh1995 Jan 1, 2026


No problem. I wasn't aware of that when I wrote the comment. Is the PR related to any issue, though? I think I need to catch up quite a bit.

Contributor


You can refer to this issue. Thanks a lot!

Contributor


Thanks for the heads up!

@my-vegetable-has-exploded my-vegetable-has-exploded changed the title feat(historyserver):re-push prev-logs on pod restart feat(history server):re-push prev-logs on container restart Jan 4, 2026
Member

@Future-Outlier Future-Outlier left a comment


plz also add e2e test in this PR.

@my-vegetable-has-exploded
Contributor Author

plz also add e2e test in this PR.

Got it! I'll add it tonight.

Member

@Future-Outlier Future-Outlier left a comment


cc @lorriexingfang to review


nit: consider using filepath functions to handle file paths.
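For example, something along these lines (variable and function names are illustrative):

package collector

import "path/filepath"

// logsDirFor builds the per-node logs path with filepath.Join instead of
// manual string concatenation, which avoids duplicated or missing separators.
func logsDirFor(prevLogsDir, sessionID, nodeID string) string {
	// Instead of: prevLogsDir + "/" + sessionID + "/" + nodeID + "/logs"
	return filepath.Join(prevLogsDir, sessionID, nodeID, "logs")
}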

Contributor Author


Thanks for the review. There are similar issues elsewhere in this file. Maybe we can file a new ticket to handle them?

Member

@Future-Outlier Future-Outlier left a comment


cursor review

@Future-Outlier
Member

bugbot run

@Future-Outlier
Member

cursor review

Signed-off-by: my-vegetable-has-exploded <[email protected]>
persistDir,
)
_, stderr := ExecPodCmd(test, headPod, "ray-head", []string{"sh", "-c", injectCmd})
g.Expect(stderr.String()).To(BeEmpty())


E2E test race condition between kill and file injection

Medium Severity

The testCollectorResumesUploadsOnRestart test has a race condition. After killing the collector (line 503), the test immediately injects files (lines 508-525) without waiting for the collector to actually stop. Since Kubernetes restarts the collector asynchronously, the collector could restart, scan prev-logs, find the partially-created directory structure, and call os.RemoveAll on the node directory before the inject command completes. This would cause shell commands like echo to fail when writing to a deleted directory, leading to stderr output and test failure at line 526. The comment says "while collector is down" but there's no synchronization to ensure this.
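One possible mitigation, sketched below, is to stage the injected files outside prev-logs and move the whole session directory in with a single mv, so a restarting collector never observes a half-created session/node directory. This is an assumption-laden sketch, not the PR's test code; all paths and names are illustrative.

package e2e

import "fmt"

// buildAtomicInjectCmd stages the leftover files in a temporary directory and
// then moves the session directory into prev-logs in one rename, so the
// collector either sees nothing or sees the complete structure.
func buildAtomicInjectCmd(prevLogsDir, sessionID, nodeID string) string {
	staging := "/tmp/ray/.inject-staging"
	return fmt.Sprintf(
		"mkdir -p %[1]s/%[2]s/%[3]s/logs && "+
			"echo 'leftover log line' > %[1]s/%[2]s/%[3]s/logs/worker.log && "+
			"mkdir -p %[4]s && mv %[1]s/%[2]s %[4]s/",
		staging, sessionID, nodeID, prevLogsDir,
	)
}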


Signed-off-by: my-vegetable-has-exploded <[email protected]>
@Future-Outlier
Member

cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you!

Signed-off-by: my-vegetable-has-exploded <[email protected]>
Comment on lines 505 to 507
// 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down.
// Note: We only inject logs, not node_events, because the collector's prev-logs processing
// currently only handles the logs directory. node_events are handled by the EventServer separately.
Contributor


Just curious. How can we ensure that the injection is completed during the collector downtime?

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 11, 2026


Because Kubernetes restarts the collector immediately after kill 1, I can't absolutely guarantee the process stays down until injection finishes. But the injection is short-lived, so the current approach works well in my local environment.
If you have a better way to handle this, I’d be glad to implement it.

Contributor


I agree that the injection completes quickly while the collector takes some time to become ready, and this approach works in my local environment as well. At the moment, I don’t have a more robust solution in mind.

cc @seanlaii @CheyuWu Do you have any suggestions? Thanks so much!

@CheyuWu CheyuWu self-requested a review January 10, 2026 01:49
return
}

// Check if this directory has already been processed by checking in persist-complete-logs
Contributor

@JiangJiaWei1103 JiangJiaWei1103 Jan 10, 2026


No changes are needed here. This note is just to clarify that we still need to check for leftover log files, since the presence of log directories under the persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.

Contributor Author


Thanks for the review! If the Collector crashes halfway through a node directory, the persist-complete-logs entry already exists but some logs remain in prev-logs. Keeping this check would cause the Collector to skip the entire directory upon restart, leading to data loss. The current file-level check isFileAlreadyPersisted handles this "partial success" scenario correctly without duplicate uploads.
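To make the distinction concrete, the file-level handling described here can be sketched roughly as follows; function and parameter names are illustrative, not the PR's exact code.

package collector

import (
	"io/fs"
	"os"
	"path/filepath"
)

// processLeftoverLogs walks a leftover logs directory and skips only the
// individual files already mirrored under persist-complete-logs, instead of
// skipping the whole node directory, so a partially processed directory is
// resumed without duplicate uploads.
func processLeftoverLogs(prevLogsDir, persistCompleteLogsDir, sessionID, nodeID string,
	upload func(path string) error) error {
	logsDir := filepath.Join(prevLogsDir, sessionID, nodeID, "logs")
	return filepath.WalkDir(logsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, relErr := filepath.Rel(prevLogsDir, path)
		if relErr != nil {
			return relErr
		}
		// Already persisted by the previous run: skip to avoid a duplicate upload.
		if _, statErr := os.Stat(filepath.Join(persistCompleteLogsDir, rel)); statErr == nil {
			return nil
		}
		return upload(path)
	})
}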

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 10, 2026


If you replace it with the following code, you can trigger this case by running: cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess. @JiangJiaWei1103

	completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
	if _, err := os.Stat(completeDir); err == nil {
		logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
		return
	}

The e2e test covers this case as well.

Contributor


Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note for maintainers to get why we removed the code snippet here. All tests pass in my local env. Thanks!

}

// Walk through the logs directory and process all files
err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error {
Contributor Author


When the collector restarts, WatchPrevLogsLoops calls processPrevLogsDir, so leftover log files in prev-logs/{sessionID}/{nodeID}/logs/ directories are processed.

Co-authored-by: 江家瑋 <[email protected]>
Signed-off-by: yi wang <[email protected]>
Signed-off-by: my-vegetable-has-exploded <[email protected]>
@Future-Outlier Future-Outlier self-assigned this Jan 12, 2026


Development

Successfully merging this pull request may close these issues.

[history server][collector] Implement strategy for handling duplicate logs on pod restart

5 participants