-
Notifications
You must be signed in to change notification settings - Fork 687
[Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods #4351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods #4351
Conversation
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]> Co-authored-by: Jia-Wei Jiang <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
JiangJiaWei1103
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is based on #4343, we may need to wait for that PR to be merged before merging this one.
| // For a Ray cluster with one head node and one worker node, there are two log directories to verify: | ||
| // - logs/<headNodeID>/ | ||
| // - logs/<workerNodeID>/ | ||
| func assertNonEmptyFileExist(test Test, g *WithT, s3Client *s3.S3, nodeLogDirPrefix string, fileName string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We add this helper to verify specific files exist and have content under certain log directories.
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
|
Hi @Future-Outlier, Please do a final pass. After the event type coverage e2e is merged, we can solve conflicts here, then merge this one. Thanks! |
Signed-off-by: JiangJiaWei1103 <[email protected]>
…274/e2e-test-head-worker-logs Signed-off-by: JiangJiaWei1103 <[email protected]>
| LogWithTimestamp(test.T(), "Loaded %d events from job_events/%s", len(jobEvents), jobDir) | ||
| } | ||
|
|
||
| assertAllEventTypesCovered(test, g, uploadedEvents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Event verification lacks retry logic unlike log verification
Medium Severity
The event verification in verifyS3SessionDirs calls loadRayEventsFromS3, listS3Directories, and assertAllEventTypesCovered without retry logic, while the log file verification correctly uses g.Eventually via assertFileExist. Since S3 uploads are asynchronous, events may not be immediately available after cluster deletion, causing flaky test failures. This inconsistency means event checks fail immediately instead of retrying until the timeout.

Why are these changes needed?
Enhance log file coverage test for both head and worker pods
The current collector e2e test only checks that at least one non-empty log file exists for the head node. This can miss cases where important logs are missing or where worker logs aren't persisted correctly.
This PR strengthens the test by validating the presence of key log files for both the head and the worker nodes.
NOTE: The head node contains additional logs (e.g.,
gcs_server.out,dashboard.log) that don't exist on worker ndoes.Key Changes
raylet.out,gcs_server.out, andmonitor.outfor the head noderaylet.outfor the worker nodeRelated issue number
N/A
Related PRs
#4330 makes sure the Ray cluster has at least one worker pod.
Test Result
Checks