Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 7, 2026

Why are these changes needed?

Enhance log file coverage test for both head and worker pods

The current collector e2e test only checks that at least one non-empty log file exists for the head node. This can miss cases where important logs are missing or where worker logs aren't persisted correctly.

This PR strengthens the test by validating the presence of key log files for both the head and the worker nodes.

NOTE: The head node contains additional logs (e.g., gcs_server.out, dashboard.log) that don't exist on worker nodes.

Key Changes

  • Verify existence of raylet.out, gcs_server.out, and monitor.out for the head node
  • Verify existence of raylet.out for the worker node
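The per-role expectations above can be sketched as a small lookup table. This is only an illustration and not the code in this PR: `requiredLogs`, `missingLogs`, and the name-to-size map standing in for an S3 listing are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredLogs lists the files named in "Key Changes" for each node role.
// The map itself is a hypothetical structure used only for this sketch.
var requiredLogs = map[string][]string{
	"head":   {"raylet.out", "gcs_server.out", "monitor.out"},
	"worker": {"raylet.out"},
}

// missingLogs returns the required files for the given role that are absent
// or empty. sizes maps file name to size in bytes and stands in for the
// result of listing one node's log directory.
func missingLogs(role string, sizes map[string]int64) []string {
	var missing []string
	for _, name := range requiredLogs[role] {
		if size, ok := sizes[name]; !ok || size == 0 {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Head listing without monitor.out; dashboard.log is present but not required.
	head := map[string]int64{"raylet.out": 2048, "gcs_server.out": 512, "dashboard.log": 100}
	fmt.Println(missingLogs("head", head)) // prints [monitor.out]

	worker := map[string]int64{"raylet.out": 128}
	fmt.Println(missingLogs("worker", worker)) // prints []
}
```

Treating an empty file the same as a missing one matches the intent of checking for non-empty log files.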

Related issue number

N/A

Related PRs

#4330 makes sure the Ray cluster has at least one worker pod.

Test Result

[Screenshot of the test result, captured 2026-01-12 at 8:12:00 PM]

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Future-Outlier and others added 14 commits January 5, 2026 21:10
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Jia-Wei Jiang <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Contributor Author

@JiangJiaWei1103 JiangJiaWei1103 left a comment

Since this PR is based on #4343, we may need to wait for that PR to be merged before merging this one.

// For a Ray cluster with one head node and one worker node, there are two log directories to verify:
// - logs/<headNodeID>/
// - logs/<workerNodeID>/
func assertNonEmptyFileExist(test Test, g *WithT, s3Client *s3.S3, nodeLogDirPrefix string, fileName string) {
Contributor Author

This helper verifies that a specific file exists and is non-empty under a given node's log directory.
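For readers without the diff open, here is a minimal sketch of the idea behind such a helper. It is not the implementation in this PR: `objectSizer`, `fakeStore`, and `checkNonEmptyFile` are hypothetical names, an in-memory map replaces the S3 client, and the node ID in the example keys is made up.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// objectSizer is a hypothetical, minimal stand-in for the S3 client.
type objectSizer interface {
	Size(key string) (int64, error)
}

var errNotFound = errors.New("object not found")

// fakeStore maps object keys to sizes in bytes.
type fakeStore map[string]int64

func (f fakeStore) Size(key string) (int64, error) {
	size, ok := f[key]
	if !ok {
		return 0, errNotFound
	}
	return size, nil
}

// checkNonEmptyFile reports an error unless <nodeLogDirPrefix>/<fileName>
// exists in the store and has a size greater than zero.
func checkNonEmptyFile(store objectSizer, nodeLogDirPrefix, fileName string) error {
	key := strings.TrimSuffix(nodeLogDirPrefix, "/") + "/" + fileName
	size, err := store.Size(key)
	if err != nil {
		return fmt.Errorf("look up %q: %w", key, err)
	}
	if size == 0 {
		return fmt.Errorf("%q exists but is empty", key)
	}
	return nil
}

func main() {
	store := fakeStore{
		"logs/head-node-id/raylet.out":     1024,
		"logs/head-node-id/gcs_server.out": 0,
	}
	for _, name := range []string{"raylet.out", "gcs_server.out", "monitor.out"} {
		if err := checkNonEmptyFile(store, "logs/head-node-id/", name); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", name)
	}
}
```

Returning an error instead of asserting directly keeps the check easy to wrap in a retry loop.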

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review January 12, 2026 12:39
@JiangJiaWei1103 JiangJiaWei1103 moved this from Work in progress to In review in My Kuberay & Ray Jan 12, 2026
Member

@Future-Outlier Future-Outlier left a comment

cc @seanlai @cheyu to take a look

@Future-Outlier Future-Outlier self-assigned this Jan 13, 2026
Signed-off-by: Future-Outlier <[email protected]>
Member

@Future-Outlier Future-Outlier left a comment

lgtm, thank you!

@Future-Outlier Future-Outlier changed the title [Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods [Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods and Ensure event type coverage Jan 14, 2026
@JiangJiaWei1103
Contributor Author

Hi @Future-Outlier,

Please do a final pass. Once the event type coverage e2e test is merged, we can resolve the conflicts here and then merge this one. Thanks!

@Future-Outlier Future-Outlier changed the title [Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods and Ensure event type coverage [Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods Jan 15, 2026
Signed-off-by: JiangJiaWei1103 <[email protected]>
@Future-Outlier Future-Outlier moved this from to review to can be merged in @Future-Outlier's kuberay project Jan 15, 2026
…274/e2e-test-head-worker-logs

Signed-off-by: JiangJiaWei1103 <[email protected]>
LogWithTimestamp(test.T(), "Loaded %d events from job_events/%s", len(jobEvents), jobDir)
}

assertAllEventTypesCovered(test, g, uploadedEvents)

Event verification lacks retry logic unlike log verification

Medium Severity

The event verification in verifyS3SessionDirs calls loadRayEventsFromS3, listS3Directories, and assertAllEventTypesCovered without retry logic, while the log file verification correctly uses g.Eventually via assertFileExist. Since S3 uploads are asynchronous, events may not be immediately available after cluster deletion, causing flaky test failures. This inconsistency means event checks fail immediately instead of retrying until the timeout.
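The retry behavior described above can be sketched with the standard library alone. This is a stand-in for illustration, not the test's code: the test itself is reported to use `g.Eventually`, while `eventually` and `errNotReady` below are hypothetical names, and the simulated check merely pretends that events become visible on the third poll.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error meaning "uploaded events not visible yet".
var errNotReady = errors.New("event types not yet uploaded")

// eventually calls check every interval until it returns nil or the timeout
// elapses. On timeout it returns the last error from check, wrapped.
func eventually(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Simulated event verification: succeeds on the third poll.
	err := eventually(time.Second, 10*time.Millisecond, func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println(err, attempts) // prints <nil> 3
}
```

Wrapping the load-and-assert step for events in a loop like this would give it the same tolerance for asynchronous uploads that the log file check already has.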
