feat: implement recorder and replayer #403

Open
wants to merge 2 commits into
base: master
from

Conversation

saza-ku
Contributor

@saza-ku saza-ku commented Jan 21, 2025

What type of PR is this?

/area simulator
/kind feature

What this PR does / why we need it:

This PR implements a recorder and a replayer.

The recorder continuously saves events from an external cluster to a JSON file. The replayer reads that JSON file when the simulator starts and reproduces the events in the simulator cluster.
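
As a rough illustration of the record-then-replay idea, here is a minimal Go sketch. The `Record` layout, the `Event` constants, and the `Encode`/`Decode`/`Replay` helpers are assumptions made for this example; they are not taken from the PR's code, and the real file format may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event names the kind of change observed on a resource (hypothetical names).
type Event string

const (
	Add    Event = "Add"
	Update Event = "Update"
	Delete Event = "Delete"
)

// Record is a hypothetical on-disk shape for one recorded change.
type Record struct {
	Time     time.Time       `json:"time"`
	Event    Event           `json:"event"`
	Resource json.RawMessage `json:"resource"`
}

// Encode serializes records the way a recorder might before writing a file.
func Encode(records []Record) ([]byte, error) {
	return json.Marshal(records)
}

// Decode parses recorded JSON back into records for replay.
func Decode(data []byte) ([]Record, error) {
	var records []Record
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, fmt.Errorf("decode records: %w", err)
	}
	return records, nil
}

// Replay hands each record, in recorded order, to apply and stops on the first error.
func Replay(records []Record, apply func(Record) error) error {
	for i, r := range records {
		if err := apply(r); err != nil {
			return fmt.Errorf("replay record %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	pod := json.RawMessage(`{"kind":"Pod","metadata":{"name":"pod-1"}}`)
	data, err := Encode([]Record{
		{Time: time.Unix(0, 0).UTC(), Event: Add, Resource: pod},
		{Time: time.Unix(1, 0).UTC(), Event: Delete, Resource: pod},
	})
	if err != nil {
		panic(err)
	}
	records, err := Decode(data)
	if err != nil {
		panic(err)
	}
	// A real replayer would call the simulator's API server here instead of printing.
	_ = Replay(records, func(r Record) error {
		fmt.Println(r.Event, string(r.Resource))
		return nil
	})
}
```

Keeping the resource as `json.RawMessage` lets the sketch stay independent of any Kubernetes types; an actual implementation would more likely store typed or unstructured objects.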

Which issue(s) this PR fixes:

Fixes #395

Special notes for your reviewer:

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added area/simulator Issues or PRs related to the simulator. kind/feature Categorizes issue or PR as related to a new feature. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 21, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: saza-ku
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 21, 2025
@k8s-ci-robot
Contributor

Hi @saza-ku. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 21, 2025
@saza-ku saza-ku marked this pull request as draft January 21, 2025 08:54
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2025
@saza-ku saza-ku marked this pull request as ready for review January 21, 2025 09:17
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2025
@sanposhiho
Member

/cc @utam0k @ordovicia

@k8s-ci-robot
Contributor

@sanposhiho: GitHub didn't allow me to request PR reviews from the following users: ordovicia.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @utam0k @ordovicia

@k8s-ci-robot k8s-ci-robot requested a review from utam0k January 27, 2025 03:46
@sanposhiho
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2025
Member

@sanposhiho sanposhiho left a comment

I just took a look at the doc.
I'll take a look at the implementation after @utam0k / @ordovicia.


1. Set `true` to `recorderEnabled`.
2. Set the path of the kubeconfig file for your cluster to `KubeConfig`.
- This feature only requires the read permission for events.
Member

Does it? It looks like recorder.go has event handlers, though. I guess the read permissions for all resources to be recorded are required.

Contributor Author

> the read permissions for all resources to be recorded are required.

That's right.
"This feature only requires the read permission for resources." would be the correct wording.

Comment on lines 27 to 28
> [!NOTE]
> When a file already exists at `recordedFilePath`, it backs up the file in the same directory adding a timestamp to the filename and creates a new file for recording.
Member

@sanposhiho sanposhiho Jan 27, 2025

Do we need it? I'd prefer to keep it simple: either just overwrite the file without an error, or make the recorder report an error.

Contributor Author

I'll change it to report an error.
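
One way to get the "report an error if the file already exists" behavior in Go is to open the file with `os.O_EXCL`. The sketch below is only an illustration of that option; `createRecordFile` and `demoCreateTwice` are hypothetical names, not functions from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// createRecordFile creates path for writing and fails if it already exists,
// instead of backing up or overwriting the existing file.
func createRecordFile(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if errors.Is(err, fs.ErrExist) {
		return nil, fmt.Errorf("record file %q already exists; remove it or choose another path: %w", path, err)
	}
	return f, err
}

// demoCreateTwice creates the same file twice in a temporary directory and
// returns the error from each attempt.
func demoCreateTwice() (first, second error) {
	dir, err := os.MkdirTemp("", "recorder")
	if err != nil {
		return err, nil
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "record.json")
	f, first := createRecordFile(path)
	if first != nil {
		return first, nil
	}
	f.Close()

	_, second = createRecordFile(path)
	return nil, second
}

func main() {
	first, second := demoCreateTwice()
	fmt.Println("first attempt error:", first)
	fmt.Println("second attempt error:", second)
}
```

Because `O_EXCL` makes the existence check and the creation a single operation, there is no window in which another process could create the file between a separate `os.Stat` call and the open.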

@@ -0,0 +1,95 @@
# [Beta] Record your real cluster's events and replay them in the simulator

You can record events from your real cluster and replay them in the simulator. This feature is useful for reproducing issues that occur in your real cluster.
Member

Please rephrase "events" in this doc. It's unclear what it means to first-time readers.

Contributor Author

I rephrased it as "changes" or "changes in resources". How does that look?

Member

It's probably better to be more straightforward, like "You can record resource additions/updates/deletions in your real cluster".

@@ -0,0 +1,95 @@
# [Beta] Record your real cluster's events and replay them in the simulator
Member

You need to link to this page from somewhere (the README?) so that people notice it.


You can record events from your real cluster and replay them in the simulator. This feature is useful for reproducing issues that occur in your real cluster.

## Record events
Member

So, basically it has two steps: recording resources into a JSON file, and replaying it on the simulator.
People have to record resources by starting the simulator with recorderEnabled, then shut down the simulator, and then restart it with the JSON file and replayerEnabled.

The first step looks weird to me because it requires running the simulator, but essentially it just imports the resources from a real cluster and outputs the JSON file; the simulator doesn't come into play.

So... well, I guess we should create a CLI for it?
It could be recorder (a simple standalone CLI) or sched-simulator record (the sched-simulator CLI with a record subcommand).
Then the flow would be:

  1. Users record resource creation/update/deletion with sched-simulator record --duration 5m or something similar.
  2. After 5 minutes, the CLI generates a JSON file.
  3. Users start the simulator with the JSON file and replayerEnabled.
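
The flow above could be sketched in Go roughly as follows, using only the standard `flag` and `context` packages. This is an assumption-laden outline: the `--duration` flag comes from the comment above, while the `--path` flag, the `recordOptions` type, and the `record` function are hypothetical placeholders for whatever the real CLI ends up providing.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"time"
)

// recordOptions holds the parsed flags of the hypothetical record subcommand.
type recordOptions struct {
	duration time.Duration
	path     string
}

// parseRecordArgs parses the arguments that follow "record" on the command line.
func parseRecordArgs(args []string) (recordOptions, error) {
	var o recordOptions
	fs := flag.NewFlagSet("record", flag.ContinueOnError)
	fs.DurationVar(&o.duration, "duration", 5*time.Minute, "how long to record")
	fs.StringVar(&o.path, "path", "record.json", "output JSON file (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return recordOptions{}, err
	}
	if o.duration <= 0 {
		return recordOptions{}, fmt.Errorf("--duration must be positive, got %s", o.duration)
	}
	return o, nil
}

// record stands in for the real recording loop: it only waits until the
// context expires, which is the point where the JSON file would be finalized.
func record(ctx context.Context, o recordOptions) error {
	<-ctx.Done()
	fmt.Printf("recorded for %s; would write %s\n", o.duration, o.path)
	return nil
}

func main() {
	if len(os.Args) < 2 || os.Args[1] != "record" {
		fmt.Println("usage: sched-simulator record --duration 5m")
		return
	}
	o, err := parseRecordArgs(os.Args[2:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// The timeout implements step 2: recording stops once the duration elapses.
	ctx, cancel := context.WithTimeout(context.Background(), o.duration)
	defer cancel()
	if err := record(ctx, o); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Driving the recording loop from a context also leaves room to stop early on a signal such as Ctrl-C without changing the `record` function.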

Contributor Author

Definitely. I'll create a new CLI in simulator/cmd/recorder.

@utam0k
Member

utam0k commented Jan 27, 2025

> I just took a look at the doc.
> I'll take a look at the implementation after @utam0k / @ordovicia.

Sure! I'll review it this week.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jan 27, 2025

@saza-ku: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kube-scheduler-simulator-backend-lint | dcde17d | link | true | `/test pull-kube-scheduler-simulator-backend-lint` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

records := []Record{}
for {
	select {
	case r := <-s.recordCh:
Contributor

@ordovicia ordovicia Jan 27, 2025

If I understand correctly, on every event (resource creation/update/deletion) this method serializes all events that have happened so far and writes them to a file, which is a costly operation.
In addition, recordCh is unbuffered, so RecordEvent blocks until record finishes handling the in-flight event.

I guess we need some optimizations like the following; otherwise the recorder's performance will be limited.

  • Asynchronous: Change recordCh to a buffered channel so that RecordEvent does not block.
  • Buffering: Save some records in a buffer and write them together as a chunk to reduce file system operations.
  • Incremental: Write each record chunk to a separate file to avoid writing all records many times.
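
To make the three suggestions concrete, here is a minimal Go sketch that combines them. It is an illustration under assumptions, not the PR's code: the `Recorder` fields, the `records-%06d.json` chunk naming, and the `runDemo` helper are all hypothetical, and a real implementation would probably also flush on a timer so that a partially filled batch does not wait indefinitely.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Record is a hypothetical shape for one recorded change.
type Record struct {
	Event    string          `json:"event"`
	Resource json.RawMessage `json:"resource"`
}

// Recorder batches records and writes each batch to its own chunk file.
type Recorder struct {
	recordCh  chan Record // buffered, so RecordEvent rarely blocks
	dir       string      // directory that receives the chunk files
	batchSize int         // flush once this many records are pending
}

// RecordEvent enqueues a record; it blocks only when the buffer is full.
func (r *Recorder) RecordEvent(rec Record) { r.recordCh <- rec }

// Run drains recordCh until ctx is cancelled, then writes what is left.
func (r *Recorder) Run(ctx context.Context) error {
	batch := make([]Record, 0, r.batchSize)
	chunk := 0
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		data, err := json.Marshal(batch)
		if err != nil {
			return err
		}
		name := filepath.Join(r.dir, fmt.Sprintf("records-%06d.json", chunk))
		chunk++
		batch = batch[:0]
		return os.WriteFile(name, data, 0o644)
	}
	for {
		select {
		case rec := <-r.recordCh:
			batch = append(batch, rec)
			if len(batch) >= r.batchSize {
				if err := flush(); err != nil {
					return err
				}
			}
		case <-ctx.Done():
			for { // take whatever is still queued, then write the final chunk
				select {
				case rec := <-r.recordCh:
					batch = append(batch, rec)
				default:
					return flush()
				}
			}
		}
	}
}

// runDemo records n events, stops the recorder, and counts the records
// found across all chunk files.
func runDemo(n int) (int, error) {
	dir, err := os.MkdirTemp("", "records")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	r := &Recorder{recordCh: make(chan Record, 64), dir: dir, batchSize: 10}
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() { done <- r.Run(ctx) }()
	for i := 0; i < n; i++ {
		r.RecordEvent(Record{Event: "Add", Resource: json.RawMessage(fmt.Sprintf(`{"name":"pod-%d"}`, i))})
	}
	cancel()
	if err := <-done; err != nil {
		return 0, err
	}
	files, err := filepath.Glob(filepath.Join(dir, "records-*.json"))
	if err != nil {
		return 0, err
	}
	total := 0
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return 0, err
		}
		var recs []Record
		if err := json.Unmarshal(data, &recs); err != nil {
			return 0, err
		}
		total += len(recs)
	}
	return total, nil
}

func main() {
	total, err := runDemo(25)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("records written:", total)
}
```

Each record is serialized exactly once, and a replayer could read the chunk files back in filename order to reconstruct the full sequence.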

@saza-ku
Contributor Author

saza-ku commented Jan 29, 2025

@utam0k I have quite a few points to revise, so it's okay to review this PR next week :)

Labels
area/simulator Issues or PRs related to the simulator. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

simulator: record events on prod cluster and replay them on a fake cluster any time
5 participants