Stability enhancement during overload conditions #20492
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: silentred. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Hi @silentred. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@ahrtr would you mind taking a look?
Force-pushed from 099461d to 3a13982.
Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?
It happened several times in production. One of the cases was that K8s event keys grew to 50Mi because leases could not be revoked successfully, which resulted in leases living longer than they should. This patch has not been deployed to our production yet, but if we apply this PR, everything works fine.
Thanks for the feedback; please resolve the review comments.
Oh, that's a lot. I have never seen an etcd cluster be stable after crossing 2M. Have you verified that this change really would fix your production issue? It might move the needle, but we are talking about a 25x improvement here.
etcdserver was not working properly in that situation. The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8s events.
This patch has not been deployed to production; it is not a common issue, maybe once every half year. But I think my simulation test above can verify that this patch works.
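For readers who want to check for this condition themselves, here is a minimal diagnostic sketch (not part of the PR; it assumes a reachable endpoint at localhost:2379 and uses clientv3) that lists each lease's granted versus remaining TTL and attached key count:

```go
// Diagnostic sketch (hypothetical, not from this PR): list leases and show
// their granted vs. remaining TTL and how many keys are attached, to spot
// leases that have outlived their grant because revocation is lagging.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	resp, err := cli.Leases(ctx)
	if err != nil {
		panic(err)
	}
	for _, l := range resp.Leases {
		ttl, err := cli.TimeToLive(ctx, l.ID, clientv3.WithAttachedKeys())
		if err != nil {
			continue
		}
		fmt.Printf("lease %016x granted=%ds remaining=%ds keys=%d\n",
			int64(l.ID), ttl.GrantedTTL, ttl.TTL, len(ttl.Keys))
	}
}
```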
Force-pushed from 3a13982 to 4a79972.
That's recommended practice. Here's my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s
What test?
Thanks, I will check it. |
/ok-to-test
The PR looks good. The only comment is that there isn't a real comparison of performance results.
I am planning to add a FeatureGate for this. Also, I will make a benchmark comparison with the master branch and post it here. Thanks.
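To make the FeatureGate idea concrete, here is a simplified, self-contained sketch; the gate name and the map-backed gate type are hypothetical stand-ins for etcd's real feature-gate plumbing:

```go
// Simplified stand-in for a feature gate; etcd's real featuregate package is
// richer. The gate name "PrioritizeLeaseRevoke" is hypothetical.
package main

import "fmt"

type featureGate map[string]bool

func (fg featureGate) Enabled(name string) bool { return fg[name] }

const prioritizeLeaseRevoke = "PrioritizeLeaseRevoke"

func main() {
	gates := featureGate{prioritizeLeaseRevoke: true}

	// The admission path would consult the gate before letting LeaseRevoke
	// bypass the overload backpressure check.
	if gates.Enabled(prioritizeLeaseRevoke) {
		fmt.Println("LeaseRevoke bypasses the overload check")
	} else {
		fmt.Println("all request types are subject to the overload check")
	}
}
```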
Codecov Report ❌ Patch coverage is
Additional details and impacted files
... and 24 files with indirect coverage changes

@@ Coverage Diff @@
## main #20492 +/- ##
==========================================
+ Coverage 69.12% 69.15% +0.02%
==========================================
Files 422 422
Lines 34826 34835 +9
==========================================
+ Hits 24073 24089 +16
+ Misses 9352 9346 -6
+ Partials 1401 1400 -1

Continue to review the full report in Codecov by Sentry.
Force-pushed from 4a79972 to feeb0ae.
Did some basic testing, and I'm not sure this addresses the main problem. Simple scenario: grant leases with a TTL of 5 seconds for 1 minute and see if they are properly revoked. The results:
While I think prioritizing LeaseGrant helps somewhat, I'm not sure it's worth adding a special case that doesn't address the underlying issue: the lease revocation rate is too slow.
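The test code attached to this comment did not survive extraction; a minimal sketch of the scenario described above (assuming clientv3 and a single endpoint at localhost:2379; the real test presumably drove requests at a controlled rate) might look like:

```go
// Sketch of the described scenario: grant 5s-TTL leases for one minute
// without keepalives, then count how many leases the server still reports.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	deadline := time.Now().Add(time.Minute)
	granted := 0
	for time.Now().Before(deadline) {
		if _, err := cli.Grant(ctx, 5); err == nil { // 5-second TTL, never renewed
			granted++
		}
	}

	// Wait a bit beyond the TTL, then see how many leases are still live;
	// ideally this should be close to zero.
	time.Sleep(15 * time.Second)
	resp, err := cli.Leases(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Printf("granted=%d still-live=%d\n", granted, len(resp.Leases))
}
```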
Hi @serathius, thanks for your feedback. I think you are addressing a different problem, and your test actually shows this PR having a positive effect in some way. Let me explain.
Q: Why
Q: Why is there no "too many requests" error under 500 QPS?
Q: Why
Q: Why is this a different problem?
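For context on the "too many requests" error discussed here: etcd rejects proposals when the applied index lags too far behind the committed index. The sketch below is a self-contained restatement of that idea plus a hypothetical LeaseRevoke bypass; it is conceptual and is not the PR's actual diff:

```go
// Conceptual sketch of etcd's overload backpressure: proposals are rejected
// when apply lags commit by more than a fixed gap. The LeaseRevoke bypass
// shown here is hypothetical and only illustrates the PR's intent.
package main

import (
	"errors"
	"fmt"
)

const maxGapBetweenApplyAndCommitIndex = 5000 // gap etcd tolerates today

var errTooManyRequests = errors.New("etcdserver: too many requests")

type request struct {
	isLeaseRevoke bool
}

func admit(committed, applied uint64, r request, prioritizeLeaseRevoke bool) error {
	if committed > applied+maxGapBetweenApplyAndCommitIndex {
		if prioritizeLeaseRevoke && r.isLeaseRevoke {
			return nil // hypothetical: revokes are still admitted under overload
		}
		return errTooManyRequests
	}
	return nil
}

func main() {
	// Overloaded server: apply lags commit well past the allowed gap.
	fmt.Println(admit(100_000, 10_000, request{isLeaseRevoke: false}, true)) // rejected
	fmt.Println(admit(100_000, 10_000, request{isLeaseRevoke: true}, true))  // admitted (<nil>)
}
```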
@silentred Makes sense. Could you propose a benchmark that would better match the setup you are targeting? I think Kubernetes is allocating up to 5k keys per lease. Would a mix of lease and TXN requests in that proportion work?
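One way to sketch such a mixed workload (parameters and endpoint are illustrative; assuming clientv3): grant a lease, then attach a batch of keys to it with Txn writes in roughly the proposed proportion:

```go
// Sketch of a Kubernetes-like mixed workload: one lease per batch of keys,
// with each key written via a Txn and attached to the lease. keysPerLease,
// batch count, and TTL are illustrative choices.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const keysPerLease = 5000

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	for batch := 0; batch < 10; batch++ {
		lease, err := cli.Grant(ctx, 60) // 60s TTL, no keepalive
		if err != nil {
			fmt.Println("grant failed:", err)
			continue
		}
		for i := 0; i < keysPerLease; i++ {
			key := fmt.Sprintf("/bench/%d/%d", batch, i)
			_, err := cli.Txn(ctx).
				Then(clientv3.OpPut(key, "v", clientv3.WithLease(lease.ID))).
				Commit()
			if err != nil {
				fmt.Println("txn failed:", err)
			}
		}
	}
}
```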
I have some new findings from a benchmark test. I added the following code to the benchmark PR: #20819. The result compares LeaseRevoke behavior under different QPS limits over 1 min.
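The exact code added to the benchmark is not captured here (see PR #20819 for the real change); a rough sketch of driving lease grants under a fixed QPS cap, assuming golang.org/x/time/rate, could be:

```go
// Sketch of a QPS-limited lease workload over a one-minute window; the
// 500 QPS cap is illustrative, matching one of the limits discussed above.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	limiter := rate.NewLimiter(rate.Limit(500), 1)
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	granted, failed := 0, 0
	for {
		if err := limiter.Wait(ctx); err != nil {
			break // one-minute window elapsed
		}
		if _, err := cli.Grant(ctx, 5); err != nil {
			failed++
			continue
		}
		granted++
	}
	fmt.Printf("granted=%d failed=%d\n", granted, failed)
}
```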
Force-pushed from 771a3b3 to 5fd55a2.
Please send a PR to add a benchmark so we can review it.

/retest
Force-pushed from 5fd55a2 to 01d9f22.
…ity to be applied under overload conditions
Signed-off-by: shenmu.wy <[email protected]>
Force-pushed from 01d9f22 to 7ef6360.
@silentred: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The PR is for the proposal #20396.

Based on the previous discussion, we agreed that LeaseRevoke should have higher priority to be applied, and we remain cautious about Compact. Therefore, I've prepared a table comparing the potential impacts of applying and not applying Compact.

I would like some advice: should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.
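On the testing question, one possible starting point is a table-driven unit test around the admission decision; the names below mirror the conceptual sketch earlier in this thread rather than the PR's real identifiers:

```go
// Hypothetical table-driven unit test for the overload-bypass behavior; the
// package, function, and constant names are illustrative, not the PR's.
package overload

import "testing"

const maxGap = 5000

// admitLeaseRevoke reports whether a LeaseRevoke proposal should be admitted
// given the commit/apply gap and whether prioritization is enabled.
func admitLeaseRevoke(committed, applied uint64, prioritize bool) bool {
	if committed > applied+maxGap {
		return prioritize
	}
	return true
}

func TestLeaseRevokeAdmittedUnderOverload(t *testing.T) {
	tests := []struct {
		name               string
		committed, applied uint64
		prioritize         bool
		want               bool
	}{
		{"no overload", 1000, 900, false, true},
		{"overload, feature off", 100_000, 10_000, false, false},
		{"overload, feature on", 100_000, 10_000, true, true},
	}
	for _, tc := range tests {
		if got := admitLeaseRevoke(tc.committed, tc.applied, tc.prioritize); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```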