
Conversation

@silentred
Contributor

@silentred silentred commented Aug 15, 2025

The PR is for the proposal #20396
Based on the previous discussion, we agreed that LeaseRevoke should have higher priority to be applied, and we remain cautious about Compact. Therefore, I've prepared a table comparing the potential impacts of applying and not-applying Compact.

| Request | Effect of Apply | Effect of No Apply | Is Critical | Typical Scenario |
| --- | --- | --- | --- | --- |
| LeaseRevoke | Cleans a few keys, which is supposed to happen | Keys keep growing and may eventually crash the server. Rebooting takes a very long time because of the large db size. An upstream watcher may be OOM-killed due to writing too many resources to its cache. | YES | K8S apiserver recording events |
| Compact | Cleans obsolete KVs in the index and db, but makes applying slower | treeIndex and db size keep growing and may reach the disk quota. Requires SREs to recover the service. | TBD | Default periodic compaction |

I would like some advice: should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: silentred
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @silentred. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@silentred
Contributor Author

@ahrtr would you mind taking a look?

@silentred silentred force-pushed the stability-enhancement branch from 099461d to 3a13982 on August 22, 2025 13:08
@ahrtr
Member

ahrtr commented Aug 22, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

@silentred
Contributor Author

silentred commented Aug 23, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

It happened several times in production. In one case, K8S event keys grew to 50Mi because leases could not be revoked successfully. This patch has not been deployed to our production yet.
I've simulated slow applying with the following changes: dcc092c

```sh
# keep sending PUT requests with a lease
bin/benchmark put --endpoints=http://xxx:3379 --clients=200 --conns=50 --rate=2000 --total=10000000 --key-space-size=10000000 --lease-reuse
```

This results in leases living longer than they should.
[screenshot]

If we patch this PR, then everything works fine.

@ahrtr
Member

ahrtr commented Aug 23, 2025

Thanks for the feedback. Please resolve the review comments.

@serathius
Member

In one case, K8S event keys grew to 50Mi because leases could not be revoked successfully. This patch has not been deployed to our production yet.

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M keys. Have you verified that this change would really fix your production issue? It might move the needle, but we are talking here about a 25x improvement.

@silentred
Contributor Author

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M keys.

etcdserver was not working properly in that situation. The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

Have you verified that this change would really fix your production issue?

This patch has not been deployed to production; the issue is not common, maybe once every half year. But I think my simulation test above verifies that this patch works.

@silentred silentred force-pushed the stability-enhancement branch from 3a13982 to 4a79972 on August 24, 2025 07:32
@serathius
Member

The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here is my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

But I think my simulation test above verifies that this patch works.

What test?

@silentred
Contributor Author

The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here is my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

But I think my simulation test above verifies that this patch works.

What test?

Thanks, I will check it.
Please take a look at this comment #20492 (comment)

@ahrtr
Member

ahrtr commented Sep 28, 2025

/ok-to-test

@ahrtr
Member

ahrtr commented Sep 28, 2025

The PR looks good.

The only comment is that there isn't a real performance comparison yet.

@silentred
Contributor Author

The PR looks good.

The only comment is that there isn't a real performance comparison yet.

I am planning to add a FeatureGate for this. I will also run a benchmark comparison against the master branch and post it here. Thanks.

@codecov

codecov bot commented Sep 28, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.15%. Comparing base (8a4955b) to head (7ef6360).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/etcdserver/v3_server.go | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| server/etcdserver/util.go | 100.00% <100.00%> (ø) |
| server/features/etcd_features.go | 60.00% <ø> (ø) |
| server/etcdserver/v3_server.go | 74.66% <0.00%> (-0.90%) ⬇️ |

... and 24 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #20492      +/-   ##
==========================================
+ Coverage   69.12%   69.15%   +0.02%
==========================================
  Files         422      422
  Lines       34826    34835       +9
==========================================
+ Hits        24073    24089      +16
+ Misses       9352     9346       -6
+ Partials     1401     1400       -1
```

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a4955b...7ef6360. Read the comment docs.


@serathius
Member

serathius commented Sep 29, 2025

Did some basic testing and I'm not sure this addresses the main problem. Simple scenario: grant leases with a TTL of 5 seconds for 1 minute and see if they are properly revoked. The results:

  • Above 100 QPS of lease grants, etcd is unable to keep up with revoking the leases.
  • Only after passing 1000 QPS of lease grants does etcd start returning too many requests errors.
  • Prioritizing LeaseRevoke improves the revoke rate by 20-40%, but only after passing 1000 QPS.

While I think prioritizing LeaseRevoke helps somewhat, I'm not sure it's worth adding a special case that doesn't address the underlying issue: the lease revocation rate is too slow.

| Scenario | QPS Sent | Leases revoked | Improvement | Max active leases | Handled QPS | Leases granted | Too many requests count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 2000 | 18.22% | | 95687 | 1703.3747 | 106184 | 6499 |
| Base | 1000 | 24.84% | | 46674 | 887.155 | 55902 | 1980 |
| Base | 500 | 41.43% | | 20072 | 465.3443 | 30000 | 0 |
| Base | 250 | 73.63% | | 5205 | 241.6778 | 15000 | 0 |
| Base | 100 | 99.15% | | 551 | 100.0116 | 6000 | 0 |
| WithPrioritizeLeaseRevoke | 2000 | 25.50% | 40% | 54945 | 990.5109 | 66328 | 48786 |
| WithPrioritizeLeaseRevoke | 1000 | 29.79% | 20% | 38117 | 737.9423 | 48527 | 9848 |
| WithPrioritizeLeaseRevoke | 500 | 41.58% | 0% | 20025 | 462.8422 | 30000 | 0 |
| WithPrioritizeLeaseRevoke | 250 | 73.45% | 0% | 5233 | 241.5084 | 15000 | 0 |
| WithPrioritizeLeaseRevoke | 100 | 99.17% | 0% | 550 | 100.0115 | 6000 | 0 |

Code

```go
// Snippet from a benchmark subcommand; mustCreateClients, totalClients,
// totalConns, wg, and newReport are package-level helpers of the benchmark tool.
func leaseGrantFunc(cmd *cobra.Command, _ []string) {
	clients := mustCreateClients(totalClients, totalConns)

	r := newReport(cmd.Name())
	qps := 100
	takeN := int(math.Ceil(float64(qps) / 500))
	limiter := rate.NewLimiter(rate.Limit(qps), takeN)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	for i := range totalClients {
		c := clients[i]
		wg.Go(func() {
			for {
				select {
				case <-ctx.Done():
					return
				default:
				}
				err := limiter.WaitN(ctx, takeN)
				if err != nil {
					return
				}
				wg.Go(func() {
					for range takeN {
						select {
						case <-ctx.Done():
							return
						default:
						}
						start := time.Now()
						// Declare err locally so concurrent goroutines do not race
						// on the outer err variable.
						_, err := c.Grant(context.Background(), 5)
						end := time.Now()
						if err != nil {
							fmt.Printf("%v\n", err)
						}
						r.Results() <- report.Result{Start: start, End: end, Err: err}
					}
				})
			}
		})
	}

	rc := r.Run()
	wg.Wait()
	close(r.Results())
	fmt.Printf("%s", <-rc)
}
```

@silentred
Contributor Author

Hi @serathius, thanks for your feedback. I think you are addressing a different problem, and your test actually shows this PR having a positive effect in some ways. Let me explain.

Q: Why is etcd unable to keep up with revoking the leases above 100 QPS of lease grants?
A: etcdserver limits the parallelism of expired-lease revocation with maxPendingRevokes=16. 100 QPS of LeaseRevoke is enough for the lease creation rate in most use cases. K8S reuses one lease for all events created within one minute, so its rate is 1 per minute. Anything above 100 QPS of LeaseRevoke should be considered a misuse of etcdserver, from my perspective.
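
Roughly, that throttling is a semaphore-bounded revoke loop. Below is a simplified sketch of the pattern, not the exact etcd code; the function name and the revoke callback are illustrative.

```go
package sketch

import (
	"context"
	"log"
	"sync"
)

// maxPendingRevokes caps how many lease-revoke requests may be in flight at once.
const maxPendingRevokes = 16

// revokeExpired issues a revoke call for every expired lease ID, but never more
// than maxPendingRevokes concurrently; the buffered channel acts as a semaphore.
func revokeExpired(ctx context.Context, revoke func(id int64) error, expired []int64) {
	sem := make(chan struct{}, maxPendingRevokes)
	var wg sync.WaitGroup
	for _, id := range expired {
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done():
			return
		}
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := revoke(id); err != nil {
				log.Printf("failed to revoke lease %016x: %v", id, err)
			}
		}(id)
	}
	wg.Wait()
}
```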

Q: Why is there no "too many requests" error under 500 QPS?
A: Because maxGapBetweenApplyAndCommitIndex is playing its role. Lease requests are waiting to be applied, while the number of un-applied log entries has not reached maxGapBetweenApplyAndCommitIndex.

Q: Why only after passing 1000 QPS?
A: Because this PR only aims to ease the problem where flooding "too many requests" errors reduces the probability of LeaseRevoke being applied. The "too many requests" error is a necessary condition.

Q: Why is this a different problem?
A: You mentioned the problem is that the lease revocation rate is too slow when LeaseGrant is above 100 QPS, whereas in my case the grant rate is about 1 per minute. In my case, the problem is flooding Txn requests causing "too many requests" errors, which drop LeaseRevoke because of maxGapBetweenApplyAndCommitIndex.
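
To make that concrete, the back-pressure check in processInternalRaftRequestOnce is essentially the following. This is a simplified sketch; in etcd the gap constant is 5000 and the returned error is ErrTooManyRequests.

```go
package sketch

import "errors"

// maxGapBetweenApplyAndCommitIndex mirrors etcd's back-pressure threshold (5000).
const maxGapBetweenApplyAndCommitIndex = 5000

var errTooManyRequests = errors.New("etcdserver: too many requests")

// checkRequestLimit rejects a proposal when the committed index (ci) has run
// ahead of the applied index (ai) by more than the allowed gap. When Txn traffic
// keeps the gap saturated, a LeaseRevoke arriving at that moment is rejected
// just like any other request, which is the drop described above.
func checkRequestLimit(ai, ci uint64) error {
	if ci > ai+maxGapBetweenApplyAndCommitIndex {
		return errTooManyRequests
	}
	return nil
}
```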

@serathius
Member

@silentred Makes sense. Could you propose a benchmark that would better match the setup you are targeting? I think Kubernetes is allocating up to 5k keys per lease. Would a mix of lease and TXN requests in that proportion work?

@silentred
Contributor Author

silentred commented Oct 14, 2025

I have some new findings from benchmark tests. I added the following code before the `if exceedsRequestLimit(ai, ci, &r, true)` check:

```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*applyResult, error) {
	// to check the value of "ci - ai"
	if isPriorityRequest(&r) {
		s.lg.Info("call revoke lease",
			zap.Uint64("ci - ai", ci-ai),
			zap.Bool("exceed", exceedsRequestLimit(ai, ci, &r, s.FeatureEnabled(features.PriorityRequest))),
		)
	}
	// ...
}
```
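
With the priority gate on, exceedsRequestLimit relaxes the gap for a priority request. Roughly, the idea is the following simplified sketch, not the exact diff:

```go
package sketch

// maxGapBetweenApplyAndCommitIndex is etcd's back-pressure threshold (5000).
const maxGapBetweenApplyAndCommitIndex = 5000

// exceedsRequestLimitSketch shows the idea: an ordinary request is rejected once
// the committed index runs ahead of the applied index by more than the gap,
// while a priority request (e.g. LeaseRevoke) gets extra headroom (factor
// 110/100 here) so it can still be proposed while ordinary traffic is rejected.
func exceedsRequestLimitSketch(ai, ci uint64, priority bool) bool {
	limit := uint64(maxGapBetweenApplyAndCommitIndex)
	if priority {
		limit = limit * 110 / 100
	}
	return ci > ai+limit
}
```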

The value of ci - ai can be bigger than 10000 under high-QPS conditions, and it is rarely in the range (5000, 5500]. So the factor in `ci <= ai+maxGapBetweenApplyAndCommitIndex*110/100` matters a lot. The following benchmark simulates a K8S event workload with lease TTL=15s and reuse duration=1s, which means the test allocates one lease per second.

benchmark PR: #20819

The result compares LeaseRevoke behavior under different QPS limits over 1 minute.

  • Allocated Lease: the total number of leases allocated in 1 minute.
  • Left Lease: the number of leases remaining at the end of the test, fetched via the LeaseLeases API.
  • Invalid Lease: the number of leases whose TTL is less than 0, fetched via the LeaseTimeToLive API (see the sketch after the results table).
  • Revoke Failure: the number of etcdserver log lines containing "failed to revoke lease", which is logged when LeaseRevoke fails.

Fewer Invalid Leases is better. The result shows an obvious improvement when factor=2.0.

| Branch | Priority Gate | Factor | QPS Limit | Handled QPS | Allocated Lease | Left Lease | Invalid Lease | Revoke Failure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| master | off | | 7500 | 7397 | 60 | 15 | 0 | 0 |
| | off | | 8500 | 6849 | 60 | 18 | 3 | 3 |
| | off | | 10000 | 5393 | 52 | 15 | 4 | 10 |
| | off | | 12000 | 5229 | 45 | 20 | 10 | 14 |
| master | on | 1.1 | 7500 | 7345 | 60 | 15 | 0 | 0 |
| | on | 1.1 | 8500 | 6778 | 60 | 15 | 0 | 0 |
| | on | 1.1 | 10000 | 6239 | 58 | 21 | 7 | 7 |
| | on | 1.1 | 12000 | 5725 | 49 | 21 | 13 | 19 |
| master | on | 2.0 | 8500 | 5940 | 58 | 15 | 0 | 0 |
| | on | 2.0 | 10000 | 6246 | 59 | 15 | 0 | 0 |
| | on | 2.0 | 12000 | 5977 | 52 | 14 | 4 | 4 |
| release-3.4 | on | 1.1 | 10000 | 2954 | 37 | 16 | 12 | 12 |
| | on | 2.0 | 10000 | 2991 | 38 | 15 | 2 | 2 |
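
For reference, the Left Lease and Invalid Lease numbers above were collected via the LeaseLeases and LeaseTimeToLive APIs. A minimal sketch of that check with clientv3 follows; it is not the exact benchmark code, and the endpoint is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point it at the cluster under test.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Left Lease: every lease the server still tracks at the end of the run.
	resp, err := cli.Leases(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Invalid Lease: leases whose reported TTL is negative, i.e. they have
	// expired but were never successfully revoked.
	invalid := 0
	for _, ls := range resp.Leases {
		ttlResp, err := cli.TimeToLive(ctx, ls.ID)
		if err != nil {
			log.Fatal(err)
		}
		if ttlResp.TTL < 0 {
			invalid++
		}
	}
	fmt.Printf("left leases: %d, invalid leases: %d\n", len(resp.Leases), invalid)
}
```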

@silentred silentred force-pushed the stability-enhancement branch 3 times, most recently from 771a3b3 to 5fd55a2 on October 18, 2025 01:05
@serathius
Member

Please send a PR to add the benchmark so we can review it.

@silentred
Contributor Author

/retest

…ity to be applied under overload conditions

Signed-off-by: shenmu.wy <[email protected]>
@k8s-ci-robot

k8s-ci-robot commented Nov 1, 2025

@silentred: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-etcd-coverage-report | 7ef6360 | link | true | /test pull-etcd-coverage-report |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
