
Conversation

@silentred
Contributor

@silentred silentred commented Aug 15, 2025

The PR is for the proposal #20396
Based on the previous discussion, we agreed that LeaseRevoke should have higher priority to be applied, and we remain cautious about Compact. Therefore, I've prepared a table comparing the potential impacts of applying and not-applying Compact.

| Request | Effect of Apply | Effect of No Apply | Is Critical | Typical Scenario |
| --- | --- | --- | --- | --- |
| LeaseRevoke | Cleans a few keys, which is supposed to happen | Keys keep growing and may eventually crash the server. Rebooting takes a very long time because of the large db size. An upstream watcher may be OOM-killed due to writing too many resources to its cache. | YES | K8S apiserver recording events |
| Compact | Cleans obsolete KVs in the index and db, but makes applying slower | treeIndex and db size keep growing and may reach the disk quota. Requires SREs to recover the service. | TBD | Default periodic compaction |

I would like some advice: should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: silentred
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @silentred. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@silentred
Contributor Author

@ahrtr would you mind taking a look?

@silentred silentred force-pushed the stability-enhancement branch from 099461d to 3a13982 on August 22, 2025 13:08
@ahrtr
Member

ahrtr commented Aug 22, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

@silentred
Contributor Author

silentred commented Aug 23, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

It happened several times in production. In one case, K8S event keys grew to 50Mi because leases could not be revoked successfully. This patch has not been deployed to our production yet.
I've simulated slow applying with the following changes: dcc092c

```sh
# keep sending PUT requests with a lease
bin/benchmark put --endpoints=http://xxx:3379 --clients=200 --conns=50 --rate=2000 --total=10000000 --key-space-size=10000000 --lease-reuse
```

This results in leases living longer than they should.
[screenshot]

If we patch this PR, then everything works fine.

@ahrtr
Member

ahrtr commented Aug 23, 2025

Thanks for the feedback. Please resolve the review comments.

@serathius
Member

In one case, K8S event keys grew to 50Mi because leases could not be revoked successfully. This patch has not been deployed to our production yet.

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M keys. Have you verified that this change would really fix your production issue? It might move the needle, but we are talking here about a 25x improvement.

@silentred
Contributor Author

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M keys.

etcdserver was not working properly in that situation. The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

Have you verified that this change would really fix your production issue?

This patch has not been deployed to production; the issue is not common, maybe once every half year. But I think my simulation test above verifies that this patch works.

@silentred silentred force-pushed the stability-enhancement branch from 3a13982 to 4a79972 on August 24, 2025 07:32
@serathius
Member

The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here is my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

But I think my simulation test above verifies that this patch works.

What test?

@silentred
Contributor Author

The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here is my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

But I think my simulation test above verifies that this patch works.

What test?

Thanks, I will check it.
Please take a look at this comment #20492 (comment)

@ahrtr
Member

ahrtr commented Sep 28, 2025

/ok-to-test

@ahrtr
Member

ahrtr commented Sep 28, 2025

The PR looks good.

The only comment is that there isn't a real performance comparison yet.

@silentred
Contributor Author

The PR looks good.

The only comment is that there isn't a real performance comparison yet.

I am planning to add a FeatureGate for this. I will also run a benchmark comparison against the master branch and post it here. Thanks.

@codecov

codecov bot commented Sep 28, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.15%. Comparing base (8a4955b) to head (7ef6360).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/etcdserver/v3_server.go | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| server/etcdserver/util.go | 100.00% <100.00%> (ø) |
| server/features/etcd_features.go | 60.00% <ø> (ø) |
| server/etcdserver/v3_server.go | 74.66% <0.00%> (-0.90%) ⬇️ |

... and 24 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #20492      +/-   ##
==========================================
+ Coverage   69.12%   69.15%   +0.02%
==========================================
  Files         422      422
  Lines       34826    34835       +9
==========================================
+ Hits        24073    24089      +16
+ Misses       9352     9346       -6
+ Partials     1401     1400       -1
```

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a4955b...7ef6360. Read the comment docs.


@serathius
Member

serathius commented Sep 29, 2025

Did some basic testing and I'm not sure this addresses the main problem. Simple scenario: grant leases with a TTL of 5 seconds for 1 minute and see if they are properly revoked. The results:

  • Above 100 QPS of lease grants, etcd is unable to keep up with revoking the leases.
  • Only after passing 1000 QPS of lease grants does etcd start returning too many requests errors.
  • Prioritizing LeaseRevoke improves the revoke rate by 20-40%, but only after passing 1000 QPS.

While I think prioritizing LeaseRevoke helps somewhat, I'm not sure it's worth adding a special case that doesn't address the underlying issue: the lease revocation rate is too slow.

| Scenario | QPS Sent | Leases revoked | Improvement | Max active leases | Handled QPS | Leases granted | Too many requests count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 2000 | 18.22% | | 95687 | 1703.3747 | 106184 | 6499 |
| Base | 1000 | 24.84% | | 46674 | 887.155 | 55902 | 1980 |
| Base | 500 | 41.43% | | 20072 | 465.3443 | 30000 | 0 |
| Base | 250 | 73.63% | | 5205 | 241.6778 | 15000 | 0 |
| Base | 100 | 99.15% | | 551 | 100.0116 | 6000 | 0 |
| WithPrioritizeLeaseRevoke | 2000 | 25.50% | 40% | 54945 | 990.5109 | 66328 | 48786 |
| WithPrioritizeLeaseRevoke | 1000 | 29.79% | 20% | 38117 | 737.9423 | 48527 | 9848 |
| WithPrioritizeLeaseRevoke | 500 | 41.58% | 0% | 20025 | 462.8422 | 30000 | 0 |
| WithPrioritizeLeaseRevoke | 250 | 73.45% | 0% | 5233 | 241.5084 | 15000 | 0 |
| WithPrioritizeLeaseRevoke | 100 | 99.17% | 0% | 550 | 100.0115 | 6000 | 0 |

Code

```go
// Snippet from a benchmark subcommand; mustCreateClients, totalClients,
// totalConns, wg, and newReport are package-level helpers of the benchmark tool.
func leaseGrantFunc(cmd *cobra.Command, _ []string) {
	clients := mustCreateClients(totalClients, totalConns)

	r := newReport(cmd.Name())
	qps := 100
	takeN := int(math.Ceil(float64(qps) / 500))
	limiter := rate.NewLimiter(rate.Limit(qps), takeN)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	for i := range totalClients {
		c := clients[i]
		wg.Go(func() {
			for {
				select {
				case <-ctx.Done():
					return
				default:
				}
				err := limiter.WaitN(ctx, takeN)
				if err != nil {
					return
				}
				wg.Go(func() {
					for range takeN {
						select {
						case <-ctx.Done():
							return
						default:
						}
						start := time.Now()
						// Declare err locally so concurrent goroutines do not race
						// on the outer err variable.
						_, err := c.Grant(context.Background(), 5)
						end := time.Now()
						if err != nil {
							fmt.Printf("%v\n", err)
						}
						r.Results() <- report.Result{Start: start, End: end, Err: err}
					}
				})
			}
		})
	}

	rc := r.Run()
	wg.Wait()
	close(r.Results())
	fmt.Printf("%s", <-rc)
}
```

@silentred
Contributor Author

Hi @serathius, thanks for your feedback. I think you are addressing a different problem, and your test actually shows this PR having a positive effect in some ways. Let me explain.

Q: Why is etcd unable to keep up with revoking the leases above 100 QPS of lease grants?
A: etcdserver limits the parallelism of expired-lease revocation with maxPendingRevokes=16. 100 QPS of LeaseRevoke is enough for the lease creation rate in most use cases. K8S reuses one lease for all events created within one minute, so its rate is 1 per minute. Anything above 100 QPS of LeaseRevoke should be considered a misuse of etcdserver, from my perspective.
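
Roughly, that throttling is a semaphore-bounded revoke loop. Below is a simplified sketch of the pattern, not the exact etcd code; the function name and the revoke callback are illustrative.

```go
package sketch

import (
	"context"
	"log"
	"sync"
)

// maxPendingRevokes caps how many lease-revoke requests may be in flight at once.
const maxPendingRevokes = 16

// revokeExpired issues a revoke call for every expired lease ID, but never more
// than maxPendingRevokes concurrently; the buffered channel acts as a semaphore.
func revokeExpired(ctx context.Context, revoke func(id int64) error, expired []int64) {
	sem := make(chan struct{}, maxPendingRevokes)
	var wg sync.WaitGroup
	for _, id := range expired {
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done():
			return
		}
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := revoke(id); err != nil {
				log.Printf("failed to revoke lease %016x: %v", id, err)
			}
		}(id)
	}
	wg.Wait()
}
```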

Q: Why is there no "too many requests" error under 500 QPS?
A: Because maxGapBetweenApplyAndCommitIndex is playing its role. Lease requests are waiting to be applied, while the number of un-applied log entries has not reached maxGapBetweenApplyAndCommitIndex.

Q: Why only after passing 1000 QPS?
A: Because this PR only aims to ease the problem where flooding "too many requests" errors reduces the probability of LeaseRevoke being applied. The "too many requests" error is a necessary condition.

Q: Why is this a different problem?
A: You mentioned the problem is that the lease revocation rate is too slow when LeaseGrant is above 100 QPS, whereas in my case the grant rate is about 1 per minute. In my case, the problem is flooding Txn requests causing "too many requests" errors, which drop LeaseRevoke because of maxGapBetweenApplyAndCommitIndex.
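
To make that concrete, the back-pressure check in processInternalRaftRequestOnce is essentially the following. This is a simplified sketch; in etcd the gap constant is 5000 and the returned error is ErrTooManyRequests.

```go
package sketch

import "errors"

// maxGapBetweenApplyAndCommitIndex mirrors etcd's back-pressure threshold (5000).
const maxGapBetweenApplyAndCommitIndex = 5000

var errTooManyRequests = errors.New("etcdserver: too many requests")

// checkRequestLimit rejects a proposal when the committed index (ci) has run
// ahead of the applied index (ai) by more than the allowed gap. When Txn traffic
// keeps the gap saturated, a LeaseRevoke arriving at that moment is rejected
// just like any other request, which is the drop described above.
func checkRequestLimit(ai, ci uint64) error {
	if ci > ai+maxGapBetweenApplyAndCommitIndex {
		return errTooManyRequests
	}
	return nil
}
```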

@serathius
Member

@silentred Makes sense. Could you propose a benchmark that would better match the setup you are targeting? I think Kubernetes is allocating up to 5k keys per lease. Would a mix of lease and TXN requests in that proportion work?

@silentred
Contributor Author

silentred commented Oct 14, 2025

I have some new findings from benchmark tests. I added the following code before the `if exceedsRequestLimit(ai, ci, &r, true)` check:

```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*applyResult, error) {
	// to check the value of "ci - ai"
	if isPriorityRequest(&r) {
		s.lg.Info("call revoke lease",
			zap.Uint64("ci - ai", ci-ai),
			zap.Bool("exceed", exceedsRequestLimit(ai, ci, &r, s.FeatureEnabled(features.PriorityRequest))),
		)
	}
	// ...
}
```
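
With the priority gate on, exceedsRequestLimit relaxes the gap for a priority request. Roughly, the idea is the following simplified sketch, not the exact diff:

```go
package sketch

// maxGapBetweenApplyAndCommitIndex is etcd's back-pressure threshold (5000).
const maxGapBetweenApplyAndCommitIndex = 5000

// exceedsRequestLimitSketch shows the idea: an ordinary request is rejected once
// the committed index runs ahead of the applied index by more than the gap,
// while a priority request (e.g. LeaseRevoke) gets extra headroom (factor
// 110/100 here) so it can still be proposed while ordinary traffic is rejected.
func exceedsRequestLimitSketch(ai, ci uint64, priority bool) bool {
	limit := uint64(maxGapBetweenApplyAndCommitIndex)
	if priority {
		limit = limit * 110 / 100
	}
	return ci > ai+limit
}
```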

The value of ci - ai can be bigger than 10000 under high-QPS conditions, and it is rarely in the range (5000, 5500]. So the factor in `ci <= ai+maxGapBetweenApplyAndCommitIndex*110/100` matters a lot. The following benchmark simulates a K8S event workload with lease TTL=15s and reuse duration=1s, which means the test allocates one lease per second.

benchmark PR: #20819

The result compares LeaseRevoke behavior under different QPS limits over 1 minute.

  • Allocated Lease: the total number of leases allocated in 1 minute.
  • Left Lease: the number of leases remaining at the end of the test, fetched via the LeaseLeases API.
  • Invalid Lease: the number of leases whose TTL is less than 0, fetched via the LeaseTimeToLive API (see the sketch after the results table).
  • Revoke Failure: the number of etcdserver log lines containing "failed to revoke lease", which is logged when LeaseRevoke fails.

Fewer Invalid Leases is better. The result shows an obvious improvement when factor=2.0.

| Branch | Priority Gate | Factor | QPS Limit | Handled QPS | Allocated Lease | Left Lease | Invalid Lease | Revoke Failure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| master | off | | 7500 | 7397 | 60 | 15 | 0 | 0 |
| | off | | 8500 | 6849 | 60 | 18 | 3 | 3 |
| | off | | 10000 | 5393 | 52 | 15 | 4 | 10 |
| | off | | 12000 | 5229 | 45 | 20 | 10 | 14 |
| master | on | 1.1 | 7500 | 7345 | 60 | 15 | 0 | 0 |
| | on | 1.1 | 8500 | 6778 | 60 | 15 | 0 | 0 |
| | on | 1.1 | 10000 | 6239 | 58 | 21 | 7 | 7 |
| | on | 1.1 | 12000 | 5725 | 49 | 21 | 13 | 19 |
| master | on | 2.0 | 8500 | 5940 | 58 | 15 | 0 | 0 |
| | on | 2.0 | 10000 | 6246 | 59 | 15 | 0 | 0 |
| | on | 2.0 | 12000 | 5977 | 52 | 14 | 4 | 4 |
| release-3.4 | on | 1.1 | 10000 | 2954 | 37 | 16 | 12 | 12 |
| | on | 2.0 | 10000 | 2991 | 38 | 15 | 2 | 2 |
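
For reference, the Left Lease and Invalid Lease numbers above were collected via the LeaseLeases and LeaseTimeToLive APIs. A minimal sketch of that check with clientv3 follows; it is not the exact benchmark code, and the endpoint is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point it at the cluster under test.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Left Lease: every lease the server still tracks at the end of the run.
	resp, err := cli.Leases(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Invalid Lease: leases whose reported TTL is negative, i.e. they have
	// expired but were never successfully revoked.
	invalid := 0
	for _, ls := range resp.Leases {
		ttlResp, err := cli.TimeToLive(ctx, ls.ID)
		if err != nil {
			log.Fatal(err)
		}
		if ttlResp.TTL < 0 {
			invalid++
		}
	}
	fmt.Printf("left leases: %d, invalid leases: %d\n", len(resp.Leases), invalid)
}
```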

@silentred silentred force-pushed the stability-enhancement branch 3 times, most recently from 771a3b3 to 5fd55a2 on October 18, 2025 01:05
@serathius
Member

Please send a PR to add the benchmark so we can review it.

@silentred
Contributor Author

/retest

…ity to be applied under overload conditions

Signed-off-by: shenmu.wy <[email protected]>
@k8s-ci-robot

k8s-ci-robot commented Nov 1, 2025

@silentred: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-etcd-coverage-report | 7ef6360 | link | true | /test pull-etcd-coverage-report |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
