Conversation

@tbg
Member

@tbg tbg commented Nov 28, 2025

Bring replica rebalancing in line with the candidate filtering strategy
documented in rebalanceStore. Both lease transfers and replica transfers
now map relatively cleanly to the pre-means/post-means comment
on rebalanceStore. I also made both lease and replica transfers
principled about which of the existing replicas they should ignore
the disposition for (existing leaseholder for lease transfers, all
replicas for replica transfers).
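
For orientation, here is a minimal Go sketch of that pre-means/post-means split. Everything below is illustrative: the types and helper are made up, and only the postMeansExclusions name comes from this PR (its map type here is an assumption).

package sketch

// Illustrative sketch only, not the mmaprototype API.
type storeID int

type candidate struct {
	id    storeID
	load  float64 // e.g. CPU
	ready bool    // disposition/health allows receiving a new replica
}

// candidatesAndMean keeps a store out of the mean entirely when it is not a
// viable target at all (pre-means filtering), and keeps it in the mean but
// out of the returned candidates when it is viable in principle yet not a
// target for this specific transfer (post-means exclusion), e.g. its node
// already holds a replica or it is the shedding store itself. Downstream
// selection then compares each candidate's load against the returned mean.
func candidatesAndMean(all []candidate, postMeansExclusions map[storeID]bool) ([]candidate, float64) {
	var inMean []candidate
	var sum float64
	for _, c := range all {
		if c.ready {
			inMean = append(inMean, c)
			sum += c.load
		}
	}
	if len(inMean) == 0 {
		return nil, 0
	}
	mean := sum / float64(len(inMean))

	var cands []candidate
	for _, c := range inMean {
		if !postMeansExclusions[c.id] {
			cands = append(cands, c)
		}
	}
	return cands, mean
}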

Other changes:

  • Rename computeCandidatesForRange to computeCandidatesForReplicaTransfer
  • Consolidate exclusions into single postMeansExclusions parameter
  • Move candidate logging into computeCandidatesForReplicaTransfer
  • Add unit tests for retainReadyReplicaTargetStoresOnly
  • Assorted comment improvements and small structural improvements

Part of #156776.
^-- need to plumb the mmaprototype.Status inputs to mmaAllocator still, both in asim and production code.

Epic: CRDB-55052

@tbg tbg requested review from a team as code owners November 28, 2025 17:03
@tbg tbg marked this pull request as draft November 28, 2025 17:03
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg changed the title from "mmaprototype: hoist tracing setup in cluster_state_test" to "mmaprototype: align replica transfer filtering with documented strategy" on Dec 3, 2025
@tbg tbg force-pushed the store-replica-transfers-cleanup branch 3 times, most recently from 39bd8b5 to 8fdabac on December 5, 2025 13:47
@tbg tbg requested review from sumeerbhola and wenyihu6 December 5, 2025 13:49
@tbg tbg marked this pull request as ready for review December 5, 2025 13:49
@tbg
Member Author

tbg commented Dec 5, 2025

@wenyihu6 @sumeerbhola this is ready for a first look. It is still a bit removed from being mergeable, but you should be able to give it a high-level overview to see if the general decisions seem sound. The comments should be in pretty good shape, though through several reworkings, some stale bits may remain that I haven't caught yet - if something looks off, this is likely it and please just point it out.

@wenyihu6 ideally you can focus on the details and the exposition - does everything make sense, which parts could be motivated better, et cetera.
@sumeerbhola I left a number of comments on things that I can't rationalize yet. Appreciate your input on those. And of course a high-level look at everything to make sure I'm not wandering off track.

Member Author

@tbg tbg left a comment

Quick self-review so reviewers are aware of a few things.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola, @tbg, and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/testdata/cluster_state/rebalance_stores_cpu_replica_lateral.txt line 1 at r4 (raw file):

# This test verifies lateral (within-node) replica transfers are allowed.

I haven't looked at these tests since an earlier version tbh. I don't think they really changed but I need to double check that I have coverage for everything.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 63 at r4 (raw file):

	now time.Time
	// Scratch variables reused across iterations.
	// TODO(tbg): these are a potential source of errors (imagine two nested

Not for here but I want to do this soon.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 647 at r4 (raw file):

		// leaseholder as ill-disposed for the lease and (pre-means) filter them out.
		//
		// TODO(tbg): we filter out the current leaseholder and then add it

@sumeerbhola


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2207 at r4 (raw file):

// considered.
//
// TODO(tbg): understand and explain why the node load summary is in the mix here.

I kinda did this, though I have a question for Sumeer down below. Leaving this comment to remember to delete this line once resolved.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2241 at r4 (raw file):

	// and target that would result from moving the lease.
	//
	// TODO(tbg): extract this into a helper and set it up so that it doesn't

Not for here.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2282 at r4 (raw file):

	}

	// We define targetSummary as a "worst" of the considered load dimesions

typo: dimesions


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2390 at r4 (raw file):

		// the target's by some other local store, we reject the change.
		//
		// TODO(tbg): While the example illustrates that "something had to be

@sumeerbhola


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 518 at r4 (raw file):

	slices.SortFunc(cands.candidates, func(a, b candidateInfo) int {
		if diversityScoresAlmostEqual(a.diversityScore, b.diversityScore) {
			// TODO(tbg): this sorts by sls, then leasePreferenceIndex. Not sure this is right.

@sumeerbhola


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 874 at r4 (raw file):

	// Determine which means to use.
	//
	// TODO(tbg): unit testing.

Will do.

Member Author

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 670 at r4 (raw file):

		// leaseholder is always going to include itself in the mean, even if it
		// is ill-disposed towards leases.
		candsPL = retainReadyLeaseTargetStoresOnly(ctx, candsPL, re.stores, rangeID, store.StoreID)

Need to remove https://github.com/cockroachdb/cockroach/pull/158874/files#r2593251120 after rebasing.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 672 at r4 (raw file):

		candsPL = retainReadyLeaseTargetStoresOnly(ctx, candsPL, re.stores, rangeID, store.StoreID)

		clear(re.scratch.nodes)

Need to re-run entire asim test suite to check for diffs.

Collaborator

@sumeerbhola sumeerbhola left a comment

@sumeerbhola reviewed 1 of 11 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 647 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

@sumeerbhola

This was just how the code happened to be written, since we need the list without the current leaseholder in the later loop that does

for _, cand := range cands ...

and the cands type is not the same as storeSet so we have to do the work to construct the storeSet anyway.


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 521 at r4 (raw file):

			// Example:
			// - s1 is loadNormal and has the best lease preference.
			// - s2 is loadLow and has a worse lease preference.

I wasn't aware that we bias more heavily towards lease preferences in SMA.

If we want to sort by lease preference first, we'll need to have some logic to relax that as time elapses and we fail to transfer the lease, akin to how ignoreLevel is used for CPU.
One concern I have about that is that we sometimes want to make lease transfers quickly, due to a store being transiently unhealthy. Trying to search for a perfect lease preference candidate can delay that. One could have some urgency parameter. Picking the lowest load is best when urgency is high.


candidatesToMoveLease has this comment, which is slightly related

// Should this return the set of candidates which satisfy the first lease
// preference. If none do, the second lease preference, and so on. This
// implies a tighter condition than the current one for candidate selection.
// See existing allocator candidate selection:
// https://github.com/sumeerbhola/cockroach/blob/c4c1dcdeda2c0f38c38270e28535f2139a077ec7/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L2980-L2980
// This may be unnecessarily strict and not essential, since it seems many
// users only set one lease preference.

pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 528 at r4 (raw file):

			// candidate list we are operating on is already filtered down to
			// include only candidates that don't make the lease preference
			// worse. Still, it seems like we ought to be flipping the order

nit: ... worse than the current leaseholder


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 853 at r4 (raw file):

// mean (they're viable locations in principle) but they're not candidates for
// this specific transfer (the classic case: already have a replica).
func (cs *clusterState) computeCandidatesForReplicaTransfer(

this method was originally written to handle the case for constraint satisfaction too, hence the constraintsDisj and optional loadSheddingStore. If this is being specialized narrowly to replica transfer now, please add a TODO to keep in mind generalizing this later. One of the goals in my mind was not to have duplicated code that has small variations of similar logic.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2392 at r4 (raw file):

		// TODO(tbg): While the example illustrates that "something had to be
		// done", I don't understand why it makes sense to solve this exactly
		// as it was done. The node level summary is based on node-wide CPU

Regarding why this was done exactly this way, it was simple.
I don't remember there being multi-store asim tests, so we can revisit after we add them.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2395 at r4 (raw file):

		// utilization as well as its distance from the mean (across the
		// candidate set). Store summaries a) reflect the worst dimension, and
		// b) on the CPU dimension are based on the store-apportioned capacity.

Is there a claim or question here? The comment ends abruptly.

Member Author

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 521 at r4 (raw file):
The comment is extremely related. I think that if we changed candidatesToMoveLease to only return the candidates from the "best" lease preference class then we essentially get the old allocator's behavior back (and the ordering here doesn't matter, since we only have a single lease preference class). I softened the comment somewhat - I don't know that one is better than the other.

I'm not sure I understand your comment about urgency being a factor if we flipped the order. Doesn't the mma consider everyone in the set of candidates? So if we sorted here by lease preference first, it would only heuristically mean that those candidates are considered first. It wouldn't prevent transfers from happening to less-than-optimal preferences in general?

Let me know if that updated comment sounds correct.

Attaching a Cursor artifact just in case it's helpful.


Old Allocator: Lease Preferences vs Load-Based Decisions

The old allocator prioritizes lease preferences over load-based decisions. This is relevant to the mmaprototype TODO at pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go:517.

Summary

When transferring leases, the old allocator:

  1. First checks if the current leaseholder violates preferences
  2. Filters candidates down to only those matching the best lease preference
  3. Only then applies load-based ranking within that filtered set

This means a store with a worse lease preference is never considered for load balancing if any store with a better preference exists.

Key Code Paths

1. ShouldTransferLease checks preferences first

Lease preferences are evaluated before any load considerations. If violated, returns immediately:

// pkg/kv/kvserver/allocator/allocatorimpl/allocator.go:2852-2855
excludeReplsInNeedOfSnap := a.knobs == nil || !a.knobs.AllowLeaseTransfersToReplicasNeedingSnapshots
if a.LeaseholderShouldMoveDueToPreferences(ctx, storePool, conf, leaseRepl, existing, excludeReplsInNeedOfSnap) {
    return TransferLeaseForPreferences
}

2. ValidLeaseTargets filters to preferred replicas only

If any replica satisfies a lease preference, the candidate set is reduced to only those replicas:

// pkg/kv/kvserver/allocator/allocatorimpl/allocator.go:2259-2264
// Determine which store(s) is preferred based on user-specified preferences.
// If any stores match, only consider those stores as candidates.
preferred := a.PreferredLeaseholders(storePool, conf, candidates)
if len(preferred) > 0 {
    candidates = preferred
}

3. PreferredLeaseholders returns only the highest-priority preference class

Preferences are ordered by priority. As soon as any replicas match a preference, later preferences are ignored:

// pkg/kv/kvserver/allocator/allocatorimpl/allocator.go:3182-3203
func (a Allocator) PreferredLeaseholders(
    storePool storepool.AllocatorStorePool,
    conf *roachpb.SpanConfig,
    existing []roachpb.ReplicaDescriptor,
) []roachpb.ReplicaDescriptor {
    // Go one preference at a time. As soon as we've found replicas that match a
    // preference, we don't need to look at the later preferences, because
    // they're meant to be ordered by priority.
    for _, preference := range conf.LeasePreferences {
        var preferred []roachpb.ReplicaDescriptor
        for _, repl := range existing {
            storeDesc, ok := storePool.GetStoreDescriptor(repl.StoreID)
            if !ok {
                continue
            }
            if constraint.CheckStoreConjunction(storeDesc, preference.Constraints) {
                preferred = append(preferred, repl)
            }
        }
        if len(preferred) > 0 {
            return preferred
        }
    }
    return nil
}

4. TransferLeaseTarget uses filtered candidates for load-based decisions

The load-based logic (FollowTheWorkload, LeaseCountConvergence, LoadConvergence) operates on validTargets, which has already been filtered by ValidLeaseTargets:

// pkg/kv/kvserver/allocator/allocatorimpl/allocator.go:2491-2500
validTargets := a.ValidLeaseTargets(ctx, storePool, desc, conf, existing, leaseRepl, opts)

// Short-circuit if there are no valid targets out there.
if len(validTargets) == 0 || (len(validTargets) == 1 && validTargets[0].StoreID == leaseRepl.StoreID()) {
    log.KvDistribution.VEventf(ctx, 2, "no lease transfer target found for r%d", leaseRepl.GetRangeID())
    return roachpb.ReplicaDescriptor{}
}

switch g := opts.Goal; g {
case allocator.FollowTheWorkload:
    // ... load-based logic operates on already-filtered validTargets

Implication for mmaprototype

The mmaprototype TODO notes:

Example:

  • s1 is loadNormal and has the best lease preference.
  • s2 is loadLow and has a worse lease preference.
    This code will order s2 before s1, so we may pick s2...

In the old allocator, this situation cannot occur: s2 would be filtered out entirely by ValidLeaseTargets before load-based scoring runs. The mmaprototype's current sls, then leasePreferenceIndex ordering is inverted relative to the old allocator's behavior.

To match the old allocator, the mmaprototype should:

  1. Filter candidates to only the best lease preference class first
  2. Then apply load-based ranking within that class
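
For reference, a minimal Go sketch of those two steps. The field and function names are made up for illustration and do not correspond to the mmaprototype types; only the general shape (filter to the best preference class, then rank by load) is being shown.

package sketch

import (
	"cmp"
	"slices"
)

type leaseCandidate struct {
	storeID              int
	leasePreferenceIndex int // 0 = matches the highest-priority preference
	load                 float64
}

// orderLeaseCandidates keeps only the best (lowest) lease preference class,
// then ranks by load within that class, least loaded first.
func orderLeaseCandidates(cands []leaseCandidate) []leaseCandidate {
	if len(cands) == 0 {
		return nil
	}
	// 1. Filter to the best lease preference class.
	best := cands[0].leasePreferenceIndex
	for _, c := range cands[1:] {
		if c.leasePreferenceIndex < best {
			best = c.leasePreferenceIndex
		}
	}
	var filtered []leaseCandidate
	for _, c := range cands {
		if c.leasePreferenceIndex == best {
			filtered = append(filtered, c)
		}
	}
	// 2. Rank by load within that class.
	slices.SortFunc(filtered, func(a, b leaseCandidate) int {
		return cmp.Compare(a.load, b.load)
	})
	return filtered
}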

pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 853 at r4 (raw file):

Previously, sumeerbhola wrote…

this method was originally written to handle the case for constraint satisfaction too, hence the constraintsDisj and optional loadSheddingStore. If this is being specialized narrowly to replica transfer now, please add a TODO to keep in mind generalizing this later. One of the goals in my mind was not to have duplicated code that has small variations of similar logic.

Yes, I figured some of this code had been written with generalizations in mind but IMO it is better to keep this specific to what we're doing today, as unused code paths are something we pay a cognitive tax for.
Re: the TODO, I agree with not needlessly repeating code, but covering multiple different tasks in one code path can be worse than duplicating some parts, and so I would like to leave the decision on how to structure new functionality to the point in time at which we're considering actually adding it. Small variations in code can add a lot of mental load and make it difficult to correctly reason about what's going on. We already have means which are across different sets and we have different load dimensions, all of which need to fit into the head at once. So small variations inside a method are something I'm looking to minimize.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2392 at r4 (raw file):

Previously, sumeerbhola wrote…

Regarding why this was done exactly this way, it was simple.
I don't remember there being multi-store asim tests, so we can revisit after we add them.

Done.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2395 at r4 (raw file):
It's more a prompt to provide a better explanation than a question:

  // done", I don't understand why it makes sense to solve this exactly
	// as it was done. The node level summary is based on node-wide CPU

I could understand intuitively why we'd compare the CPU-level summary of the store with the CPU-level summary of the node - CPU is a shared resource and the example illustrates why we need to compare. The nls reflects the node-level CPU util well, but the sls is the max over all dimensions for the store. It doesn't follow from the example that this is required. If we used the CPU-level summary for the store here, would that be cleaner? Note that targetSummary (which is not what we're using here) is (for lease transfers) just the summary for the CPU dimension. If sls is driven by a non-CPU dimension, won't we sometimes hit this check but not for a good reason?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 647 at r4 (raw file):

Previously, sumeerbhola wrote…

This was just how the code happened to be written, since we need the list without the current leaseholder in the later loop that does

for _, cand := range cands ...

and the cands type is not the same as storeSet so we have to do the work to construct the storeSet anyway.

Ah, right - I missed that cands doesn't contain it. Will update the comment.

@tbg tbg force-pushed the store-replica-transfers-cleanup branch from 8fdabac to 7bc5dd9 on December 8, 2025 16:45
@blathers-crl

blathers-crl bot commented Dec 8, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Member Author

@tbg tbg left a comment

Pushed the update.

First commit (health default on and asim diff) is from #158977. There's no asim diff after that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)

Member Author

@tbg tbg left a comment

r4-r8 is the actual diff in Reviewable btw (I rebased)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)

Member Author

@tbg tbg left a comment

(I mean the incremental diff for @sumeerbhola)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)

@tbg tbg requested a review from sumeerbhola December 8, 2025 16:51
Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

@sumeerbhola reviewed 1 of 9 files at r7, 3 of 11 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 521 at r4 (raw file):

I'm not sure I understand your comment about urgency being a factor if we flipped the order. Doesn't the mma consider everyone in the set of candidates? So if we sorted here by lease preference first, it would only heuristically mean that those candidates are considered first. It wouldn't prevent transfers from happening to less-than-optimal preferences in general?

All of the sorting here is mostly to eliminate candidates, since what remains finally feeds into a random selection j = rng.Intn(j). We can of course change the algorithm to not use a uniform distribution at the end, but I haven't given that any thought.

Another thought here is that rebalancing should not need to improve lease preference satisfaction: we are here because the source is overloaded or wanting to shed leases for some reason. Doing that work without making the lease worse is sufficient. Improving the lease preference should be the job of the lease queue.

Of course, if the SMA behavior of only considering the best lease preference is fine, we could just do that before calling sortTargetCandidateSetAndPick. Is that what is happening in SMA for replica transfers too (where the source is the leaseholder)?
We would need to change the ordering of the filtering in MMA. For example, in one case in SMA we do:

	// Exclude suspect/draining/dead stores.
	candidates, _ := storePool.LiveAndDeadReplicas(
		allExistingReplicas, false, /* includeSuspectAndDrainingStores */
	)
	// If there are any replicas that do match lease preferences, then we check if
	// the existing leaseholder is one of them.
	preferred := a.PreferredLeaseholders(storePool, conf, candidates)

That is, we are filtering out the ones that are not disposition-ok and then, among the remaining ones, finding the best lease preference set. While in MMA we are doing:

		cands, _ := rstate.constraints.candidatesToMoveLease()
...
		candsPL = retainReadyLeaseTargetStoresOnly(ctx, candsPL, re.stores, rangeID)

So we could pick the best first and filter all of them out.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2395 at r4 (raw file):

The nls reflects the node-level CPU util well, but the sls is the max over all dimensions for the store. It doesn't follow from the example that this is required. If we used the CPU-level summary for the store here, would that be cleaner?

This is a good point. It is possible this was just a quick oversight. On the other hand, we want to be as permissive as possible to avoid getting stuck in a local minimum. Using the worst value on the RHS of the <= expression makes for more permissiveness. I would lean towards keeping this expression and adding a comment about permissiveness.


pkg/kv/kvserver/asim/tests/testdata/non_rand/mma/high_cpu.txt line 64 at r8 (raw file):

eval duration=20m samples=1 seed=42 cfgs=(mma-count,mma-only) metrics=(cpu,cpu_util,replicas,leases)
----

any idea why these asim tests have different output? I thought we had no unhealthy stores, and for that case we wouldn't have a change in behavior.

Contributor

@wenyihu6 wenyihu6 left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2373 at r7 (raw file):

		// comparing targetSLS with itself.
		//
		// Consider a node that has two stores:

This makes sense for CPU since it is a shared resource. Does this reasoning apply to other store-specific resources as well?


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 865 at r8 (raw file):

	// we will recompute the means again below.
	cs.scratchDisj[0] = conj
	means := cs.meansMemo.getMeans(cs.scratchDisj[:1])

To confirm my understanding: we calculate the means first only for the potential caching optimization. Otherwise, we could just compute the mean over the filtered store set?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 779 at r7 (raw file):

			// Example: Consider a range with leaseholder on s1 and voters on s2
			// and s3. s1 is the leaseholder and the store uses 40% of its CPU.
			// If s2 and s3 are at 80% CPU, the mean CPU utilization is 66% if

I don't think it affects anything since the same overall reasoning applies, but one small nit: the mean CPU utilization isn't simply (40 + 80 + 80) / 3, but is instead computed as load divided by capacity. I pushed a commit to change this example.
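
To make the distinction concrete, a toy calculation (loads and capacities are made up; the per-store utilizations mirror the 40%/80%/80% example):

package main

import "fmt"

// Averaging per-store utilization percentages is not the same as dividing
// total load by total capacity once the stores' capacities differ.
func main() {
	load := []float64{4, 8, 16}       // hypothetical CPU load per store
	capacity := []float64{10, 10, 20} // hypothetical CPU capacity per store

	var utilSum, loadSum, capSum float64
	for i := range load {
		utilSum += load[i] / capacity[i]
		loadSum += load[i]
		capSum += capacity[i]
	}
	fmt.Printf("mean of per-store utilizations: %.1f%%\n", 100*utilSum/float64(len(load))) // 66.7%
	fmt.Printf("total load / total capacity:    %.1f%%\n", 100*loadSum/capSum)             // 70.0%
}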

tbg added 3 commits December 18, 2025 21:20
- Improve documentation and TODOs
- Skip leaseholder for disposition checks
Previously, computeCandidatesForRange (now computeCandidatesForReplicaTransfer)
computed the mean over all constraint-satisfying stores, then applied
post-means exclusions before returning candidates. These exclusions were:
- Stores on nodes already housing replicas of the range
- The shedding store itself (handled via a separate check in the loop)
- If the shedding store's node was CPU overloaded, all other stores on
  that node

The last bullet was ad-hoc and asymmetric: we wouldn't apply the same logic
to other overloaded nodes that happened to have replicas. It also wasn't
exercised, and the downstream candidate selection logic wouldn't choose these
stores anyway (they'd be rejected for being on an overloaded node).

This commit makes two changes:

1. Simplified post-means exclusions: The parameter is renamed from
   storesToExclude to postMeansExclusions to clarify its role. All exclusions
   are now handled uniformly in this set:
   - The shedding store (we're moving away from it)
   - Stores on nodes with other existing replicas

   The ad-hoc "exclude other stores on shedding node if node is overloaded"
   logic is removed. Within-node rebalance (to another store on the shedding
   node) is now permitted at this stage.

2. Pre-means filtering: This is new. Before computing the mean, we now filter
   out stores that aren't ready to receive replicas based on disposition.
   As a legacy case, we manually filter on high disk utilization, but there
   is a TODO to fold this into the disposition as well - this is tracked and
   will be done separately.

   This is done by retainReadyReplicaTargetStoresOnly. The shedding store
   bypasses the disposition check since it already has the replica - its load
   should be in the mean regardless of its disposition for NEW replicas.

   When filtering occurs, we recompute the mean over the filtered set. The
   common case (nothing filtered) still uses the cached mean for efficiency.
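
   A rough sketch of that flow, with placeholder names (storeInfo,
   meansForCandidates, cachedMean) rather than the actual code:

package sketch

type storeInfo struct {
	id    int
	load  float64
	ready bool // disposition allows receiving a new replica
}

// meansForCandidates drops not-ready stores before the mean is taken. The
// shedding store bypasses the readiness check: it already holds the replica,
// so its load belongs in the mean regardless of its disposition for new
// replicas. The cached mean is reused only when nothing was filtered out.
func meansForCandidates(all []storeInfo, sheddingID int, cachedMean float64) ([]storeInfo, float64) {
	var filtered []storeInfo
	for _, s := range all {
		if s.ready || s.id == sheddingID {
			filtered = append(filtered, s)
		}
	}
	if len(filtered) == len(all) {
		// Common case: nothing filtered, the cached mean is still valid.
		return filtered, cachedMean
	}
	if len(filtered) == 0 {
		return nil, 0
	}
	var sum float64
	for _, s := range filtered {
		sum += s.load
	}
	return filtered, sum / float64(len(filtered))
}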

Other cleanup:
- Fixed logging (VEventf instead of mixing V() and Infof)
- Added scratch fields to clusterState for the new filtering but also left
  a TODO to get out of scratch fields entirely.

Part of cockroachdb#156776.

Epic: CRDB-55052
@wenyihu6 wenyihu6 force-pushed the store-replica-transfers-cleanup branch from 62d25a6 to 15843b6 on December 19, 2025 02:44
@wenyihu6
Contributor

Since there were no blocking comments, I’d like to get this merged. I’ll take a closer look and follow up on the comments afterward. There were some non-trivial merge conflicts - hope I resolved them correctly. I made a backup before the last force-push in case anything went wrong https://github.com/cockroachdb/cockroach/compare/master...wenyihu6:cockroach:copied-158473?expand=1.

bors r+

@craig
Contributor

craig bot commented Dec 19, 2025

@craig craig bot merged commit 43add13 into cockroachdb:master Dec 19, 2025
23 of 24 checks passed
@wenyihu6
Contributor

blathers backport 26.1

Contributor

@wenyihu6 wenyihu6 left a comment

@wenyihu6 made 3 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale).


pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go line 521 at r4 (raw file):

Previously, sumeerbhola wrote…

I'm not sure I understand your comment about urgency being a factor if we flipped the order. Doesn't the mma consider everyone in the set of candidates? So if we sorted here by lease preference first, it would only heuristically mean that those candidates are considered first. It wouldn't prevent transfers from happening to less-than-optimal preferences in general?

All of the sorting here is mostly to eliminate candidates, since what remains finally feeds into a random selection j = rng.Intn(j). We can of course change the algorithm to not use a uniform distribution at the end, but I haven't given that any thought.

Another thought here is that rebalancing should not need to improve lease preference satisfaction: we are here because the source is overloaded or wanting to shed leases for some reason. Doing that work without making the lease worse is sufficient. Improving the lease preference should be the job of the lease queue.

Of course, if the SMA behavior of only considering the best lease preference is fine, we could just do that before calling sortTargetCandidateSetAndPick. Is that what is happening in SMA for replica transfers too (where the source is the leaseholder)?
We would need to change the ordering of the filtering in MMA. For example, in one case in SMA we do:

	// Exclude suspect/draining/dead stores.
	candidates, _ := storePool.LiveAndDeadReplicas(
		allExistingReplicas, false, /* includeSuspectAndDrainingStores */
	)
	// If there are any replicas that do match lease preferences, then we check if
	// the existing leaseholder is one of them.
	preferred := a.PreferredLeaseholders(storePool, conf, candidates)

That is, we are filtering out the ones that are not disposition-ok and then, among the remaining ones, finding the best lease preference set. While in MMA we are doing:

		cands, _ := rstate.constraints.candidatesToMoveLease()
...
		candsPL = retainReadyLeaseTargetStoresOnly(ctx, candsPL, re.stores, rangeID)

So we could pick the best first and filter all of them out.

Documenting the gap in #160015.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2395 at r4 (raw file):

Previously, sumeerbhola wrote…

The nls reflects the node-level CPU util well, but the sls is the max over all dimensions for the store. It doesn't follow from the example that this is required. If we used the CPU-level summary for the store here, would that be cleaner?

This is a good point. It is possible this was just a quick oversight. On the other hand, we want to be as permissive as possible to avoid getting stuck in a local minimum. Using the worst value on the RHS of the <= expression makes for more permissiveness. I would lean towards keeping this expression and adding a comment about permissiveness.

Documenting the gap here: #160015.


pkg/kv/kvserver/asim/tests/testdata/non_rand/mma/high_cpu.txt line 64 at r8 (raw file):

Previously, sumeerbhola wrote…

any idea why these asim tests have different output? I thought we had no unhealthy stores, and for that case we wouldn't have a change in behavior.

This diff went away when this got merged. Possibly due to 53ec734.
