@kostasrim kostasrim commented Sep 28, 2025

Following #5774, this PR removes the replica_of_mu_ lock from the INFO command, which now uses the thread locals together with the replicaof v2 algorithm.

No more locking and blocking for INFO command during replicaof or takeover 🥳 🎉 🌮
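For readers unfamiliar with the pattern, here is a minimal sketch of a lock-free read through a thread-local `shared_ptr`. It is an illustration under stated assumptions, not the code in this PR: `ReplicaInfo`, `tl_replica_sketch`, `PublishOnThisThread` and `DescribeRole` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the replica object; not the real class.
struct ReplicaInfo {
  std::string master_host;
  int master_port = 0;
};

// Each thread keeps its own reference to the current replica state.
// The name is an assumption made for this sketch.
thread_local std::shared_ptr<const ReplicaInfo> tl_replica_sketch;

// Would be invoked on every thread when the replication state changes.
// Passing nullptr means "this node is a master again".
void PublishOnThisThread(std::shared_ptr<const ReplicaInfo> next) {
  tl_replica_sketch = std::move(next);
}

// INFO-style reader: copies the thread-local pointer without taking a mutex.
// The copy keeps the object alive even if the slot is replaced afterwards.
std::string DescribeRole() {
  std::shared_ptr<const ReplicaInfo> snapshot = tl_replica_sketch;
  if (!snapshot) {
    return "role:master";
  }
  return "role:replica master=" + snapshot->master_host + ":" +
         std::to_string(snapshot->master_port);
}
```

Calling `DescribeRole()` before anything is published reports a master. After `PublishOnThisThread(std::make_shared<ReplicaInfo>(ReplicaInfo{"10.0.0.1", 6379}))` it reports the replica endpoint, and the reader never waits on a lock held by a long-running replicaof.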

@kostasrim kostasrim self-assigned this Sep 28, 2025
@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 620f35c to 7e36655 Compare September 28, 2025 09:37
}
}

void ServerFamily::RoleV2(CmdArgList args, const CommandContext& cmd_cntx) {

@kostasrim kostasrim Sep 28, 2025

Once the regression tests have run for some time, we will gradually remove Role(), ReplicaOf, etc. (all the variants from the previous versions).

@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 7cee991 to 59b3350 Compare September 29, 2025 09:23
int, replica_priority, 100,
"Published by info command for sentinel to pick replica based on score during a failover");
ABSL_FLAG(bool, experimental_replicaof_v2, true,
ABSL_FLAG(bool, experimental_replicaof_v2, false,

Will revert back to true. I just want to make sure I did not break anything in case we want to switch back to the old implementation.

ProtocolClient::~ProtocolClient() {
exec_st_.JoinErrorHandler();

// FIXME: We should close the socket explictly outside of the destructor. This currently

@kostasrim kostasrim Oct 2, 2025

After spending a few hours on this, I still don't understand why, if we keep this code, we get a segfault in the destructor of std::shared_ptr<Replica>. It seems to happen during the preemption, but the sock_ resources are already deallocated, so Close() should return early because fd_ < 0.

What is more, the core dump shows that tl_replica and its copy have different ref-counted objects: one appears expired while the other has a ref count of 7. I added a CHECK() before the crash to make sure that both copies of the shared_ptr point to the exact same control block. The checks passed, yet the core dump showed otherwise, which makes me think that this is some kind of memory corruption.
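For reference, one way to express a "same control block" check with the standard library is `owner_before`. This is only a sketch of the idea, not the CHECK() that was actually added: `SharesControlBlock` and `CheckSameOwner` are hypothetical helpers, and plain `assert` stands in for CHECK().

```cpp
#include <cassert>
#include <memory>

// Two shared_ptrs share ownership (the same control block) exactly when
// neither one is ordered before the other by owner_before. Comparing get()
// is not enough, because aliasing pointers can share a control block while
// pointing at different addresses.
template <typename T, typename U>
bool SharesControlBlock(const std::shared_ptr<T>& a, const std::shared_ptr<U>& b) {
  return !a.owner_before(b) && !b.owner_before(a);
}

// Hypothetical helper: assert stands in for a CHECK()-style macro.
template <typename T>
void CheckSameOwner(const std::shared_ptr<T>& a, const std::shared_ptr<T>& b) {
  assert(SharesControlBlock(a, b));
}
```

A copy of a `shared_ptr` passes the check against its source, while two objects created by separate `std::make_shared` calls do not, even if they hold equal values.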

The good thing is that we don't need this code anymore, as we now handle closing the socket outside of the destructor.

While writing this, the only case I can think of is that the last instance of tl_replica gets destructed but needs to preempt, and an INFO command comes in and grabs a copy while the shared_ptr is still destructing, which could lead to a race condition.
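If the theory above holds, one defensive pattern is to detach the pointer from the thread-local slot before the last reference is released, so a reader that runs while the destructor is suspended only ever sees an empty slot. The sketch below does not reproduce the crash or confirm the theory. `ReplicaSketch`, `tl_slot`, `ReleaseReplica` and the `on_destroy` hook (a stand-in for a fiber preemption point) are all hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in: the destructor calls a hook at the point where a
// fiber could be preempted and another command could run.
struct ReplicaSketch {
  std::function<void()> on_destroy;
  ~ReplicaSketch() {
    if (on_destroy) on_destroy();
  }
};

// Hypothetical thread-local slot, playing the role of tl_replica.
thread_local std::shared_ptr<ReplicaSketch> tl_slot;

// Detach first, destroy second.
void ReleaseReplica() {
  // Moving from a shared_ptr leaves the source empty.
  std::shared_ptr<ReplicaSketch> doomed = std::move(tl_slot);
  assert(tl_slot == nullptr);
  doomed.reset();  // the destructor and its hook run here
}

// Runs the scenario and reports whether a reader that ran during the
// destruction observed an empty slot.
bool ReaderSawEmptySlotDuringDestruction() {
  bool saw_empty = false;
  tl_slot = std::make_shared<ReplicaSketch>();
  tl_slot->on_destroy = [&saw_empty] {
    // What an INFO handler would do: grab a copy of the slot.
    std::shared_ptr<ReplicaSketch> copy = tl_slot;
    saw_empty = (copy == nullptr);
  };
  ReleaseReplica();
  return saw_empty;
}
```

In this single-threaded simulation, the simulated reader gets a null copy instead of a pointer to an object that is halfway through destruction.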

I will verify this theory once I am back from the holidays.

P.S. The test that failed is test_cancel_replication_immediately (and it fails once every 300 runs, so it's kinda time-consuming to reproduce).

@kostasrim kostasrim requested a review from romange October 2, 2025 07:47
@kostasrim

With all the changes around replication, I will follow up with a tidy PR 😄
