chore: lock free info command with replicaof v2 #5864
base: main
620f35c to 7e36655
}
}

void ServerFamily::RoleV2(CmdArgList args, const CommandContext& cmd_cntx) {
Once the regression tests have run for some time, we will gradually remove Role(), ReplicaOf, etc. (all the variants from the previous versions).
7cee991 to 59b3350
Signed-off-by: kostas <[email protected]>
    int, replica_priority, 100,
    "Published by info command for sentinel to pick replica based on score during a failover");
ABSL_FLAG(bool, experimental_replicaof_v2, true,
ABSL_FLAG(bool, experimental_replicaof_v2, false,
I will revert this back to true; I just want to make sure I did not break anything in case we want to switch back to the old implementation.
ProtocolClient::~ProtocolClient() {
  exec_st_.JoinErrorHandler();

  // FIXME: We should close the socket explictly outside of the destructor. This currently
After spending a few hours on this, I still don't understand why, if we keep this code, we get a segfault in the destructor of std::shared_ptr<Replica>. It seems to happen during the preemption, but the sock_ resources are already deallocated, so Close() should return early because fd_ < 0.
What is more, the core dump shows that tl_replica and its copy have different ref-counted objects: one shows as expired while the other has a ref count of 7. I added CHECK() before the crash to make sure that both copies of the shared_ptr point to the exact same control block. The checks passed, yet the core dump showed otherwise, which makes me think that this is somehow a memory corruption error.
The good thing is that we don't need this code anymore, as we now handle closing the socket outside of the destructor.
While writing this, the only case I can think of is that the last instance of tl_replica gets destructed but needs to preempt, and an info command comes in and grabs a copy while the shared_ptr is destructing, which could lead to a race condition.
I will verify this theory once I am back from the holidays.
P.S. The test that failed is test_cancel_replication_immediately (and only about once every 300 runs, so it's kinda time-consuming to reproduce).
With all the changes around replication, I will follow up with a tidy PR 😄
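For anyone who wants to reproduce the control-block check mentioned above, here is a minimal sketch of how such a comparison can be written with the standard library. SameControlBlock is a hypothetical helper name, not something from the Dragonfly code base, and this is not necessarily the exact CHECK() used while debugging.

```cpp
#include <memory>

// Hypothetical helper (not part of Dragonfly): returns true when a and b share
// ownership, i.e. they use the same control block, or when both are empty.
// owner_before() orders shared_ptrs by control block, so two pointers are
// owner-equivalent exactly when neither orders before the other.
template <typename T, typename U>
bool SameControlBlock(const std::shared_ptr<T>& a, const std::shared_ptr<U>& b) {
  return !a.owner_before(b) && !b.owner_before(a);
}
```

Note that such a check only tells us about the moment it runs. Copying a shared_ptr instance while another thread destroys or reassigns that same instance is a data race in standard C++, and use_count() is only approximate under concurrent use, so a passing check does not rule out the theory above.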
Following #5774, this PR removes the locking of replica_of_mu_ from the info command and uses the thread locals && replicaof v2 algorithm instead.
No more locking and blocking for the INFO command during replicaof or takeover 🥳 🎉 🌮
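To illustrate the general idea of reading replication state from thread locals instead of taking a mutex, here is a minimal, self-contained sketch. It is not the code from this PR: ReplicaSnapshot, tl_snapshot, InstallSnapshot, InfoRole and BroadcastAndQuery are all hypothetical names, and plain std::thread stands in for the real proactor threads and fibers.

```cpp
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical immutable snapshot of the replication state.
struct ReplicaSnapshot {
  std::string master_host;
  int master_port;
};

// Every thread owns its own shared_ptr instance. Only the owning thread
// reads or writes it, so no mutex is needed to access it.
thread_local std::shared_ptr<const ReplicaSnapshot> tl_snapshot;

// Runs on the thread that should adopt the new state.
void InstallSnapshot(std::shared_ptr<const ReplicaSnapshot> snap) {
  tl_snapshot = std::move(snap);
}

// INFO-style read: copies the calling thread's pointer and formats it.
std::string InfoRole() {
  std::shared_ptr<const ReplicaSnapshot> local = tl_snapshot;  // keeps the snapshot alive
  if (!local) return "role:master";
  return "role:replica master=" + local->master_host + ":" +
         std::to_string(local->master_port);
}

// Simulates fanning a new snapshot out to n worker threads, each of which
// then answers an INFO-style query from its own thread-local copy.
std::vector<std::string> BroadcastAndQuery(const std::string& host, int port, int n) {
  auto snap = std::make_shared<const ReplicaSnapshot>(ReplicaSnapshot{host, port});
  std::vector<std::string> out(n);
  std::vector<std::thread> workers;
  for (int i = 0; i < n; ++i) {
    workers.emplace_back([&out, snap, i] {
      InstallSnapshot(snap);  // each thread stores its own copy of the pointer
      out[i] = InfoRole();    // each thread writes a distinct element
    });
  }
  for (auto& t : workers) t.join();
  return out;
}
```

In this sketch the readers never block on a writer: a writer publishes a new immutable snapshot per thread, and each reader only touches its own pointer. Copying distinct shared_ptr instances that share a control block is thread-safe, which is what makes the per-thread copies workable.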