Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(duplication): fix plog do not gc after meta server restart #2066

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions src/meta/duplication/meta_duplication_service.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@
#include "utils/ports.h"
#include "utils/string_conv.h"
#include "utils/zlocks.h"
#include "utils/defer.h"

DSN_DECLARE_bool(dup_ignore_other_cluster_ids);

Expand Down Expand Up @@ -701,6 +702,10 @@ void meta_duplication_service::recover_from_meta_state()
_meta_svc->get_meta_storage()->get_children(
get_duplication_path(*app),
[this, app](bool node_exists, const std::vector<std::string> &dup_id_list) {
dsn::defer([this, app]() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory state will be updated [1] after get data from ZK successfully in line 721, why update it again?

  1. https://github.com/apache/incubator-pegasus/blob/master/src/meta/duplication/meta_duplication_service.cpp#L787

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory state will be updated [1] after get data from ZK successfully in line 721, why update it again?

  1. https://github.com/apache/incubator-pegasus/blob/master/src/meta/duplication/meta_duplication_service.cpp#L787

Because in function recover_from_meta_state, the vector dup_id_list may be empty.
Here is the detail of our online condition:

  1. Delete the existing hot standby first
    A. Delete the corresponding dup from zk
    B. Set duplicating in meta_server memory to 0 (not saved in zk), duplicating in zk is still 1
  2. meta_server restarts
    Read dup and app_info data from zk to memory. At this time, dup is empty, but duplicating in zk is still 1 (because dup is empty, it is impossible to delete the hot standby to make duplicating 0).
  3. Meta passes the hot standby status to the replica through on_config_sync
  4. Replica GC checks that the dup map is empty (min_confirmed_decree = invalid_decree), but _is_duplication_master = true, so GC is not performed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information. So this patch is aim to complete step 2? Why not complete the 1.B step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information. So this patch is aim to complete step 2? Why not complete the 1.B step?

Both two ways can fix this problem. In my opinion, the ways like step 2 is easier to implement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness and consistent is important. Update the duplicating status to ZK is not very difficult if all duplications removed.
Or, if the duplicating memory status can be recovered from <app>/duplication ZK path, just competely remove the duplicating status from ZK.

zauto_write_lock l(app_lock());
refresh_duplicating_no_lock(app);
});
if (!node_exists) {
// if there's no duplication
return;
Expand Down
Loading