[DPE-3684] Implement DA139 #663

Closed
wants to merge 46 commits into from

Contributor

@dragomirp dragomirp commented Oct 27, 2024

Implement DA139:

  • Change the promote-to-primary action so that it promotes units and reinitialises RAFT
  • Add status messages for stuck RAFT

codecov bot commented Oct 27, 2024

Codecov Report

Attention: Patch coverage is 58.49057% with 22 lines in your changes missing coverage. Please review.

Project coverage is 71.85%. Comparing base (16e36d0) to head (72ddcac).
Report is 8 commits behind head on dpe-3684-reinitialise-raft.

Files with missing lines    Patch %    Lines
src/cluster.py              38.09%     13 Missing ⚠️
src/charm.py                70.96%     4 Missing and 5 partials ⚠️
Additional details and impacted files
@@                      Coverage Diff                       @@
##           dpe-3684-reinitialise-raft     #663      +/-   ##
==============================================================
- Coverage                       72.18%   71.85%   -0.33%     
==============================================================
  Files                              15       15              
  Lines                            3426     3464      +38     
  Branches                          528      535       +7     
==============================================================
+ Hits                             2473     2489      +16     
- Misses                            827      844      +17     
- Partials                          126      131       +5     

@dragomirp dragomirp changed the title from "[DPE-3684] Three units scenarios" to "[DPE-3684] Implement DA139" Dec 24, 2024
Comment on lines -109 to -111
self.framework.observe(
    self.charm.on.promote_to_primary_action, self._on_promote_to_primary
)
Contributor Author

Moved to the main charm code, since it's no longer used only for async promotion.

Comment on lines -865 to -874
try:
    health_status = self.get_patroni_health()
except Exception:
    logger.warning("Remove raft member: Unable to get health status")
    health_status = {}
if health_status.get("role") in ("leader", "master") or health_status.get(
    "sync_standby"
):
    logger.info(f"{self.charm.unit.name} is raft candidate")
    data_flags["raft_candidate"] = "True"
Contributor Author

Wait for the action to start reinit
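
For reference, the condition in the removed lines can be read as a predicate over the Patroni health payload. The following is only a sketch: `is_raft_candidate` is a hypothetical standalone helper, not a function in the charm, and the payload shape is assumed from the `role` and `sync_standby` keys used in the excerpt above.

```python
def is_raft_candidate(health_status: dict) -> bool:
    """Mirror the removed check: primary or sync standby counts as a candidate.

    Hypothetical helper for illustration; the charm inlined this condition.
    """
    is_primary = health_status.get("role") in ("leader", "master")
    return is_primary or bool(health_status.get("sync_standby"))


# An empty payload (health check failed) never marks the unit as a candidate.
print(is_raft_candidate({}))
print(is_raft_candidate({"role": "leader"}))
print(is_raft_candidate({"role": "replica", "sync_standby": True}))
```

With this PR the flag is no longer set from this condition; reinitialisation waits for the action instead.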

@@ -746,15 +747,18 @@ def stop_patroni(self) -> bool:
             logger.exception(error_message, exc_info=e)
             return False
 
-    def switchover(self) -> None:
+    def switchover(self, candidate: str | None = None) -> None:
Contributor Author

Pass a candidate when promoting a specific unit.
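
A rough sketch of how an optional candidate could shape the switchover request. Patroni's REST API documents `leader` and `candidate` fields for `POST /switchover`; everything else here is an assumption for illustration: `build_switchover_body` is a hypothetical helper, the member names are made up, and the charm's real method body is not shown in this thread.

```python
from typing import Dict, Optional


def build_switchover_body(leader: str, candidate: Optional[str] = None) -> Dict[str, str]:
    """Build a JSON body for a Patroni switchover request.

    Hypothetical helper: the candidate is included only when a specific
    member should be promoted; otherwise Patroni picks the new leader.
    """
    body = {"leader": leader}
    if candidate is not None:
        body["candidate"] = candidate
    return body


# Member names below are placeholders, not real unit names.
print(build_switchover_body("postgresql-0"))
print(build_switchover_body("postgresql-0", "postgresql-1"))
```

Keeping `candidate` optional preserves the old call shape, `switchover()`, for callers that do not care which unit is promoted.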

for unit in units:
logger.info(f"Stopping unit {unit}")
await stop_machine(ops_test, await get_machine_from_unit(ops_test, unit))
await sleep(15)
Contributor Author

Sleep to give the Juju leadership time to drift.

Comment on lines +109 to +114
# Check if Patroni self healed
assert (
    left_unit.workload_status == "active"
    and left_unit.workload_status_message == "Primary"
)
logger.warning(f"Patroni self-healed without raft reinitialisation for roles {roles}")
Contributor Author

Sometimes, when the primary and the async replica are removed, Patroni manages to survive without a RAFT reinitialisation, so I added an exception for this case. Should I narrow it down further?

Member

I think there is no need for that.

@dragomirp dragomirp marked this pull request as ready for review January 24, 2025 01:43
@dragomirp dragomirp requested review from a team, taurus-forever, marceloneppel and lucasgameiroborges and removed request for a team January 24, 2025 01:43
Member

@marceloneppel marceloneppel left a comment

Great work, Dragomir!

@dragomirp dragomirp force-pushed the dpe-3684-three-units branch from 9b0b37a to adb8ba9 Compare January 29, 2025 13:33
@dragomirp
Contributor Author

Merged into #611 manually.

@dragomirp dragomirp closed this Feb 4, 2025
@dragomirp dragomirp deleted the dpe-3684-three-units branch March 21, 2025 23:31