Framework stuck at broker reconciling state #306
Comments
Is it actually stuck? The default reconciliation timeout is pretty high, 30 minutes I think. Can you attach broker logs from before/after the framework restart?
All the brokers are no longer running, but the framework zk state remembered a broker in |
@steveniemitz Any thought? You can easily recreate the problem by having a framework with 1 running broker, stop the framework first, then kill the broker, manually change the broker state from I would suggest removing the |
I'll have some time to look at this soon. The reconciliation logic is fairly complicated because there are a bunch of edge cases, so it's not as simple as just removing that if block. I'd really like to see the logs from before and after the framework restarts; I'm still confused about how you're getting into this state. What versions of the framework and Mesos are you running? Also, when it gets into this state, manually stopping it from the CLI should be enough to get it out of reconciling. Have you tried that?
It's very likely to happen during a rolling restart of the Mesos slaves. Say you have 2 slaves, slave01 and slave02, with the broker running on slave01 and the framework running on slave02.
Here is an example state from /api/broker/list
Stopping the broker does not work ( The only relevant log I got after framework restart is
The log before the framework restart won't matter, as long as you time it to kill the framework when a broker is We are using mesos 0.28.2, from my understanding reconciliation is framework driven, so as long as the framework skip it in the startup |
This should fix issue mesos#306 which prevents the reconciling task from starting if a previous one doesn't fully complete.
We had several occasions where the Mesos slave with a running broker got restarted and the framework tried to reconcile the broker task, then the framework itself got restarted while the broker state was "reconciling" (saved in zk). After that, the framework is stuck because of the persisted "reconciling" broker state, even though no reconciliation is actually in progress. It will not start any brokers even though the broker is no longer running. The only way I found to fix this is to manually go into the zk /kafka-mesos node and delete all the "task" broker attributes that contain the "reconciling" state from the brokers JSON.
The code causing this problem:
https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124
Maybe it should resume the reconciliation or remove the check altogether, rather than doing nothing if the state is reconciling.
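To make the report easier to follow, here is a minimal Python sketch of the situation, not the actual code in TaskReconciler.scala. It assumes the persisted brokers JSON has a top-level "brokers" list whose entries carry an optional "task" object with a "state" field; that layout, the sample ids, and the function names `brokers_to_reconcile` and `drop_reconciling_tasks` are all hypothetical. The first function models a startup pass that either skips or resumes brokers already marked "reconciling"; the second models the manual cleanup described above, applied to a JSON string rather than to ZooKeeper.

```python
import json

RECONCILING = "reconciling"  # state name as quoted in this issue


def brokers_to_reconcile(brokers, skip_reconciling):
    """Return ids of brokers a startup pass would ask Mesos about.

    skip_reconciling=True models the reported behaviour (do nothing when the
    persisted state is already "reconciling"); False models resuming it.
    """
    selected = []
    for broker in brokers:
        task = broker.get("task")
        if task is None:
            continue  # no task recorded, nothing to reconcile
        if skip_reconciling and task.get("state") == RECONCILING:
            continue  # the check this issue proposes to revisit
        selected.append(broker["id"])
    return selected


def drop_reconciling_tasks(cluster_json):
    """Model of the manual workaround: delete "task" entries stuck in reconciling."""
    cluster = json.loads(cluster_json)
    for broker in cluster.get("brokers", []):
        task = broker.get("task")
        if task is not None and task.get("state") == RECONCILING:
            del broker["task"]
    return json.dumps(cluster)


# Hypothetical persisted state after the framework restarted mid-reconciliation.
persisted = json.dumps({
    "brokers": [
        {"id": "0", "task": {"id": "broker-0-example", "state": "reconciling"}},
        {"id": "1", "task": {"id": "broker-1-example", "state": "running"}},
    ]
})
brokers = json.loads(persisted)["brokers"]

print(brokers_to_reconcile(brokers, skip_reconciling=True))   # ['1']
print(brokers_to_reconcile(brokers, skip_reconciling=False))  # ['0', '1']

cleaned = json.loads(drop_reconciling_tasks(persisted))
print(["task" in b for b in cleaned["brokers"]])              # [False, True]
```

With the skip in place, broker 0 is never asked about again after a restart, which matches the stuck behaviour reported here. Whether resuming is safe in the real scheduler depends on edge cases this sketch does not model.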