Missing broker task attribute 'endpoint' on broker startup / reconciliation #310
Comments
Here is an example log sequence when this issue happens:
https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/BrokerLifecycleManager.scala#L79 A potential fix could be adding the
This is a good catch, and thanks for the investigation. If I recall, the reason the endpoint isn't set on task creation is to avoid advertising a broker that isn't ready to accept connections yet. If you wanted you could try adding
My concern is that if for whatever reason the status update from the executor failed to be processed by the framework, the
Currently, the only way to populate the endpoint is from the above message, which the executor sends only once, when a broker becomes ready. This is very fragile: when it fails, the endpoint is lost until the broker restarts, and neither reconciliation nor restarting the framework recovers the endpoint info. I think adding the extra match case should work. In addition, I'd still like to generate the endpoint on task creation, since clients could look at the
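To make the "extra match case" idea concrete, here is a minimal, self-contained sketch. It does not use the framework's real types: `TaskState`, `StatusUpdate`, `BrokerTask` and `onReconcile` are all hypothetical names invented for illustration, and it assumes the status update seen during reconciliation may carry the endpoint, which may not hold in practice.

```scala
object ReconcileSketch {
  // Hypothetical stand-ins for Mesos task states; not the framework's real types.
  sealed trait TaskState
  case object TaskStarting extends TaskState
  case object TaskRunning extends TaskState
  case object TaskLost extends TaskState

  // A status update that may (assumption) carry the endpoint reported by the executor.
  final case class StatusUpdate(state: TaskState, endpoint: Option[String])

  // Minimal broker task record: the endpoint stays empty until something sets it.
  final case class BrokerTask(id: String, endpoint: Option[String] = None)

  // Returns the task after applying a status update seen during reconciliation.
  def onReconcile(task: BrokerTask, update: StatusUpdate): BrokerTask =
    update.state match {
      // The extra case: a broker that is still starting is handled like a running one,
      // and an endpoint carried by the update is recorded; otherwise the old one is kept.
      case TaskStarting | TaskRunning =>
        task.copy(endpoint = update.endpoint.orElse(task.endpoint))
      // A lost task no longer has a usable endpoint.
      case TaskLost =>
        task.copy(endpoint = None)
    }
}
```

With this shape, an update in the starting state no longer falls through unhandled, and an already known endpoint survives an update that carries none.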
We have hit this issue multiple times when the framework is restarted while brokers are starting up. During reconciliation, the framework fails to register an endpoint on that broker because it does not recognise TASK_STARTING as a valid state.
This eventually leaves the broker state as follows when running:
As the framework knows the hostname and port that the broker will start up on (from the Mesos offer it receives), wouldn't it make sense to add the endpoint when launching the task, rather than adding it after the task has started?
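A rough sketch of that suggestion, assuming the offer exposes a hostname and the port picked for the broker. `Offer`, `BrokerTask`, `endpointOf` and `launch` are hypothetical, simplified names for illustration only, not the framework's actual classes or the Mesos API.

```scala
object LaunchSketch {
  // Hypothetical, simplified view of a Mesos offer: the host and the port chosen for the broker.
  final case class Offer(hostname: String, port: Int)

  // Minimal broker task record with an optional endpoint attribute.
  final case class BrokerTask(id: String, endpoint: Option[String] = None)

  // Builds the "host:port" endpoint string from the offer.
  def endpointOf(offer: Offer): String = s"${offer.hostname}:${offer.port}"

  // Creates the task with its endpoint already set, instead of waiting for a status update.
  def launch(id: String, offer: Offer): BrokerTask =
    BrokerTask(id, endpoint = Some(endpointOf(offer)))
}
```

For example, `LaunchSketch.launch("b0", LaunchSketch.Offer("host1", 9092))` yields a task whose endpoint is `Some("host1:9092")`. The trade-off raised in the comments still applies: an endpoint set at launch is visible before the broker can accept connections.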