Agent hangs if it doesn't start with the right permissions #111

mwringe · 2017-01-25T23:12:03Z

When starting the agent without the proper permissions, it logs the following error and hangs:

1 node_event_consumer.go:72] Error obtaining information about the agent pod [openshift-infra/hawkular-openshift-agent-qzg21]. err=User "system:serviceaccount:openshift-infra:hawkular-openshift-agent" cannot get pods in project "openshift-infra"

If the SA is then given the proper permissions, the pod still hangs. If the pod is restarted, it starts up properly.

By hanging like this, it's left in a state where it indicates that it's ready and running properly (status 1/1). At the very least, if it cannot properly continue, it should exit so that a new pod can be started in its place.

In this case, I believe the agent should wait and attempt to connect a few more times after some delay. We could even use a 'readiness probe' here to determine when the agent reaches a ready state.
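As a rough illustration of the wait-and-retry idea (a sketch only, not the agent's actual code): the helper below retries an operation a bounded number of times with a fixed delay between attempts. `retryWithDelay` and `errForbidden` are hypothetical names, and the closure in `main` only simulates the pod lookup that fails in the log above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errForbidden is a hypothetical stand-in for the "cannot get pods" error.
var errForbidden = errors.New("forbidden: cannot get pods")

// retryWithDelay calls op up to maxAttempts times, sleeping for delay after
// each failed attempt except the last. It returns the number of attempts made
// and the last error wrapped with context, or nil on success.
func retryWithDelay(maxAttempts int, delay time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulated lookup: fails twice (permission missing), then succeeds
	// (permission granted in the meantime).
	calls := 0
	lookup := func() error {
		calls++
		if calls < 3 {
			return errForbidden
		}
		return nil
	}

	attempts, err := retryWithDelay(5, 10*time.Millisecond, lookup)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

If all attempts fail, the caller gets an error back and can decide whether to exit or stay up in a not-ready state, rather than hanging silently.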


mwringe · 2017-01-25T23:12:26Z

We should also consider the situation where the permission is revoked while the agent is running, and how that should be handled.

jmazzitelli · 2017-01-26T00:13:40Z

We have a health probe endpoint. When errors like this occur, we can flip its values so the health probe sees a problem and acts appropriately (shut down / restart the agent?): https://github.com/hawkular/hawkular-openshift-agent/blob/master/emitter/health/health_emitter.go

mwringe · 2017-01-26T15:21:53Z

I think a more appropriate action would be to have the pod stay in the 'not ready' phase until the permission is granted. We can use a readiness probe for that. The logs should clearly indicate what the problem is and how to fix it. And once the permission has been granted the pod can continue and enter the ready state.
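To make the readiness idea concrete, here is a minimal sketch of an endpoint a readiness probe could poll. It assumes an HTTP probe, which treats any status from 200 up to (but not including) 400 as success, so answering 503 keeps the pod out of the ready state without restarting it. The `readiness` type and its methods are hypothetical and are not taken from the agent's existing health emitter.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// readiness is a hypothetical holder for the agent's ready state and the
// reason it is not ready, so the probe response can say how to fix it.
type readiness struct {
	mu     sync.RWMutex
	ready  bool
	reason string
}

// set records whether the agent is ready and, if not, why.
func (s *readiness) set(ready bool, reason string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ready = ready
	s.reason = reason
}

// status returns the HTTP status code and body the probe should see.
func (s *readiness) status() (int, string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.ready {
		return http.StatusOK, "ready"
	}
	return http.StatusServiceUnavailable, "not ready: " + s.reason
}

// ServeHTTP lets readiness be used directly as the probe handler.
func (s *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	code, body := s.status()
	w.WriteHeader(code)
	fmt.Fprintln(w, body)
}

func main() {
	state := &readiness{reason: "service account cannot get pods"}

	// A throwaway local test server stands in for the agent's real listener.
	srv := httptest.NewServer(state)
	defer srv.Close()

	for _, granted := range []bool{false, true} {
		if granted {
			state.set(true, "") // the permission has now been granted
		}
		resp, err := http.Get(srv.URL)
		if err != nil {
			fmt.Println("probe request failed:", err)
			return
		}
		resp.Body.Close()
		fmt.Println("probe status:", resp.StatusCode)
	}
}
```

The pod would stay at 0/1 until `set(true, "")` is called, and the response body carries the reason for anyone who curls the endpoint.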

I don't think we want to restart the pod in this case. That will cause a CrashLoopBackOff, which makes it look like our pods are really unstable and crashing. It is also presented to the user more as an error condition.

I think the same thing should be done if the permission is revoked. We shouldn't restart the pod in that case, but rather log the error and continuously check whether the permission has been re-granted. In this case the pod should also exit the ready state.
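A sketch of what that continuous check might look like, assuming a periodic poll (the agent could equally react to errors from its existing watches). `watchPermission`, `permissionState`, and `errForbidden` are hypothetical names, and the check function in `main` simulates a permission that is missing, granted, revoked, and re-granted.

```go
package main

import (
	"errors"
	"log"
	"sync/atomic"
	"time"
)

// errForbidden is a hypothetical stand-in for the "cannot get pods" error.
var errForbidden = errors.New("forbidden: cannot get pods")

// permissionState tracks whether the last permission check succeeded.
// A readiness endpoint could read ready from another goroutine.
type permissionState struct {
	ready atomic.Bool
}

// observe records the result of one check and reports whether the
// ready state changed as a result.
func (s *permissionState) observe(err error) bool {
	nowReady := err == nil
	return s.ready.Swap(nowReady) != nowReady
}

// watchPermission runs check every interval until stop is closed. It never
// exits the process on failure; it only logs and updates the ready state.
func watchPermission(stop <-chan struct{}, interval time.Duration, check func() error, s *permissionState) {
	for {
		select {
		case <-stop:
			return
		default:
		}

		err := check()
		if err != nil {
			log.Printf("permission check failed, staying not ready: %v", err)
		}
		if s.observe(err) {
			log.Printf("ready state changed: ready=%v", s.ready.Load())
		}

		select {
		case <-stop:
			return
		case <-time.After(interval):
		}
	}
}

func main() {
	// Simulated results: missing, granted, granted, revoked, re-granted.
	results := []error{errForbidden, nil, nil, errForbidden, nil}
	stop := make(chan struct{})
	i := 0
	check := func() error {
		err := results[i]
		i++
		if i == len(results) {
			close(stop) // end the demo after the last simulated result
		}
		return err
	}

	watchPermission(stop, 10*time.Millisecond, check, &permissionState{})
}
```

Because the loop only flips a flag, revoking the permission takes the pod out of the ready state and re-granting it brings the pod back, with no restart in between.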
