hosa fails to collect metrics from Jolokia endpoint on OCP #147
Comments
I would have thought just putting in the tls: skip flag would do the trick, without all the other steps. Did you try just that? Just put that tls: section in your pod's configmap, which should tell the agent to skip cert validation for that Jolokia endpoint. That should have done it, unless there is something else going on that I don't understand. BTW, just to be clear: that tls: section goes in the pod's configmap, not the agent's own configmap. |
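A minimal sketch of what that pod configmap endpoint might look like, assuming a Jolokia endpoint on port 8778 and an illustrative metric (the port, path and metric name here are placeholders; the point is the tls block):

```yaml
endpoints:
- type: jolokia
  protocol: https
  port: 8778
  path: /jolokia/
  tls:
    skip_certificate_validation: true
  metrics:
  - name: java.lang:type=Memory#HeapMemoryUsage#used
    type: gauge
```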
Yes, I've already tried that. |
Ohhhh, it's a client certificate thing. Yeah, we really need to figure this out. @mwringe you know all about the client cert stuff in OpenShift. I thought we could support that with some configuration. |
@jmazzitelli we are still waiting for OpenShift to allow client certificates to be managed by their system. This wouldn't exactly help in this situation either since the EAP containers are configured to only trust the API Proxy connection. |
That's exactly why I've tried to set the master proxy certs in the HOSA configmap, but I got that strange panic error. |
I am fairly certain I tried to use the API proxy client certificates with HOSA at some point early on and it worked, but that was before a lot of changes happened. But this is obviously not something we would advocate or recommend to users. |
OK, so how do you manage to collect metrics from a Jolokia endpoint exposed by a JBoss EAP container? |
We have an example with WildFly that configures the Jolokia agent using usernames and passwords: https://github.com/hawkular/hawkular-openshift-agent/tree/master/examples/jolokia-wildfly-example Out-of-the-box metric gathering for the EAP middleware containers is something we are still working on. The EAP image is currently configured to only accept Jolokia connections from the API proxy. While it would be possible for HOSA to read from the API proxy directly, this creates a bottleneck in the API proxy and is not something we want to rely on, simply because it cannot scale. |
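If memory serves, the endpoint section in that WildFly example looks roughly like this; the field values are placeholders and the key names should be checked against the linked example, which is authoritative:

```yaml
endpoints:
- type: jolokia
  protocol: https
  port: 8778
  path: /jolokia/
  credentials:
    username: jolokia-user
    password: jolokia-password
```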
@ljuaneda I think the way to fix this at this point in time is what @mwringe mentioned: take a look at the HOSA WildFly example and do what it does (i.e. configure your EAP middleware containers the way that example WildFly container is configured). HOSA is able to collect metrics over https/jolokia using that configuration. I will agree with @mwringe here: going through the master API proxy is very bad because there is no way that can scale to any large-sized cluster (it would be a bottleneck to pass every metric collection request for all our EAPs through the master API proxy). Thus, we aren't concentrating on supporting that, unless someone tells us that it is OK to use the master API proxy in this manner, that its performance is acceptable when used in such a way, and that we can accept failing to reach acceptable performance numbers at scale if it comes to that (which we predict it will). |
I totally agree that collecting metrics through the API proxy is a bad idea: it would surely lead to bottlenecks and should be avoided. Anyway, how can you reconfigure the Jolokia endpoint with the EAP image? (Sorry, I'm not quite comfortable with Dockerfile stuff.) |
By the way, earlier this year I did make it work without extra settings with an A-MQ image on OCP 3.3 (no insecure TLS, for example). I guess the Jolokia endpoint is not configured the same way in EAP. Maybe something changed with OCP 3.4, or with the latest releases of HOSA. |
UPDATE: ignore this comment in its entirety - I see what caused the panic and it has nothing to do with the SSL setup.
Hmm... that is interesting. What does your curl command look like? Are you telling curl some specific things (like -k, for example)? Now, the fact that Go panics tells me this might be a bug somewhere in Go or the Go libraries we are using. What version of Go are you using, just so we know? |
Also, @ljuaneda what version of HOSA are you using? |
Whoa... I think I see what caused the panic. @ljuaneda your SSL setup is correct - you are getting data back, but this is crashing:
We are getting a response back from the Jolokia REST client, but the value is nil. I need to do some nil checking here. It's odd that we aren't getting a value back. OK, @ljuaneda, let me write up a PR, fix this, and cut a release. I will try to get this done tonight; it shouldn't be hard. |
@ljuaneda just so we can see what Jolokia itself is returning, can you make a REST request using curl to get this data? Post what the JSON looks like. It seems odd we are getting a nil here. If you put HOSA in trace mode, this will log and hopefully show you the REST requests it is making:
To put HOSA in trace mode, go in HOSA's pod definition yaml and where you see HOSA command line to start it up, you'll see "-v 4" is passed in - change that 4 to a "5" to go into trace mode. I'm just curious why the Jolokia client is giving a nil here. |
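For reference, a raw Jolokia read over its REST API can be done with curl along these lines; the host, port, -k flag and the MBean/attribute shown are placeholders for whatever your endpoint exposes:

```
curl -k https://<pod-ip>:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage/used
```

The JSON response contains a value field, which is what the agent is apparently getting back as nil.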
OK, I'll do that tomorrow morning. It's close to midnight here in France, so it'll have to wait until I'm at work, but I'm sure we'll make it work quite shortly ;-) |
OK, I pushed a fix. I have a feeling this needs to be fixed further, but for now it at least avoids a panic. This will be pushed to Docker Hub shortly under the "latest" tag, so @ljuaneda when you get a chance in the morning, try that version. Why I say this needs to be fixed further is that I am wondering if the JMX MBean attribute being collected isn't a true float value. Perhaps it is a float in string form? Or perhaps it is an int? (Maybe it truly is nil?) But if it is a number in some other type (like int or String), we won't be collecting that either. I will probably make this code more fault tolerant - e.g. if the value is an int, I will convert it to float; if it is a String, see if I can parse it as a float and if so use it as such. That kind of thing. I would have thought Jolokia did this kind of conversion (why else would they provide this "GetValueAsFloat" method??) but maybe it doesn't? But again, this fix at least avoids the panic. It won't store the data point it is getting, because the value is nil - we just have to find out why this particular value came back nil from Jolokia. |
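A rough sketch (not the actual HOSA code) of the kind of fault tolerance described above: take whatever type the decoded Jolokia value happens to be and try to coerce it into a float64.

```go
package main

import (
	"fmt"
	"strconv"
)

// coerceToFloat accepts a decoded Jolokia attribute value of unknown type and
// tries to turn it into a float64, reporting whether the conversion succeeded.
func coerceToFloat(value interface{}) (float64, bool) {
	switch v := value.(type) {
	case nil:
		return 0, false // no value yet (e.g. LastGcInfo before the first GC)
	case float64:
		return v, true // encoding/json decodes JSON numbers as float64
	case int:
		return float64(v), true
	case int64:
		return float64(v), true
	case string:
		f, err := strconv.ParseFloat(v, 64)
		return f, err == nil
	default:
		return 0, false
	}
}

func main() {
	for _, raw := range []interface{}{nil, 162, 3.14, "42.5", "not-a-number"} {
		f, ok := coerceToFloat(raw)
		fmt.Printf("%v -> %v (ok=%v)\n", raw, f, ok)
	}
}
```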
I'm running Hawkular OpenShift Agent: Version: 1.2.4.Final-SNAPSHOT, Commit: 355a44f
Only activating TLS skip_certificate_validation, I see this in the logs:
@jmazzitelli I think you're right and it is related to the beans.
The panic error occurs when I try to collect java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration
By the way, adding the certs to HOSA has another bad side effect: it seems it fails to collect its own metrics on the Prometheus endpoint:
... leading all the HOSA pods to fail:
|
@jmazzitelli much better with f5857cc, the panic error is gone 🥇
DEBUG: Received non-float value [] for metric [java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration]
Now the side effect remains to be dealt with:
|
OK, so that MBean had no data returned by Jolokia:
That explains the nil. Compare that with an MBean that does have a value:
So at minimum this fixes the issue where an MBean has no metric data yet. |
One note: it looks like Jolokia will convert int values to floats, so there is nothing further I would need to do; at least it seems numeric values are getting converted to floats automatically:
There the value is an int (162), and notice we are able to get it from Jolokia using our code that expects a float:
So that's just confirmation that we can handle numeric values of different types and Jolokia will convert it for us. |
The "connection refused" tells me the metrics endpoint isn't even running? @ljuaneda what does your agent configmap look like? Specifically, the "emitter" section. Here's the Go code to show you what's allowed. The metrics emitter should be on by default, unless "metrics_enabled" is explicitly false. What other settings do you have in your agent config? Do you see any messages at startup that indicate something is wrong with the metrics emitter? If it was a problem with the SSL setup, I would have expected some other error, not "getsockopt: connection refused"; that just seems like the emitter server isn't listening at all. |
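For reference, a minimal emitter section in the agent's own configmap might look roughly like this; metrics_enabled is the setting named above, while the other key names are from memory and should be checked against the linked Go config code:

```yaml
emitter:
  metrics_enabled: true
  health_enabled: true
  status_enabled: true
```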
This is the content of the HOSA configmap:
|
When you remove the identity section, HOSA can scrape its own metrics; otherwise it fails. |
The "identity" is supposed to identify the agent: a single certificate/private key identifies something, and the agent "identity" identifies this agent. So it only has one identity (or it is supposed to). The idea is that you assign the agent a certificate/private key, and that becomes the agent's own certificate identifying itself, so for every HTTPS request the agent makes to any endpoint, it will identify itself with that certificate. This problem is all about the client cert issue, I believe; if OpenShift fixes that, the problem can be solved. This has been discussed several times with some OpenShift folks. Hopefully this kind of problem gets solved soon. That said, because this comes up so often, we have issue #75, which we may have to implement because this is really getting to be a problem for people. |
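For illustration, the identity section in the agent's configmap is roughly of this shape; the key names and paths here are placeholders and should be checked against the agent's configuration docs:

```yaml
identity:
  cert_file: /path/to/agent-cert.pem
  private_key_file: /path/to/agent-key.pem
```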
This issue with the panic on nil jolokia values has been fixed and released (1.3.1.Final). The other issue recently mentioned is tracked in issue #75 already. So we can close this issue. |
@jmazzitelli, even when I explicitly deactivate all emitter metrics, the HOSA container ends up failing.
Unfortunately, this time there is no trace of an error...
Disable the hawkular agent endpoint collection so it doesn't even try to collect metrics from it.
|
The problem seems to be elsewhere, because HOSA is still falling into CrashLoopBackOff since the liveness probes can't reach http://:8080/health :
|
Based on your previous log output:
You disabled the health probe endpoint, so yes, the agent pod will look like it is not ready. If you disable the health probe endpoint, you must remove the health probe from your agent's pod yaml definition. |
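Concretely, the probe to remove from the agent's pod spec would look roughly like this, based on the /health endpoint on port 8080 mentioned above (the exact yaml in your DaemonSet may differ):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
```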
Not that simple, I'm afraid...
This only occurs when I add the identity section in the HOSA configmap,
whether I enable the health probe or not.
On Thu, 16 Mar 2017 at 20:13, John Mazzitelli <[email protected]> wrote:
… Based on your previous log output:
I0316 14:08:15.281112 1 emitter_server.go:72] Agent emitter will NOT provide a health probe
You disabled the health probe endpoint - so yes the agent pod will look like it is not ready.
|
*sigh* I know why: https://github.com/hawkular/hawkular-openshift-agent/blob/master/emitter/emitter_server.go#L99 What is happening is that the agent is given its identity, so when it creates its server for accepting requests for health and metrics, it creates that server behind https with its identity as its server cert. But apparently that isn't good enough (I believe the cert you are using is a client cert, which probably can't be used here). I'm just guessing for now. But rather than continue this discussion in this git issue (which is fixed: the panic when collecting the Jolokia metric no longer happens), please create another git issue about this or start a thread on hawkular-dev (which might be better because then everyone will hear what is going on with this). |
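A rough illustration (not the actual emitter_server.go code) of the pattern described above: when the agent has an identity cert/key, its emitter server for /health and /metrics is started over HTTPS using that identity as the server certificate, so a certificate that is only usable for client authentication can break those requests.

```go
package main

import (
	"fmt"
	"net/http"
)

// startEmitter serves /health over HTTPS if an identity cert/key is given,
// otherwise over plain HTTP. The identity cert becomes the *server* certificate.
func startEmitter(addr, certFile, keyFile string) error {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "OK")
	})

	if certFile != "" && keyFile != "" {
		// If this cert was issued only for client authentication, probes and
		// scrapers connecting here may refuse or fail the TLS handshake.
		return http.ListenAndServeTLS(addr, certFile, keyFile, mux)
	}
	return http.ListenAndServe(addr, mux)
}

func main() {
	// Paths are placeholders; empty strings fall back to plain HTTP.
	if err := startEmitter(":8080", "", ""); err != nil {
		fmt.Println("emitter server error:", err)
	}
}
```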
Hello,
When trying to gather metrics from a Jolokia endpoint of a JBoss EAP container running on OpenShift, I get the following error:
It's obviously complaining about the certificates...
I also get the same error when trying with curl from a master:
Looking at #84 (comment)
I've created the master-proxy secret in the default namespace from master:
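For reference, creating such a secret from the master's proxy client certificate and key could look something like this; the file paths are the typical OCP 3.x locations and are placeholders here, not necessarily the exact command that was used:

```
oc create secret generic master-proxy \
    --from-file=master.proxy-client.crt=/etc/origin/master/master.proxy-client.crt \
    --from-file=master.proxy-client.key=/etc/origin/master/master.proxy-client.key \
    -n default
```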
I've added the certificates to the HOSA config file:
and then added the certificates to the daemonset:
Then I modified the metrics configmap to activate the TLS insecure mode:
But it's not working: