
hosa fails to collect metrics from Jolokia endpoint on OCP #147

Closed
ljuaneda opened this issue Mar 15, 2017 · 32 comments

@ljuaneda

Hello,

When trying to gather metrics from a Jolokia endpoint exposed by a JBoss EAP container running on OpenShift, I get the following error:

I0315 17:30:15.685203       1 discovery.go:362] DEBUG: Detected a new configmap [rnx-intranetsf-slot1] for namespace [asse-qt9stome1] but has no entry named [hawkular-openshift-agent]. Ignoring.
W0315 17:30:15.685211       1 node_event_consumer.go:168] Will not start collecting for endpoint in pod [siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a] - cannot build URL. err=Endpoint not assigned an IP address yet
I0315 17:30:15.701890       1 discovery.go:295] DEBUG: Detected a modified pod: siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a
I0315 17:30:15.701922       1 discovery.go:138] DEBUG: Changed pod [siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a] with volume [hawkular-openshift-agent=stom-config] has a config map
W0315 17:30:15.701960       1 node_event_consumer.go:168] Will not start collecting for endpoint in pod [siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a] - cannot build URL. err=Endpoint not assigned an IP address yet
I0315 17:30:19.591596       1 discovery.go:295] DEBUG: Detected a modified pod: siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a
I0315 17:30:19.591613       1 discovery.go:138] DEBUG: Changed pod [siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a] with volume [hawkular-openshift-agent=stom-config] has a config map
I0315 17:30:19.591676       1 metrics_collector_manager.go:118] DEBUG: Collection interval for [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia] is not defined - setting to the default interval [30s]
I0315 17:30:19.591685       1 metrics_collector_manager.go:140] START collecting metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia] every [30s]
I0315 17:30:21.205764       1 prometheus_metrics_collector.go:97] DEBUG: Told to collect all Prometheus metrics from [http://10.129.4.251:8080/metrics]
I0315 17:30:21.214265       1 metrics_storage.go:165] DEBUG: Stored datapoints for [34] metrics
I0315 17:30:42.759272       1 discovery.go:295] DEBUG: Detected a modified pod: siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a
I0315 17:30:42.759289       1 discovery.go:138] DEBUG: Changed pod [siy05ezb/asse-qt9stome1/rnx-intranetsf-slot1-23-3zhrk/0d4a01e5-09a5-11e7-9706-0090fae9dd7a] with volume [hawkular-openshift-agent=stom-config] has a config map
I0315 17:30:42.759316       1 metrics_collector_manager.go:381] STOP collecting metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia]
I0315 17:30:42.759389       1 metrics_collector_manager.go:118] DEBUG: Collection interval for [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia] is not defined - setting to the default interval [30s]
I0315 17:30:42.759399       1 metrics_collector_manager.go:140] START collecting metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia] every [30s]
I0315 17:30:51.205802       1 prometheus_metrics_collector.go:97] DEBUG: Told to collect all Prometheus metrics from [http://10.129.4.251:8080/metrics]
I0315 17:30:51.215282       1 metrics_storage.go:165] DEBUG: Stored datapoints for [34] metrics
I0315 17:31:12.759678       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [4] Jolokia metrics from [https://10.129.4.252:8778/jolokia]
W0315 17:31:12.905554       1 metrics_collector_manager.go:181] Failed to collect metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-3zhrk|jolokia|https://10.129.4.252:8778/jolokia] at [Wed, 15 Mar 2017 17:31:12 +0000]. err=Failed to collect metrics from Jolokia endpoint [https://10.129.4.252:8778/jolokia]. err=Post https://10.129.4.252:8778/jolokia: x509: cannot validate certificate for 10.129.4.252 because it doesn't contain any IP SANs

It's obviously complaining about the certificates...
I also get an error when trying with curl from a master:

[root@siy05ez1:~]# curl -vk https://10.129.4.252:8778/jolokia
* About to connect() to 10.129.4.252 port 8778 (#0)
*   Trying 10.129.4.252...
* Connected to 10.129.4.252 (10.129.4.252) port 8778 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) NSS: client certificate not found (nickname not specified)

Looking at #84 (comment)

I've created the master-proxy secret in the default namespace from the master:

oc secrets new master-proxy /etc/origin/master/master.proxy-client.crt /etc/origin/master/master.proxy-client.key -n default

I've added the certificates to the HOSA config file:

  config.yaml: |
    kubernetes:
      tenant: ${POD:namespace_name}
    hawkular_server:
      url: https://metrics.osev3-qt9-infra.hpmetier.sf.intra.laposte.fr
      credentials:
        username: secret:openshift-infra/hawkular-metrics-account/hawkular-metrics.username
        password: secret:openshift-infra/hawkular-metrics-account/hawkular-metrics.password
      ca_cert_file: secret:openshift-infra/hawkular-metrics-certificate/hawkular-metrics-ca.certificate
    emitter:
      status_enabled: true
    identity:
      cert_file: /master-proxy/master.proxy-client.crt
      private_key_file: /master-proxy/master.proxy-client.key
    collector:
      minimum_collection_interval: 10s
      default_collection_interval: 30s
      metric_id_prefix: pod/${POD:uid}/custom/
      tags:
        metric_name: ${METRIC:name}
        description: ${METRIC:description}
        units: ${METRIC:units}
        namespace_id: ${POD:namespace_uid}
        namespace_name: ${POD:namespace_name}
        node_name: ${POD:node_name}
        pod_id: ${POD:uid}
        pod_ip: ${POD:ip}
        pod_name: ${POD:name}
        pod_namespace: ${POD:namespace_name}
        hostname: ${POD:hostname}
        host_ip: ${POD:host_ip}
        labels: ${POD:labels}
        type: pod
        collector: hawkular_openshift_agent
        custom_metric: true

and then mounted the certificates in the daemonset:

       volumeMounts:
        - mountPath: /master-proxy
          name: master-proxy
...
      volumes:
      - name: master-proxy
        secret:
          defaultMode: 420
          secretName: master-proxy

Then I modified the metrics configmap to enable TLS insecure mode:

endpoints:
- type: jolokia
  protocol: https
  port: 8778
  path: /jolokia/
  tls:
    skip_certificate_validation: true

But it's not working:

I0315 17:49:33.227925       1 discovery.go:362] DEBUG: Detected a new configmap [rnx-intranetsf-slot1] for namespace [asse-qt9stome1] but has no entry named [hawkular-openshift-agent]. Ignoring.
I0315 17:49:33.835167       1 discovery.go:295] DEBUG: Detected a modified pod: siy05ezd/default/hawkular-openshift-agent-pttf3/2e57c14d-09a7-11e7-9706-0090fae9dd7a
I0315 17:49:33.835188       1 discovery.go:138] DEBUG: Changed pod [siy05ezd/default/hawkular-openshift-agent-pttf3/2e57c14d-09a7-11e7-9706-0090fae9dd7a] with volume [hawkular-openshift-agent=hawkular-openshift-agent-configuration] has a config map
I0315 17:49:33.835228       1 metrics_collector_manager.go:381] STOP collecting metrics from [default|hawkular-openshift-agent-pttf3|prometheus|http://10.130.5.2:8080/metrics]
I0315 17:49:33.835277       1 metrics_collector_manager.go:140] START collecting metrics from [default|hawkular-openshift-agent-pttf3|prometheus|http://10.130.5.2:8080/metrics] every [30s]
I0315 17:50:03.228071       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [4] Jolokia metrics from [https://10.130.5.3:8778/jolokia/]
panic: interface conversion: interface is nil, not float64

goroutine 75 [running]:
panic(0x1233160, 0xc42009c780)
        /home/travis/.gimme/versions/go1.7.5.linux.amd64/src/runtime/panic.go:500 +0x1a1
github.com/hawkular/hawkular-openshift-agent/collector/impl.(*JolokiaMetricsCollector).CollectMetrics(0xc42009c340, 0xc40d970beb, 0x1b5b700, 0x0, 0xc4202fab90, 0x0)
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/impl/jolokia_metrics_collector.go:122 +0x99d
github.com/hawkular/hawkular-openshift-agent/collector/manager.(*MetricsCollectorManager).StartCollecting.func1(0x1b2f300, 0xc42009c340, 0xc42009c400, 0xc420066300, 0x5a, 0xc4200662a0, 0x55, 0xc42058f3b0)
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/manager/metrics_collector_manager.go:176 +0x4fb
created by github.com/hawkular/hawkular-openshift-agent/collector/manager.(*MetricsCollectorManager).StartCollecting
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/manager/metrics_collector_manager.go:370 +0x83a
@jmazzitelli
Contributor

I would have thought just putting in the tls: skip flag would do the trick, without all the other stuff. Did you try just that? Skip the rest and just put that tls: section in your pod's configmap, which should tell the agent to skip cert validation for that Jolokia endpoint. That should have done it, unless there is something else going on that I don't understand.

BTW: just to be clear, that tls: section goes in the pod's configmap - not the agent's own configmap.

@ljuaneda
Author

Yes, I've already tried that.
I can provide the logs if it helps, but as you can see above with curl, Jolokia is requesting a client certificate.

@jmazzitelli
Contributor

Ohhhh, it's a client certificate thing. Yeah, we really need to figure this out. @mwringe you know all about the client cert stuff in OpenShift. I thought we could support that with some configuration.

@mwringe
Contributor

mwringe commented Mar 15, 2017

@jmazzitelli we are still waiting for OpenShift to allow client certificates to be managed by their system. This wouldn't exactly help in this situation either since the EAP containers are configured to only trust the API Proxy connection.

@ljuaneda
Author

That's exactly why I've tried to set the master proxy certs in the HOSA configmap, but I got that strange panic error.

@mwringe
Contributor

mwringe commented Mar 15, 2017

I am fairly certain I tried to use the API proxy client certificates with HOSA at some point early on and it worked, but that was before a lot of changes happened. But this is obviously not something we would advocate or recommend to users.

@ljuaneda
Author

OK, so how do you manage to collect metrics from a Jolokia endpoint exposed by a JBoss EAP container?
We absolutely need this feature, otherwise we will be forced to drop the Jolokia and HOSA solution...
Is there any way to fix this?

@mwringe
Contributor

mwringe commented Mar 15, 2017

We have an example with WildFly that configures the Jolokia agent using usernames and passwords: https://github.com/hawkular/hawkular-openshift-agent/tree/master/examples/jolokia-wildfly-example

Out-of-the-box metric gathering for the EAP middleware containers is something we are still working on. They are currently configured to only accept Jolokia connections from the API Proxy.

While it would be possible for HOSA to read from the API proxy directly, this presents a bottleneck in the API proxy and is not something we would want to use, simply because it cannot scale.

@jmazzitelli
Contributor

Is there any way to fix this?

@ljuaneda I think the way to fix this at this point in time is what @mwringe mentioned - take a look at the HOSA WildFly example and do what it does (i.e. configure the EAP middleware containers the way that example WildFly server is configured). HOSA is able to collect metrics over https/jolokia using that configuration.

I agree with @mwringe here - going through the master API proxy is very bad because there is no way that can scale to any large-sized cluster (it would be a bottleneck to pass every metric collection request for all our EAPs through the master API proxy). So we aren't concentrating on supporting that - unless someone tells us it is OK to use the master API proxy in this manner, that its performance is acceptable when used in such a way, and that we can accept the failure to reach acceptable performance numbers at scale if it comes to that (which we predict it will).

@ljuaneda
Author

I totally agree that collecting metrics through the API proxy is a bad idea, surely leading to bottlenecks, and should be avoided.
But here I was trying to collect directly from the Jolokia container's endpoint with the master-proxy certs it is requesting. It actually works fine through curl, so why does it lead to an error with HOSA?

Anyway, how can you reconfigure the Jolokia endpoint in the EAP image? (Sorry, I'm not quite comfortable with Dockerfile stuff.)

@ljuaneda
Author

By the way, earlier this year I did make it work without extra settings with an AMQ image on OCP 3.3 (no insecure TLS, for example). I guess the Jolokia endpoint is not configured the same way in EAP. Maybe something changed with OCP 3.4, or with the latest releases of HOSA.

@jmazzitelli
Contributor

jmazzitelli commented Mar 15, 2017

UPDATE: ignore this comment in its entirety - I see what caused the panic and it has nothing to do with the SSL setup.

It actually works fine through curl, so why does it lead to an error with HOSA?

Hmm... that is interesting. What does your curl command look like? Are you telling curl some specific things (like -k or other options)?

Now, the fact that Go panics tells me this might be a bug somewhere in Go or the Go libraries we are using. panic: interface conversion: interface is nil, not float64 could also mean there is a bug in the HOSA code itself when doing some pointer or interface conversion.
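
For reference, a plain type assertion on a nil interface value reproduces this exact panic message; a minimal standalone example (not HOSA code):

package main

func main() {
	// A JSON field that was absent decodes into a nil interface value.
	var value interface{} // nil

	// A direct (non "comma-ok") type assertion on a nil interface panics with:
	//   panic: interface conversion: interface is nil, not float64
	_ = value.(float64)
}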

What version of Go are you using, just so we know?

@jmazzitelli
Contributor

Also, @ljuaneda what version of HOSA are you using?

@jmazzitelli
Contributor

Whoa... I think I see what caused the panic. @ljuaneda your SSL setup is correct - you are getting data back, but this is crashing:

https://github.com/hawkular/hawkular-openshift-agent/blob/master/collector/impl/jolokia_metrics_collector.go#L122

	Value:     resp.GetValueAsFloat(),

We are getting data back from the Jolokia REST client, but the value is nil. I need to do some nil checking here. Odd that we aren't getting a value back for that attribute.

OK, so @ljuaneda let me write up a PR and fix this and I'll release. I will try to get this done tonight. It shouldn't be hard.
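
Roughly, the kind of nil check meant here, assuming the Jolokia response exposes its raw value as an interface{} (the helper below is illustrative, not the actual HOSA/Jolokia client API):

package jolokia // illustrative package name, not the real HOSA layout

// getFloatValue returns the response value as a float64 plus a flag that
// reports whether a usable numeric value was present at all.
func getFloatValue(raw interface{}) (float64, bool) {
	if raw == nil {
		// The MBean attribute had no data yet (e.g. LastGcInfo before the first GC run).
		return 0, false
	}
	f, ok := raw.(float64) // the comma-ok form never panics
	return f, ok
}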

@jmazzitelli
Contributor

@ljuaneda just so we can see what Jolokia itself is returning, can you make a REST request using curl to get this data? Post what the JSON looks like. It seems odd we are getting a nil here.

If you put HOSA in trace mode, this will log and hopefully show you the REST requests it is making:

log.Tracef("Making bulk Jolokia request from [%v]:\n%v", url, requests)

To put HOSA in trace mode, go into HOSA's pod definition yaml; in the HOSA command line used to start it up, you'll see "-v 4" is passed in - change that 4 to a "5" to go into trace mode.

I'm just curious why the Jolokia client is giving a nil here.

@ljuaneda
Author

OK, I'll do that tomorrow morning. It's close to midnight here in France, so it'll have to wait till I'm at work, but I'm sure we'll make it work quite shortly ;-)

jmazzitelli added a commit to jmazzitelli/hawkular-openshift-agent that referenced this issue Mar 15, 2017
@jmazzitelli
Contributor

OK, I pushed a fix. I have a feeling this needs to be fixed further, but for now it at least avoids the panic. It will be pushed to Docker Hub shortly under the "latest" tag - so @ljuaneda when you get a chance in the morning, try that version.

Why I say this needs to be fixed further is that I am wondering if the JMX MBean attribute being collected isn't a true float value. Perhaps it is a float in string form? Or perhaps it is an int? (Maybe it truly is nil?) If it is a number in some other type (like int or String) we won't be collecting it either. I will probably make this code more fault tolerant - e.g. if the value is an int, convert it to float; if it is a String, try to parse it as a float and, if that works, use it as such. I would have thought Jolokia did this kind of conversion (why else would they provide this "GetValueAsFloat" method?), but maybe it doesn't.

But again, this fix at least avoids the panic. It won't store the data point it is getting because the value is nil - we just have to find out why this particular value came back nil from Jolokia.
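
A sketch of the more fault-tolerant conversion described above (a hypothetical helper, not the shipped code):

package jolokia // illustrative only

import "strconv"

// toFloat tries to coerce a decoded JSON value into a float64: it accepts
// floats, common integer types, and numeric strings, and reports anything
// else (including nil) as unusable.
func toFloat(raw interface{}) (float64, bool) {
	switch v := raw.(type) {
	case float64:
		return v, true
	case int:
		return float64(v), true
	case int64:
		return float64(v), true
	case string:
		f, err := strconv.ParseFloat(v, 64)
		return f, err == nil
	default: // nil and anything non-numeric
		return 0, false
	}
}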

@ljuaneda
Author

I'm running Hawkular OpenShift Agent: Version: 1.2.4.Final-SNAPSHOT, Commit: 355a44f

[root@siy05ez1:~]# oc get pod rnx-intranetsf-slot1-23-d9mxk -o json | jq -r '.status.podIP'
10.130.5.3
[root@siy05ez1:~]# curl -ks -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" -H "Content-Type: application/json" https://10.130.5.3:8778/jolokia/  --cert /etc/origin/master/master.proxy-client.crt --key /etc/origin/master/master.proxy-client.key | jq
{
  "timestamp": 1489652765,
  "status": 200,
  "request": {
    "type": "version"
  },
  "value": {
    "protocol": "7.2",
    "config": {
      "maxDepth": "15",
      "maxCollectionSize": "0",
      "maxObjects": "0",
      "discoveryEnabled": "false",
      "agentContext": "/jolokia",
      "historyMaxEntries": "10",
      "agentId": "10.130.5.3-296-31f4a427-jvm",
      "agentType": "jvm",
      "debug": "false",
      "debugMaxEntries": "100"
    },
    "agent": "1.3.2",
    "info": {
      "product": "jboss",
      "vendor": "RedHat",
      "version": "7.5.12.Final-redhat-1"
    }
  }
}

With only tls skip_certificate_validation activated, I see in the logs:

I0316 09:01:49.955960       1 discovery.go:403] DEBUG: Detected a modified configmap [stom-config] for namespace [asse-qt9stome1]
I0316 09:01:49.956331       1 discovery.go:414] TRACE: Modified configmap [stom-config] for namespace [asse-qt9stome1]=ConfigMap[asse-qt9stome1:stom-config]: endpoints:
- type: jolokia
  enabled: "true"
  protocol: https
  port: 8778
  path: /jolokia/
  tls:
    skip_certificate_validation: true
  credentials:
    token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtYW5hZ2VtZW50LWluZnJhIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im1hbmFnZW1lbnQtYWRtaW4tdG9rZW4tZXVteHUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibWFuYWdlbWVudC1hZG1pbiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA4OGQ4N2E5LWIxOGMtMTFlNi05MTZjLTAwNTA1NmI4MGMyNiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDptYW5hZ2VtZW50LWluZnJhOm1hbmFnZW1lbnQtYWRtaW4ifQ.TVOI5O-pEpffk_fZTpJpo9V38ReBMsUQINeaEJjs7wLWzHnbeaAoncVhIVxjMVAvN2uiws_cVVCjJwZixM_QfPzcwjqhNVfFaNq36ya_6fZjZY3uYk8v_3_FJvbBXN2A63SaZ4SdXRrunrrPrU83S7X8A9uuGYPlVi73K4-9N7K9shu7sRtYSxZ0z3AANHxZJKX9-lVfuG589c8TZJPUI6YNviQHYt745Ua5nM9G9fvrxD4OZDB6cJcauoFIE4GwDCa6gx-aq0L0Vvr64ewTki2pZhm2bVt7vizWg7OY4ZfPQHJnbX_nU-KeAcoubuvSSJj51PY3ht3pUOSWUGDtTg
  collection_interval: ""
  tags: {}
  metrics:
  - id: VM Heap Memory Used
    name: java.lang:type=Memory#HeapMemoryUsage#used
    type: gauge
    description: jvm/heap_memory_used
I0316 09:01:49.956349       1 discovery.go:172] DEBUG: Changed configmap [stom-config] for namespace [asse-qt9stome1] affects pod [siy05ezd/asse-qt9stome1/rnx-intranetsf-slot1-23-d9mxk/8a5a25d8-09a7-11e7-9706-0090fae9dd7a] with volume: [hawkular-openshift-agent=stom-config]
I0316 09:01:49.956384       1 metrics_collector_manager.go:381] STOP collecting metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/]
I0316 09:01:49.956503       1 metrics_collector_manager.go:118] DEBUG: Collection interval for [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/] is not defined - setting to the default interval [30s]
I0316 09:01:49.956515       1 metrics_collector_manager.go:140] START collecting metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/] every [30s]
I0316 09:02:19.956755       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [1] Jolokia metrics from [https://10.130.5.3:8778/jolokia/]
I0316 09:02:19.956809       1 jolokia_metrics_collector.go:105] TRACE: Making bulk Jolokia request from [https://10.130.5.3:8778/jolokia/]:
[{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"}]
I0316 09:02:19.956830       1 jolokia_request.go:80] TRACE: Sending bulk Jolokia request to [https://10.130.5.3:8778/jolokia/] with [1] individual requests:
[{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"}]
W0316 09:02:19.974223       1 metrics_collector_manager.go:181] Failed to collect metrics from [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/] at [Thu, 16 Mar 2017 09:02:19 +0000]. err=Failed to collect metrics from Jolokia endpoint [https://10.130.5.3:8778/jolokia/]. err=Post https://10.130.5.3:8778/jolokia/: EOF

@jmazzitelli I think you're right and it is related to the beans.
Reconfiguring HOSA with the master-proxy certs and limiting the scope to the jvm/heap_memory_used gauge, I finally got:

I0316 09:13:58.397717       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [1] Jolokia metrics from [https://10.130.5.3:8778/jolokia/]
I0316 09:13:58.398385       1 jolokia_metrics_collector.go:105] TRACE: Making bulk Jolokia request from [https://10.130.5.3:8778/jolokia/]:
[{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"}]
I0316 09:13:58.398417       1 jolokia_request.go:80] TRACE: Sending bulk Jolokia request to [https://10.130.5.3:8778/jolokia/] with [1] individual requests:
[{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"}]
I0316 09:13:58.516613       1 jolokia_request.go:120] TRACE: Received Jolokia response:
[{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"used","type":"read"},"timestamp":1489655638,"value":2.29667544e+08}]
I0316 09:13:58.516698       1 jolokia_metrics_collector.go:149] TRACE: Jolokia metrics collected from endpoint [https://10.130.5.3:8778/jolokia/]:
{asse-qt9stome1 gauge java.lang:type=Memory#HeapMemoryUsage#used [{2017-03-16 09:13:58.39769815 +0000 UTC 2.29667544e+08 map[]}]}
==TOTAL JOLOKIA METRICS COLLECTED=1
I0316 09:13:58.516750       1 metrics_collector_manager.go:354] TRACE: Need to create/update [1] metric definitions for endpoint [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/]
I0316 09:13:58.516770       1 metrics_collector_manager.go:451] TRACE: Collecting [1] metric details for endpoint [Endpoint: type=[jolokia], enabled=[true], url=[https://10.130.5.3:8778/jolokia/], coll_int=[], tenant=[asse-qt9stome1], tags=[map[]], metrics=[[Metric: id=[VM Heap Memory Used], name=[java.lang:type=Memory#HeapMemoryUsage#used], type=[gauge], units=[], tags=[map[]]]]]
I0316 09:13:58.516821       1 metrics_collector_manager.go:589] TRACE: [1] metric definitions being declared for endpoint [Endpoint: type=[jolokia], enabled=[true], url=[https://10.130.5.3:8778/jolokia/], coll_int=[], tenant=[asse-qt9stome1], tags=[map[]], metrics=[[Metric: id=[VM Heap Memory Used], name=[java.lang:type=Memory#HeapMemoryUsage#used], type=[gauge], units=[], tags=[map[]]]]]
I0316 09:13:58.516886       1 metrics_storage.go:104] TRACE: Asked to store metric definition: metrics.MetricDefinition{Tenant:"asse-qt9stome1", Type:"gauge", ID:"pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used", Tags:map[string]string{"pod_ip":"10.130.5.3", "pod_name":"rnx-intranetsf-slot1-23-d9mxk", "collector":"hawkular_openshift_agent", "description":"jvm/heap_memory_used", "pod_id":"8a5a25d8-09a7-11e7-9706-0090fae9dd7a", "labels":"application:rnx-intranetsf-slot1,ccx:rnx,deployment:rnx-intranetsf-slot1-23,deploymentConfig:rnx-intranetsf-slot1,deploymentconfig:rnx-intranetsf-slot1", "hostname":"", "custom_metric":"true", "namespace_name":"asse-qt9stome1", "metric_name":"java.lang:type=Memory#HeapMemoryUsage#used", "node_name":"siy05ezd", "namespace_id":"7f456756-f901-11e6-b8bb-001a64761280", "host_ip":"220.4.104.23", "type":"pod", "pod_namespace":"asse-qt9stome1", "units":""}, RetentionTime:0}
I0316 09:13:58.556251       1 metrics_storage.go:165] DEBUG: Stored datapoints for [1] metrics
I0316 09:13:58.556318       1 metrics_storage.go:168] TRACE: Stored [1] [gauge] datapoints for metric named [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used]: [{2017-03-16 09:13:58.39769815 +0000 UTC 2.29667544e+08 map[]}]

The panic error occurs when I try to collect java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration

I0316 09:19:21.420466       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [4] Jolokia metrics from [https://10.130.5.3:8778/jolokia/]
I0316 09:19:21.421133       1 jolokia_metrics_collector.go:105] TRACE: Making bulk Jolokia request from [https://10.130.5.3:8778/jolokia/]:
[{"type":"read","mbean":"java.lang:type=Threading","attribute":"ThreadCount"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"max"},{"type":"read","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","attribute":"LastGcInfo","path":"duration"}]
I0316 09:19:21.421155       1 jolokia_request.go:80] TRACE: Sending bulk Jolokia request to [https://10.130.5.3:8778/jolokia/] with [4] individual requests:
[{"type":"read","mbean":"java.lang:type=Threading","attribute":"ThreadCount"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"max"},{"type":"read","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","attribute":"LastGcInfo","path":"duration"}]
I0316 09:19:21.526626       1 jolokia_request.go:120] TRACE: Received Jolokia response:
[{"status":200,"request":{"attribute":"ThreadCount","mbean":"java.lang:type=Threading","type":"read"},"timestamp":1489655961,"value":162},{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"used","type":"read"},"timestamp":1489655961,"value":2.81993432e+08},{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"max","type":"read"},"timestamp":1489655961,"value":5.3215232e+08},{"status":200,"request":{"attribute":"LastGcInfo","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","path":"duration","type":"read"},"timestamp":1489655961}]
panic: interface conversion: interface is nil, not float64

goroutine 108 [running]:
panic(0x1233160, 0xc4202834c0)
        /home/travis/.gimme/versions/go1.7.5.linux.amd64/src/runtime/panic.go:500 +0x1a1
github.com/hawkular/hawkular-openshift-agent/collector/impl.(*JolokiaMetricsCollector).CollectMetrics(0xc420619100, 0xc4190f514d, 0x1b5b700, 0x0, 0xc4204c6b90, 0xc42030fc30)
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/impl/jolokia_metrics_collector.go:122 +0x99d
github.com/hawkular/hawkular-openshift-agent/collector/manager.(*MetricsCollectorManager).StartCollecting.func1(0x1b2f300, 0xc420619100, 0xc420619140, 0xc4205d5080, 0x5a, 0xc4205d5020, 0x55, 0xc4201ed220)
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/manager/metrics_collector_manager.go:176 +0x4fb
created by github.com/hawkular/hawkular-openshift-agent/collector/manager.(*MetricsCollectorManager).StartCollecting
        /home/travis/gopath/src/github.com/hawkular/hawkular-openshift-agent/collector/manager/metrics_collector_manager.go:370 +0x83a

By the way, adding the certs to HOSA has another bad side effect: it seems it fails to collect its own metrics from the Prometheus endpoint:

I0316 09:25:48.562580       1 discovery.go:172] DEBUG: Changed configmap [hawkular-openshift-agent-configuration] for namespace [default] affects pod [siy05ezb/default/hawkular-openshift-agent-11vyg/cd22f65e-0a28-11e7-9706-0090fae9dd7a] with volume: [hawkular-openshift-agent=hawkular-openshift-agent-configuration]
I0316 09:25:48.562592       1 discovery.go:365] DEBUG: Detected a new configmap [haproxy-config] for namespace [default]
I0316 09:25:48.562667       1 metrics_collector_manager.go:140] START collecting metrics from [default|hawkular-openshift-agent-11vyg|prometheus|http://10.129.5.5:8080/metrics] every [30s]
I0316 09:25:48.562674       1 discovery.go:375] TRACE: Added configmap [haproxy-config] for namespace [default]=ConfigMap[default:haproxy-config]: endpoints:
- type: prometheus
  enabled: "true"
  protocol: http
  port: 9101
  path: /metrics
  tls: {}
  credentials: {}
  collection_interval: 30s
  tags: {}
  metrics: []
I0316 09:25:48.562682       1 discovery.go:175] DEBUG: Changed configmap [haproxy-config] for namespace [default] does not affect pod [siy05ezb/default/hawkular-openshift-agent-11vyg/cd22f65e-0a28-11e7-9706-0090fae9dd7a]
I0316 09:26:18.563499       1 prometheus_metrics_collector.go:97] DEBUG: Told to collect all Prometheus metrics from [http://10.129.5.5:8080/metrics]
W0316 09:26:18.564122       1 metrics_collector_manager.go:181] Failed to collect metrics from [default|hawkular-openshift-agent-11vyg|prometheus|http://10.129.5.5:8080/metrics] at [Thu, 16 Mar 2017 09:26:18 +0000]. err=Failed to collect Prometheus metrics from [http://10.129.5.5:8080/metrics]. err=Cannot scrape Prometheus URL [http://10.129.5.5:8080/metrics]: err=Get http://10.129.5.5:8080/metrics: dial tcp 10.129.5.5:8080: getsockopt: connection refused

... leading all the HOSA pods to fail:

[root@siy05ez1:~]# oc get pods -l metrics-infra=agent
NAME                             READY     STATUS             RESTARTS   AGE
hawkular-openshift-agent-11vyg   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-67dhv   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-68km4   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-82qac   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-95zht   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-a479w   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-au8os   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-k4y5r   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-tyh2m   0/1       CrashLoopBackOff   7          15m
hawkular-openshift-agent-xph04   0/1       CrashLoopBackOff   7          15m

@ljuaneda
Author

@jmazzitelli much better with f5857cc, the panic error is gone 🥇

DEBUG: Received non-float value [<nil>] for metric [java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration]

I0316 10:11:15.668939       1 jolokia_metrics_collector.go:82] DEBUG: Told to collect [4] Jolokia metrics from [https://10.130.5.3:8778/jolokia/]
I0316 10:11:15.669716       1 jolokia_metrics_collector.go:105] TRACE: Making bulk Jolokia request from [https://10.130.5.3:8778/jolokia/]:
[{"type":"read","mbean":"java.lang:type=Threading","attribute":"ThreadCount"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"max"},{"type":"read","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","attribute":"LastGcInfo","path":"duration"}]
I0316 10:11:15.669751       1 jolokia_request.go:80] TRACE: Sending bulk Jolokia request to [https://10.130.5.3:8778/jolokia/] with [4] individual requests:
[{"type":"read","mbean":"java.lang:type=Threading","attribute":"ThreadCount"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"used"},{"type":"read","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","path":"max"},{"type":"read","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","attribute":"LastGcInfo","path":"duration"}]
I0316 10:11:15.769652       1 jolokia_request.go:120] TRACE: Received Jolokia response:
[{"status":200,"request":{"attribute":"ThreadCount","mbean":"java.lang:type=Threading","type":"read"},"timestamp":1489659075,"value":162},{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"used","type":"read"},"timestamp":1489659075,"value":1.80687088e+08},{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"max","type":"read"},"timestamp":1489659075,"value":5.3215232e+08},{"status":200,"request":{"attribute":"LastGcInfo","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","path":"duration","type":"read"},"timestamp":1489659075}]
I0316 10:11:15.769672       1 jolokia_metrics_collector.go:136] DEBUG: Received non-float value [<nil>] for metric [java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration] from Jolokia endpoint [https://10.130.5.3:8778/jolokia/].
I0316 10:11:15.769753       1 jolokia_metrics_collector.go:154] TRACE: Jolokia metrics collected from endpoint [https://10.130.5.3:8778/jolokia/]:
{asse-qt9stome1 counter java.lang:type=Threading#ThreadCount [{2017-03-16 10:11:15.668901219 +0000 UTC 162 map[]}]}
{asse-qt9stome1 gauge java.lang:type=Memory#HeapMemoryUsage#used [{2017-03-16 10:11:15.668901219 +0000 UTC 1.80687088e+08 map[]}]}
{asse-qt9stome1 gauge java.lang:type=Memory#HeapMemoryUsage#max [{2017-03-16 10:11:15.668901219 +0000 UTC 5.3215232e+08 map[]}]}
==TOTAL JOLOKIA METRICS COLLECTED=3
I0316 10:11:15.769801       1 metrics_collector_manager.go:359] TRACE: Need to create/update [3] metric definitions for endpoint [asse-qt9stome1|rnx-intranetsf-slot1-23-d9mxk|jolokia|https://10.130.5.3:8778/jolokia/]
I0316 10:11:15.769828       1 metrics_collector_manager.go:456] TRACE: Collecting [3] metric details for endpoint [Endpoint: type=[jolokia], enabled=[true], url=[https://10.130.5.3:8778/jolokia/], coll_int=[], tenant=[asse-qt9stome1], tags=[map[]], metrics=[[Metric: id=[VM Thread Count], name=[java.lang:type=Threading#ThreadCount], type=[counter], units=[], tags=[map[]] Metric: id=[VM Heap Memory Used], name=[java.lang:type=Memory#HeapMemoryUsage#used], type=[gauge], units=[], tags=[map[]] Metric: id=[VM Heap Memory Max], name=[java.lang:type=Memory#HeapMemoryUsage#max], type=[gauge], units=[], tags=[map[]] Metric: id=[VM GC Marksweep duration], name=[java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration], type=[gauge], units=[], tags=[map[]]]]]
I0316 10:11:15.769922       1 metrics_collector_manager.go:598] TRACE: [3] metric definitions being declared for endpoint [Endpoint: type=[jolokia], enabled=[true], url=[https://10.130.5.3:8778/jolokia/], coll_int=[], tenant=[asse-qt9stome1], tags=[map[]], metrics=[[Metric: id=[VM Thread Count], name=[java.lang:type=Threading#ThreadCount], type=[counter], units=[], tags=[map[]] Metric: id=[VM Heap Memory Used], name=[java.lang:type=Memory#HeapMemoryUsage#used], type=[gauge], units=[], tags=[map[]] Metric: id=[VM Heap Memory Max], name=[java.lang:type=Memory#HeapMemoryUsage#max], type=[gauge], units=[], tags=[map[]] Metric: id=[VM GC Marksweep duration], name=[java.lang:name=PS MarkSweep,type=GarbageCollector#LastGcInfo#duration], type=[gauge], units=[], tags=[map[]]]]]
I0316 10:11:15.769986       1 metrics_storage.go:105] TRACE: Asked to store metric definition: metrics.MetricDefinition{Tenant:"asse-qt9stome1", Type:"counter", ID:"pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Thread Count", Tags:map[string]string{"pod_id":"8a5a25d8-09a7-11e7-9706-0090fae9dd7a", "namespace_id":"7f456756-f901-11e6-b8bb-001a64761280", "pod_ip":"10.130.5.3", "units":"", "hostname":"", "labels":"ccx:rnx,deployment:rnx-intranetsf-slot1-23,deploymentConfig:rnx-intranetsf-slot1,deploymentconfig:rnx-intranetsf-slot1,application:rnx-intranetsf-slot1", "collector":"hawkular_openshift_agent", "pod_name":"rnx-intranetsf-slot1-23-d9mxk", "node_name":"siy05ezd", "description":"jvm/thread_count", "pod_namespace":"asse-qt9stome1", "custom_metric":"true", "metric_name":"java.lang:type=Threading#ThreadCount", "host_ip":"220.4.104.23", "type":"pod", "namespace_name":"asse-qt9stome1"}, RetentionTime:0}
I0316 10:11:15.802230       1 metrics_storage.go:121] DEBUG: Deleting obsolete tags from metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Thread Count] of type [counter] in tenant [asse-qt9stome1]
I0316 10:11:15.802270       1 metrics_storage.go:134] DEBUG: Updating tags for metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Thread Count] of type [counter] in tenant [asse-qt9stome1]
I0316 10:11:15.803446       1 metrics_storage.go:167] DEBUG: Stored datapoints for [3] metrics
I0316 10:11:15.803504       1 metrics_storage.go:170] TRACE: Stored [1] [counter] datapoints for metric named [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Thread Count]: [{2017-03-16 10:11:15.668901219 +0000 UTC 162 map[]}]
I0316 10:11:15.803521       1 metrics_storage.go:170] TRACE: Stored [1] [gauge] datapoints for metric named [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used]: [{2017-03-16 10:11:15.668901219 +0000 UTC 1.80687088e+08 map[]}]
I0316 10:11:15.803535       1 metrics_storage.go:170] TRACE: Stored [1] [gauge] datapoints for metric named [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Max]: [{2017-03-16 10:11:15.668901219 +0000 UTC 5.3215232e+08 map[]}]
I0316 10:11:15.806078       1 metrics_storage.go:105] TRACE: Asked to store metric definition: metrics.MetricDefinition{Tenant:"asse-qt9stome1", Type:"gauge", ID:"pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used", Tags:map[string]string{"description":"jvm/heap_memory_used", "custom_metric":"true", "pod_name":"rnx-intranetsf-slot1-23-d9mxk", "units":"", "hostname":"", "metric_name":"java.lang:type=Memory#HeapMemoryUsage#used", "pod_id":"8a5a25d8-09a7-11e7-9706-0090fae9dd7a", "collector":"hawkular_openshift_agent", "pod_ip":"10.130.5.3", "pod_namespace":"asse-qt9stome1", "labels":"ccx:rnx,deployment:rnx-intranetsf-slot1-23,deploymentConfig:rnx-intranetsf-slot1,deploymentconfig:rnx-intranetsf-slot1,application:rnx-intranetsf-slot1", "host_ip":"220.4.104.23", "namespace_name":"asse-qt9stome1", "node_name":"siy05ezd", "type":"pod", "namespace_id":"7f456756-f901-11e6-b8bb-001a64761280"}, RetentionTime:0}
I0316 10:11:15.809890       1 metrics_storage.go:121] DEBUG: Deleting obsolete tags from metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used] of type [gauge] in tenant [asse-qt9stome1]
I0316 10:11:15.809908       1 metrics_storage.go:134] DEBUG: Updating tags for metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Used] of type [gauge] in tenant [asse-qt9stome1]
I0316 10:11:15.812996       1 metrics_storage.go:105] TRACE: Asked to store metric definition: metrics.MetricDefinition{Tenant:"asse-qt9stome1", Type:"gauge", ID:"pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Max", Tags:map[string]string{"pod_namespace":"asse-qt9stome1", "units":"", "pod_id":"8a5a25d8-09a7-11e7-9706-0090fae9dd7a", "hostname":"", "metric_name":"java.lang:type=Memory#HeapMemoryUsage#max", "host_ip":"220.4.104.23", "namespace_name":"asse-qt9stome1", "type":"pod", "labels":"ccx:rnx,deployment:rnx-intranetsf-slot1-23,deploymentConfig:rnx-intranetsf-slot1,deploymentconfig:rnx-intranetsf-slot1,application:rnx-intranetsf-slot1", "namespace_id":"7f456756-f901-11e6-b8bb-001a64761280", "node_name":"siy05ezd", "custom_metric":"true", "pod_name":"rnx-intranetsf-slot1-23-d9mxk", "description":"jvm/heap_memory_max", "collector":"hawkular_openshift_agent", "pod_ip":"10.130.5.3"}, RetentionTime:0}
I0316 10:11:15.816436       1 metrics_storage.go:121] DEBUG: Deleting obsolete tags from metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Max] of type [gauge] in tenant [asse-qt9stome1]
I0316 10:11:15.816452       1 metrics_storage.go:134] DEBUG: Updating tags for metric definition [pod/8a5a25d8-09a7-11e7-9706-0090fae9dd7a/custom/VM Heap Memory Max] of type [gauge] in tenant [asse-qt9stome1]

Now the side effect remains to be dealt with:

I0316 10:18:39.284446       1 metrics_collector_manager.go:386] STOP collecting metrics from [default|hawkular-openshift-agent-078bw|prometheus|http://10.131.5.9:8080/metrics]
I0316 10:18:39.284496       1 metrics_collector_manager.go:140] START collecting metrics from [default|hawkular-openshift-agent-078bw|prometheus|http://10.131.5.9:8080/metrics] every [30s]
I0316 10:19:09.285431       1 prometheus_metrics_collector.go:97] DEBUG: Told to collect all Prometheus metrics from [http://10.131.5.9:8080/metrics]
W0316 10:19:09.285870       1 metrics_collector_manager.go:186] Failed to collect metrics from [default|hawkular-openshift-agent-078bw|prometheus|http://10.131.5.9:8080/metrics] at [Thu, 16 Mar 2017 10:19:09 +0000]. err=Failed to collect Prometheus metrics from [http://10.131.5.9:8080/metrics]. err=Cannot scrape Prometheus URL [http://10.131.5.9:8080/metrics]: err=Get http://10.131.5.9:8080/metrics: dial tcp 10.131.5.9:8080: getsockopt: connection refused

@jmazzitelli
Contributor

OK, so that MBean had no data returned by Jolokia:

{"status":200,"request":{"attribute":"LastGcInfo","mbean":"java.lang:name=PS MarkSweep,type=GarbageCollector","path":"duration","type":"read"},"timestamp":1489655961}

That explains the nil. Compare that with an MBean that does have a value:

{"status":200,"request":{"attribute":"HeapMemoryUsage","mbean":"java.lang:type=Memory","path":"max","type":"read"},"timestamp":1489655961,"value":5.3215232e+08}

So at minimum this fixes the issue where an MBean has no metric data yet.

@jmazzitelli
Contributor

One note: it looks like Jolokia will convert int values to floats, so there is nothing further I would need to do - at least it seems numeric values are getting converted to floats automatically:

{"status":200,"request":{"attribute":"ThreadCount","mbean":"java.lang:type=Threading","type":"read"},"timestamp":1489655961,"value":162}

There the value is an int - 162. And notice we are able to get it from Jolokia using our code that expects float:

I0316 10:11:15.769753       1 jolokia_metrics_collector.go:154] TRACE: Jolokia metrics collected from endpoint [https://10.130.5.3:8778/jolokia/]:
{asse-qt9stome1 counter java.lang:type=Threading#ThreadCount [{2017-03-16 10:11:15.668901219 +0000 UTC 162 map[]}]}

So that's just confirmation that we can handle numeric values of different types and Jolokia will convert it for us.
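
Part of why this works is likely on the Go side rather than in Jolokia itself: assuming the Jolokia client decodes responses with encoding/json into interface{} values, every JSON number arrives as a float64 anyway. A minimal illustration:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var resp map[string]interface{}
	// encoding/json decodes every JSON number into a float64 when the
	// destination is an interface{}, regardless of whether it was written
	// as an integer or a float in the JSON text.
	_ = json.Unmarshal([]byte(`{"value":162}`), &resp)
	fmt.Printf("%T %v\n", resp["value"], resp["value"]) // prints: float64 162
}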

@jmazzitelli
Contributor

1 metrics_collector_manager.go:186] Failed to collect metrics from [default|hawkular-openshift-agent-078bw|prometheus|http://10.131.5.9:8080/metrics] at [Thu, 16 Mar 2017 10:19:09 +0000]. err=Failed to collect Prometheus metrics from [http://10.131.5.9:8080/metrics]. err=Cannot scrape Prometheus URL [http://10.131.5.9:8080/metrics]: err=Get http://10.131.5.9:8080/metrics: dial tcp 10.131.5.9:8080: getsockopt: connection refused

The "connection refused" tells me the metrics endpoint isn't even running?

@ljuaneda What does your agent config map look like? Specifically, the "emitter" section. Here's the Go code to show you what's allowed. The metrics emitter should be on by default, unless "metrics_enabled" is explicitly false. What other settings do you have in your agent config?

Do you see any messages at startup that indicate something is wrong with the metrics emitter?

If it was a problem with the SSL setup, I would have expected some other error - not "getsockopt: connection refused" - that just seems like http://10.131.5.9:8080/metrics is not available.

@ljuaneda
Author

This is the content of the HOSA configmap:

kind: ConfigMap
metadata:
  name: hawkular-openshift-agent-configuration
  namespace: default
  labels:
    metrics-infra: agent
data:
  config.yaml: |
    kubernetes:
      tenant: ${POD:namespace_name}
    hawkular_server:
      url: https://metrics.osev3-qt9-infra.hpmetier.sf.intra.laposte.fr     
      credentials:
        username: secret:openshift-infra/hawkular-metrics-account/hawkular-metrics.username
        password: secret:openshift-infra/hawkular-metrics-account/hawkular-metrics.password
      ca_cert_file: secret:openshift-infra/hawkular-metrics-certificate/hawkular-metrics-ca.certificate
    emitter:
      status_enabled: true
    identity:
      cert_file: /master-proxy/master.proxy-client.crt
      private_key_file: /master-proxy/master.proxy-client.key
    collector:
      minimum_collection_interval: 10s
      default_collection_interval: 30s
      metric_id_prefix: pod/${POD:uid}/custom/
      tags:
        metric_name: ${METRIC:name}
        description: ${METRIC:description}
        units: ${METRIC:units}
        namespace_id: ${POD:namespace_uid}
        namespace_name: ${POD:namespace_name}
        node_name: ${POD:node_name}
        pod_id: ${POD:uid}
        pod_ip: ${POD:ip}
        pod_name: ${POD:name}
        pod_namespace: ${POD:namespace_name}
        hostname: ${POD:hostname}
        host_ip: ${POD:host_ip}
        labels: ${POD:labels}
        type: pod
        collector: hawkular_openshift_agent
        custom_metric: true
  hawkular-openshift-agent: |
    endpoints:
    - type: prometheus
      protocol: "http"
      port: 8080
      path: /metrics
      collection_interval: 30s

@ljuaneda
Author

When you remove the identity section, HOSA can scrape its own metrics; otherwise it fails.
It would be great to move this section to the pod configmap, since it is used to access the pod's endpoint.

@jmazzitelli
Contributor

The "identity" is supposed to identify the agent - a single certificate/private key identifies something and the agent "identity" identifies this agent. So it only has one identity (or it is supposed to). The idea is you assign the agent a certificate/private key and that is the agent's own certificate identifying itself so every HTTPS request to every endpoint the agent makes it will identify itself with that certificate.

This problem is all about this client cert issue I believe - if OpenShift fixes that, the problem can be solved. This has been discussed several times with some OpenShift folks. Hopefully this kind of problem gets solved soon.

That said, because this kind of issue comes up so often, we have issue #75, which we may have to implement because this is really getting to be a problem for people.

@jmazzitelli
Contributor

This issue with the panic on nil jolokia values has been fixed and released (1.3.1.Final). The other issue recently mentioned is tracked in issue #75 already. So we can close this issue.

@ljuaneda
Author

@jmazzitelli, even when I explicitly deactivate all emitter metrics, the HOSA container still ends up failing.

I0316 14:08:15.281103       1 emitter_server.go:50] Agent emitter will NOT emit metrics
I0316 14:08:15.281109       1 emitter_server.go:64] Agent emitter will NOT emit status
I0316 14:08:15.281112       1 emitter_server.go:72] Agent emitter will NOT provide a health probe
I0316 14:08:15.281117       1 emitter_server.go:76] Agent emitter endpoint has been disabled

Unfortunately, this time there is no trace of an error...

@jmazzitelli
Contributor

Disable the Hawkular agent endpoint collection so it doesn't even try to collect metrics from it:

  hawkular-openshift-agent: |
    endpoints:
    - type: prometheus
      enabled: false      <------- ADD THIS
      protocol: "http"

@ljuaneda
Author

ljuaneda commented Mar 16, 2017

I0316 16:49:29.298950       1 discovery.go:175] DEBUG: Changed configmap [haproxy-config] for namespace [default] does not affect pod [siy05eza/default/hawkular-openshift-agent-2mqhd/ee6319d7-0a65-11e7-9706-0090fae9dd7a]
I0316 16:49:29.298962       1 discovery.go:365] DEBUG: Detected a new configmap [hawkular-openshift-agent-configuration] for namespace [default]
I0316 16:49:29.299060       1 discovery.go:375] TRACE: Added configmap [hawkular-openshift-agent-configuration] for namespace [default]=ConfigMap[default:hawkular-openshift-agent-configuration]: endpoints:
- type: prometheus
  enabled: "false"
  protocol: http
  port: 8080
  path: /metrics
  tls: {}
  credentials: {}
  collection_interval: 30s
  tags: {}
  metrics: []
I0316 16:49:29.299071       1 discovery.go:172] DEBUG: Changed configmap [hawkular-openshift-agent-configuration] for namespace [default] affects pod [siy05eza/default/hawkular-openshift-agent-2mqhd/ee6319d7-0a65-11e7-9706-0090fae9dd7a] with volume: [hawkular-openshift-agent=hawkular-openshift-agent-configuration]
I0316 16:49:29.299123       1 node_event_consumer.go:203] Will not start collecting for endpoint [http://10.131.1.32:8080/metrics] in pod [siy05eza/default/hawkular-openshift-agent-2mqhd/ee6319d7-0a65-11e7-9706-0090fae9dd7a] - it has been disabled.

The problem seems to be elsewhere, because HOSA is still falling into CrashLoopBackOff; the liveness probes can't reach http://:8080/health:

Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath                                   Type            Reason          Message
  ---------     --------        -----   ----                    -------------                                   --------        ------          -------
  14m           14m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id a4e96fa8a786; Security:[seccomp=unconfined]
  14m           14m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id a4e96fa8a786
  12m           12m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id a4e96fa8a786: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  12m           12m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id 9370e16f7efd; Security:[seccomp=unconfined]
  12m           12m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id 9370e16f7efd
  11m           11m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id 9370e16f7efd: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  11m           11m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id 6e5cf53a7bf2; Security:[seccomp=unconfined]
  11m           11m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id 6e5cf53a7bf2
  10m           10m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id 6e5cf53a7bf2: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  10m           10m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id a73eebe41698; Security:[seccomp=unconfined]
  10m           10m             1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id a73eebe41698
  9m            9m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id a73eebe41698: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  9m            9m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id 90b9bcc7e61c
  9m            9m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id 90b9bcc7e61c; Security:[seccomp=unconfined]
  8m            8m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id 90b9bcc7e61c: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  8m            8m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Created         Created container with docker id 2237b698e5dc; Security:[seccomp=unconfined]
  8m            8m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Started         Started container with docker id 2237b698e5dc
  7m            7m              1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal          Killing         Killing container with docker id 2237b698e5dc: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  7m            6m              7       {kubelet siy05ezb}                                                      Warning         FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "hawkular-openshift-agent" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=hawkular-openshift-agent pod=hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)"

  5m    5m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Created         Created container with docker id b55942db4860; Security:[seccomp=unconfined]
  5m    5m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Started         Started container with docker id b55942db4860
  5m    5m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Killing         Killing container with docker id b55942db4860: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  5m    2m      13      {kubelet siy05ezb}                                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "hawkular-openshift-agent" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=hawkular-openshift-agent pod=hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)"

  14m   2m      8       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Pulled          Successfully pulled image "hawkular/hawkular-openshift-agent:dev"
  14m   2m      8       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Pulling         pulling image "hawkular/hawkular-openshift-agent:dev"
  2m    2m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Created         Created container with docker id 541ac564bfe6; Security:[seccomp=unconfined]
  2m    2m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Started         Started container with docker id 541ac564bfe6
  13m   1m      10      {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Warning Unhealthy       Liveness probe failed: Get http://10.129.5.31:8080/health: dial tcp 10.129.5.31:8080: getsockopt: connection refused
  1m    1m      1       {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Normal  Killing         Killing container with docker id 541ac564bfe6: pod "hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)" container "hawkular-openshift-agent" is unhealthy, it will be killed and re-created.
  7m    1s      30      {kubelet siy05ezb}      spec.containers{hawkular-openshift-agent}       Warning BackOff         Back-off restarting failed docker container
  1m    1s      10      {kubelet siy05ezb}                                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "hawkular-openshift-agent" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=hawkular-openshift-agent pod=hawkular-openshift-agent-53euz_default(e0573232-0a68-11e7-9706-0090fae9dd7a)"

@jmazzitelli
Contributor

jmazzitelli commented Mar 16, 2017

Based on your previous log output:

I0316 14:08:15.281112       1 emitter_server.go:72] Agent emitter will NOT provide a health probe

You disabled the health probe endpoint - so yes the agent pod will look like it is not ready.

If you disable the health probe endpoint, you must remove the health probe definition in your agent's pod yaml definition.

@ljuaneda
Author

ljuaneda commented Mar 16, 2017 via email

@jmazzitelli
Contributor

sigh I know why: https://github.com/hawkular/hawkular-openshift-agent/blob/master/emitter/emitter_server.go#L99

What is happening is the agent is given its identity, so when it creates its server for accepting requests for health and metrics, it creates that server behind HTTPS with its identity as its server cert. But apparently that isn't good enough (I believe the cert you are using is a client cert, which probably can't be used here).
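
Roughly what that emitter code does, assuming it serves its health/metrics endpoints with Go's standard library (an illustrative sketch, not the exact HOSA code):

package emitter // illustrative only

import "net/http"

// serveEmitter exposes the agent's health/metrics endpoints over HTTPS,
// reusing the agent's identity cert/key as the server certificate.
func serveEmitter(addr, certFile, keyFile string, handler http.Handler) error {
	// A client-only certificate may not be usable as a server certificate
	// here, which matches the speculation above.
	return http.ListenAndServeTLS(addr, certFile, keyFile, handler)
}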

I'm just guessing for now. But rather than continue this discussion in this git issue (which is fixed - the panic when collecting the Jolokia metric no longer happens), create another git issue about this or start a thread on hawkular-dev (which might be better because then everyone will hear what is going on with this).
