This repository has been archived by the owner on Oct 17, 2024. It is now read-only.

klusterlet-work-agent instability #109

Open
hatfieldbrian opened this issue Apr 13, 2021 · 3 comments

Comments


hatfieldbrian commented Apr 13, 2021

Commands used to deploy a hub and managed cluster using minikube:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/cluster-registry/master/cluster-registry-crd.yaml
git clone https://github.com/open-cluster-management/registration-operator.git
cd ./registration-operator
git checkout master
make deploy-hub
kubectl wait deployments --all --for condition=Available -n olm --timeout 1m
kubectl wait deployments --all --for condition=Available -n open-cluster-management
kubectl wait deployments --all --for condition=Available -n open-cluster-management-hub --timeout 2m
minikube start --driver=kvm2 --profile=cluster1
kubectl config view --flatten --context=hub --minify >/tmp/hub-config
kubectl config view --flatten --context=cluster1 --minify >/tmp/cluster1-config
kubectl config use-context cluster1
export KLUSTERLET_KIND_KUBECONFIG=/tmp/cluster1-config
export KIND_CLUSTER=cluster1
export HUB_KIND_KUBECONFIG=/tmp/hub-config
make deploy-spoke-kind
kubectl wait deployments --all --for condition=Available -n open-cluster-management
kubectl wait deployments --all --for condition=Available -n open-cluster-management-agent
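For reference, one more check that could be added after make deploy-spoke-kind (a sketch; it assumes the default Klusterlet resource name "klusterlet" created by the repo's sample manifests):

kubectl --context cluster1 get klusterlet klusterlet -o yaml   # inspect status.conditions for errors applying the agents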

Pod state gathered with kubectl get pods -n open-cluster-management-agent:

NAME                                             READY   STATUS             RESTARTS   AGE
klusterlet-registration-agent-85f68b68c4-hr7ln   1/1     Running            0          32m
klusterlet-registration-agent-85f68b68c4-pkg98   1/1     Running            0          32m
klusterlet-registration-agent-85f68b68c4-t756z   1/1     Running            0          32m
klusterlet-work-agent-786b47fb99-7cp4x           0/1     CrashLoopBackOff   7          32m
klusterlet-work-agent-786b47fb99-8wcdb           0/1     CrashLoopBackOff   7          32m
klusterlet-work-agent-786b47fb99-vx2qk           0/1     CrashLoopBackOff   7          32m

$ minikube profile list
|----------|-----------|---------|----------------|------|---------|---------|-------|
| Profile  | VM Driver | Runtime |       IP       | Port | Version | Status  | Nodes |
|----------|-----------|---------|----------------|------|---------|---------|-------|
| cluster1 | kvm2      | docker  | 192.168.39.92  | 8443 | v1.20.2 | Running |     1 |
| hub      | kvm2      | docker  | 192.168.39.156 | 8443 | v1.20.2 | Running |     1 |
|----------|-----------|---------|----------------|------|---------|---------|-------|

$ kubectl describe pod klusterlet-work-agent-786b47fb99-7cp4x -n open-cluster-management-agent
Name:         klusterlet-work-agent-786b47fb99-7cp4x
Namespace:    open-cluster-management-agent
Priority:     0
Node:         cluster1/192.168.39.92
Start Time:   Tue, 13 Apr 2021 08:16:03 -0700
Labels:       app=klusterlet-manifestwork-agent
              pod-template-hash=786b47fb99
Annotations:  <none>
Status:       Running
IP:           172.17.0.16
IPs:
  IP:           172.17.0.16
Controlled By:  ReplicaSet/klusterlet-work-agent-786b47fb99
Containers:
  klusterlet-manifestwork-agent:
    Container ID:  docker://0247a6d965fc6040fab94d2cacf7bfbea915ac92b3b7cb54ad1e2c54da05ceb9
    Image:         quay.io/open-cluster-management/work
    Image ID:      docker-pullable://quay.io/open-cluster-management/work@sha256:e60ddb4fdcb24932816118f564748126ff032f49354f4d430d2d46a9df22a3f5
    Port:          <none>
    Host Port:     <none>
    Args:
      /work
      agent
      --spoke-cluster-name=cluster1
      --hub-kubeconfig=/spoke/hub-kubeconfig/kubeconfig
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 13 Apr 2021 08:46:26 -0700
      Finished:     Tue, 13 Apr 2021 08:47:36 -0700
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:        100m
      memory:     128Mi
    Liveness:     http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /spoke/hub-kubeconfig from hub-kubeconfig-secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from klusterlet-work-sa-token-ls59n (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  hub-kubeconfig-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-kubeconfig-secret
    Optional:    false
  klusterlet-work-sa-token-ls59n:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  klusterlet-work-sa-token-ls59n
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  40m                   default-scheduler  Successfully assigned open-cluster-management-agent/klusterlet-work-agent-786b47fb99-7cp4x to cluster1
  Normal   Pulled     39m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 29.427044917s
  Normal   Pulled     38m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.508110447s
  Normal   Pulled     37m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.442129606s
  Normal   Pulled     33m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.516616046s
  Normal   Started    33m (x4 over 39m)     kubelet            Started container klusterlet-manifestwork-agent
  Warning  Unhealthy  25m                   kubelet            Readiness probe failed: Get "https://172.17.0.16:8443/healthz": dial tcp 172.17.0.16:8443: connect: connection refused
  Normal   Pulled     25m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.41603949s
  Normal   Created    25m (x5 over 39m)     kubelet            Created container klusterlet-manifestwork-agent
  Normal   Pulling    10m (x8 over 40m)     kubelet            Pulling image "quay.io/open-cluster-management/work"
  Warning  BackOff    5m17s (x72 over 37m)  kubelet            Back-off restarting failed container


$ kubectl logs -p klusterlet-work-agent-786b47fb99-7cp4x -n open-cluster-management-agent
W0413 15:46:26.585511       1 cmd.go:204] Using insecure, self-signed certificates
I0413 15:46:27.236528       1 observer_polling.go:159] Starting file observer
I0413 15:46:27.309689       1 builder.go:240] work-agent version v2.3.0-2021-03-29-10-25-04-2-g4832b07-4832b07
W0413 15:46:27.803860       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0413 15:46:27.803995       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0413 15:46:27.807585       1 leaderelection.go:243] attempting to acquire leader lease open-cluster-management-agent/work-agent-lock...
I0413 15:46:27.815226       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:46:27.816318       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:46:27.816743       1 secure_serving.go:197] Serving securely on [::]:8443
I0413 15:46:27.817189       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0413 15:46:27.815911       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:46:27.817987       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:46:27.815926       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.819358       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.816294       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-328065387/tls.crt::/tmp/serving-cert-328065387/tls.key
I0413 15:46:27.917412       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0413 15:46:27.920012       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.920228       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:47:36.163633       1 leaderelection.go:253] successfully acquired lease open-cluster-management-agent/work-agent-lock
I0413 15:47:36.169819       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"open-cluster-management-agent", Name:"work-agent-lock", UID:"972836a0-5311-4022-8f44-09e7170d5a61", APIVersion:"v1", ResourceVersion:"3585", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' klusterlet-work-agent-786b47fb99-7cp4x_ce3f539c-5804-4d88-997f-7ba502de845d became leader
W0413 15:47:36.188407       1 builder.go:99] graceful termination failed, controllers failed with error: invalid configuration: [unable to read client-cert /spoke/hub-kubeconfig/tls.crt for default-auth due to open /spoke/hub-kubeconfig/tls.crt: no such file or directory, unable to read client-key /spoke/hub-kubeconfig/tls.key for default-auth due to open /spoke/hub-kubeconfig/tls.key: no such file or directory]


$ go version
go version go1.16.2 linux/amd64
$ kustomize version
{Version:kustomize/v4.0.5 GitCommit:9e8e7a7fe99ec9fbf801463e8607928322fc5245 BuildDate:2021-03-08T20:53:03Z GoOs:linux GoArch:amd64}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
$ minikube version
minikube version: v1.18.1
commit: 09ee84d530de4a92f00f1c5dbc34cead092b95bc

I wonder whether the last line of the logs explains why graceful termination failed, or why the termination occurred in the first place.
I'm willing to run more commands to debug. This issue can be reproduced easily.
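If the last line is the real cause, the missing files live under the hub-kubeconfig-secret mount shown in the describe output above, so two checks seem worth running (a sketch; the idea that tls.crt/tls.key only appear after the hub approves the registration CSR is an assumption on my part):

kubectl describe secret hub-kubeconfig-secret -n open-cluster-management-agent   # which keys does the secret actually contain?
kubectl --context hub get csr                                                     # any CSRs from cluster1 still pending approval?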

The master branch was used. Head commit: daf5b5cf51a7da18243c6f99cee94f8197f53b90


hatfieldbrian commented Apr 14, 2021

Tried this on an Ubuntu 18.04.4 LTS bare-metal instance and saw a different "controllers failed" reason:

W0414 14:03:26.266129       1 cmd.go:204] Using insecure, self-signed certificates
I0414 14:03:26.770798       1 observer_polling.go:159] Starting file observer
I0414 14:03:26.797629       1 builder.go:240] work-agent version v2.3.0-2021-03-29-10-25-04-2-g4832b07-4832b07
W0414 14:03:27.120747       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0414 14:03:27.120760       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0414 14:03:27.121423       1 leaderelection.go:243] attempting to acquire leader lease open-cluster-management-agent/work-agent-lock...
I0414 14:03:27.124234       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0414 14:03:27.124289       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0414 14:03:27.124366       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:03:27.124393       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:03:27.124430       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.124447       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.124950       1 secure_serving.go:197] Serving securely on [::]:8443
I0414 14:03:27.125587       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-199583371/tls.crt::/tmp/serving-cert-199583371/tls.key
I0414 14:03:27.125647       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0414 14:03:27.224518       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0414 14:03:27.224528       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.224573       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:04:35.364901       1 leaderelection.go:253] successfully acquired lease open-cluster-management-agent/work-agent-lock
W0414 14:04:35.365569       1 builder.go:99] graceful termination failed, controllers failed with error: stat /spoke/hub-kubeconfig/kubeconfig: no such file or directory

It's not clear to me whether the file's absence caused the controller termination or only the graceful termination failure; it seems like the former.

hatfieldbrian (Author) commented:

The new error matches the one at https://access.redhat.com/solutions/5355821


hatfieldbrian commented Apr 14, 2021

In my case, however, the inaccessible host is identified by an IP address, not a hostname:

E0414 17:29:47.964440 1 reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.CertificateSigningRequest: failed to list *v1.CertificateSigningRequest: Get "https://192.168.39.149:8443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500&resourceVersion=0": dial tcp 192.168.39.149:8443: connect: connection refused

So it seems it is not a DNS issue.
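A quick way to separate connectivity from DNS (a sketch, assuming this run also used a kvm2 minikube profile named cluster1 and that 192.168.39.149 is the hub API address handed to the agent):

# from the host
curl -k https://192.168.39.149:8443/version
# from inside the managed cluster's VM
minikube ssh --profile=cluster1 -- curl -k https://192.168.39.149:8443/version

Connection refused only from inside the VM would point at routing between the two minikube networks; refused from both would point at the hub API server itself.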
