This repository has been archived by the owner on Oct 17, 2024. It is now read-only.

klusterlet-work-agent instability #109

Open
hatfieldbrian opened this issue Apr 13, 2021 · 3 comments

Comments


hatfieldbrian commented Apr 13, 2021

Commands used to deploy a hub and managed cluster using minikube:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/cluster-registry/master/cluster-registry-crd.yaml
git clone https://github.com/open-cluster-management/registration-operator.git
cd ./registration-operator
git checkout master
make deploy-hub
kubectl wait deployments --all --for condition=Available -n olm --timeout 1m
kubectl wait deployments --all --for condition=Available -n open-cluster-management
kubectl wait deployments --all --for condition=Available -n open-cluster-management-hub --timeout 2m
minikube start --driver=kvm2 --profile=cluster1
kubectl config view --flatten --context=hub --minify >/tmp/hub-config
kubectl config view --flatten --context=cluster1 --minify >/tmp/cluster1-config
kubectl config use-context cluster1
export KLUSTERLET_KIND_KUBECONFIG=/tmp/cluster1-config
export KIND_CLUSTER=cluster1
export HUB_KIND_KUBECONFIG=/tmp/hub-config
make deploy-spoke-kind
kubectl wait deployments --all --for condition=Available -n open-cluster-management
kubectl wait deployments --all --for condition=Available -n open-cluster-management-agent
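For reference, one more check that could be added after make deploy-spoke-kind (a sketch; it assumes the default Klusterlet resource name "klusterlet" created by the repo's sample manifests):

kubectl --context cluster1 get klusterlet klusterlet -o yaml   # inspect status.conditions for errors applying the agents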

Pod state gathered with kubectl get pods -n open-cluster-management-agent:

NAME                                             READY   STATUS             RESTARTS   AGE
klusterlet-registration-agent-85f68b68c4-hr7ln   1/1     Running            0          32m
klusterlet-registration-agent-85f68b68c4-pkg98   1/1     Running            0          32m
klusterlet-registration-agent-85f68b68c4-t756z   1/1     Running            0          32m
klusterlet-work-agent-786b47fb99-7cp4x           0/1     CrashLoopBackOff   7          32m
klusterlet-work-agent-786b47fb99-8wcdb           0/1     CrashLoopBackOff   7          32m
klusterlet-work-agent-786b47fb99-vx2qk           0/1     CrashLoopBackOff   7          32m

$ minikube profile list
|----------|-----------|---------|----------------|------|---------|---------|-------|
| Profile  | VM Driver | Runtime |       IP       | Port | Version | Status  | Nodes |
|----------|-----------|---------|----------------|------|---------|---------|-------|
| cluster1 | kvm2      | docker  | 192.168.39.92  | 8443 | v1.20.2 | Running |     1 |
| hub      | kvm2      | docker  | 192.168.39.156 | 8443 | v1.20.2 | Running |     1 |
|----------|-----------|---------|----------------|------|---------|---------|-------|

$ kubectl describe pod klusterlet-work-agent-786b47fb99-7cp4x -n open-cluster-management-agent
Name:         klusterlet-work-agent-786b47fb99-7cp4x
Namespace:    open-cluster-management-agent
Priority:     0
Node:         cluster1/192.168.39.92
Start Time:   Tue, 13 Apr 2021 08:16:03 -0700
Labels:       app=klusterlet-manifestwork-agent
              pod-template-hash=786b47fb99
Annotations:  <none>
Status:       Running
IP:           172.17.0.16
IPs:
  IP:           172.17.0.16
Controlled By:  ReplicaSet/klusterlet-work-agent-786b47fb99
Containers:
  klusterlet-manifestwork-agent:
    Container ID:  docker://0247a6d965fc6040fab94d2cacf7bfbea915ac92b3b7cb54ad1e2c54da05ceb9
    Image:         quay.io/open-cluster-management/work
    Image ID:      docker-pullable://quay.io/open-cluster-management/work@sha256:e60ddb4fdcb24932816118f564748126ff032f49354f4d430d2d46a9df22a3f5
    Port:          <none>
    Host Port:     <none>
    Args:
      /work
      agent
      --spoke-cluster-name=cluster1
      --hub-kubeconfig=/spoke/hub-kubeconfig/kubeconfig
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 13 Apr 2021 08:46:26 -0700
      Finished:     Tue, 13 Apr 2021 08:47:36 -0700
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:        100m
      memory:     128Mi
    Liveness:     http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /spoke/hub-kubeconfig from hub-kubeconfig-secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from klusterlet-work-sa-token-ls59n (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  hub-kubeconfig-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-kubeconfig-secret
    Optional:    false
  klusterlet-work-sa-token-ls59n:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  klusterlet-work-sa-token-ls59n
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  40m                   default-scheduler  Successfully assigned open-cluster-management-agent/klusterlet-work-agent-786b47fb99-7cp4x to cluster1
  Normal   Pulled     39m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 29.427044917s
  Normal   Pulled     38m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.508110447s
  Normal   Pulled     37m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.442129606s
  Normal   Pulled     33m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.516616046s
  Normal   Started    33m (x4 over 39m)     kubelet            Started container klusterlet-manifestwork-agent
  Warning  Unhealthy  25m                   kubelet            Readiness probe failed: Get "https://172.17.0.16:8443/healthz": dial tcp 172.17.0.16:8443: connect: connection refused
  Normal   Pulled     25m                   kubelet            Successfully pulled image "quay.io/open-cluster-management/work" in 1.41603949s
  Normal   Created    25m (x5 over 39m)     kubelet            Created container klusterlet-manifestwork-agent
  Normal   Pulling    10m (x8 over 40m)     kubelet            Pulling image "quay.io/open-cluster-management/work"
  Warning  BackOff    5m17s (x72 over 37m)  kubelet            Back-off restarting failed container


$ kubectl logs -p klusterlet-work-agent-786b47fb99-7cp4x -n open-cluster-management-agent
W0413 15:46:26.585511       1 cmd.go:204] Using insecure, self-signed certificates
I0413 15:46:27.236528       1 observer_polling.go:159] Starting file observer
I0413 15:46:27.309689       1 builder.go:240] work-agent version v2.3.0-2021-03-29-10-25-04-2-g4832b07-4832b07
W0413 15:46:27.803860       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0413 15:46:27.803995       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0413 15:46:27.807585       1 leaderelection.go:243] attempting to acquire leader lease open-cluster-management-agent/work-agent-lock...
I0413 15:46:27.815226       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:46:27.816318       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:46:27.816743       1 secure_serving.go:197] Serving securely on [::]:8443
I0413 15:46:27.817189       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0413 15:46:27.815911       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:46:27.817987       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:46:27.815926       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.819358       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.816294       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-328065387/tls.crt::/tmp/serving-cert-328065387/tls.key
I0413 15:46:27.917412       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0413 15:46:27.920012       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:46:27.920228       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:47:36.163633       1 leaderelection.go:253] successfully acquired lease open-cluster-management-agent/work-agent-lock
I0413 15:47:36.169819       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"open-cluster-management-agent", Name:"work-agent-lock", UID:"972836a0-5311-4022-8f44-09e7170d5a61", APIVersion:"v1", ResourceVersion:"3585", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' klusterlet-work-agent-786b47fb99-7cp4x_ce3f539c-5804-4d88-997f-7ba502de845d became leader
W0413 15:47:36.188407       1 builder.go:99] graceful termination failed, controllers failed with error: invalid configuration: [unable to read client-cert /spoke/hub-kubeconfig/tls.crt for default-auth due to open /spoke/hub-kubeconfig/tls.crt: no such file or directory, unable to read client-key /spoke/hub-kubeconfig/tls.key for default-auth due to open /spoke/hub-kubeconfig/tls.key: no such file or directory]


$ go version
go version go1.16.2 linux/amd64
$ kustomize version
{Version:kustomize/v4.0.5 GitCommit:9e8e7a7fe99ec9fbf801463e8607928322fc5245 BuildDate:2021-03-08T20:53:03Z GoOs:linux GoArch:amd64}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
$ minikube version
minikube version: v1.18.1
commit: 09ee84d530de4a92f00f1c5dbc34cead092b95bc

I wonder whether the last line of the logs explains why graceful termination failed, or why the termination occurred in the first place.
I'm willing to run more commands to debug. This issue can be reproduced easily.
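If the last line is the real cause, the missing files live under the hub-kubeconfig-secret mount shown in the describe output above, so two checks seem worth running (a sketch; the idea that tls.crt/tls.key only appear after the hub approves the registration CSR is an assumption on my part):

kubectl describe secret hub-kubeconfig-secret -n open-cluster-management-agent   # which keys does the secret actually contain?
kubectl --context hub get csr                                                     # any CSRs from cluster1 still pending approval?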

The master branch was used. Head commit: daf5b5cf51a7da18243c6f99cee94f8197f53b90


hatfieldbrian commented Apr 14, 2021

Tried this on an Ubuntu 18.04.4 LTS bare-metal instance and saw a different "controllers failed" reason:

W0414 14:03:26.266129       1 cmd.go:204] Using insecure, self-signed certificates
I0414 14:03:26.770798       1 observer_polling.go:159] Starting file observer
I0414 14:03:26.797629       1 builder.go:240] work-agent version v2.3.0-2021-03-29-10-25-04-2-g4832b07-4832b07
W0414 14:03:27.120747       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0414 14:03:27.120760       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0414 14:03:27.121423       1 leaderelection.go:243] attempting to acquire leader lease open-cluster-management-agent/work-agent-lock...
I0414 14:03:27.124234       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0414 14:03:27.124289       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0414 14:03:27.124366       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:03:27.124393       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:03:27.124430       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.124447       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.124950       1 secure_serving.go:197] Serving securely on [::]:8443
I0414 14:03:27.125587       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-199583371/tls.crt::/tmp/serving-cert-199583371/tls.key
I0414 14:03:27.125647       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0414 14:03:27.224518       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0414 14:03:27.224528       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0414 14:03:27.224573       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0414 14:04:35.364901       1 leaderelection.go:253] successfully acquired lease open-cluster-management-agent/work-agent-lock
W0414 14:04:35.365569       1 builder.go:99] graceful termination failed, controllers failed with error: stat /spoke/hub-kubeconfig/kubeconfig: no such file or directory

It's not clear to me whether the file's absence caused the controller termination or only the graceful termination failure; it seems like the former.

hatfieldbrian (Author) commented:

The new error matches the one at https://access.redhat.com/solutions/5355821


hatfieldbrian commented Apr 14, 2021

In my case, however, the inaccessible host is identified by an IP address, not a hostname:

E0414 17:29:47.964440 1 reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.CertificateSigningRequest: failed to list *v1.CertificateSigningRequest: Get "https://192.168.39.149:8443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500&resourceVersion=0": dial tcp 192.168.39.149:8443: connect: connection refused

So it seems it is not a DNS issue.
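A quick way to separate connectivity from DNS (a sketch, assuming this run also used a kvm2 minikube profile named cluster1 and that 192.168.39.149 is the hub API address handed to the agent):

# from the host
curl -k https://192.168.39.149:8443/version
# from inside the managed cluster's VM
minikube ssh --profile=cluster1 -- curl -k https://192.168.39.149:8443/version

Connection refused only from inside the VM would point at routing between the two minikube networks; refused from both would point at the hub API server itself.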
