
Pod IP not removed from Service EndPoint when ReadinessProbe failed #3725

Open
bob2204 opened this issue Aug 30, 2024 · 15 comments

@bob2204 commented Aug 30, 2024

Hello

With kind 0.24 and kindest/node:1.31.0, the Pod IP is not removed from the Service Endpoint when the ReadinessProbe fails, although the address is listed under NotReadyAddresses in the Endpoint!

This worked fine with kind 0.23 and kindest/node:1.30.2.

Is this normal?

Best Regards

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadunessProbe failed Node IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea (Contributor) commented Aug 30, 2024

You have to add more details and a reproducer; it is not easy to understand from the comments what might be failing there.

@bob2204 (Author) commented Aug 30, 2024

I apologize; what I meant is that the Pod IP was not removed from the Service endpoints.

I use an nginx Deployment with a ReadinessProbe, with this container:

containers:
- image: nginx:1.26
  name: nginx
  readinessProbe:
    httpGet:
      path: /livez
      port: 80
    periodSeconds: 3
    failureThreshold: 2

and a Service like this:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

and when this ReadinessProbe fails, the Pod IP is shown under NotReadyAddresses in the Endpoint:

kubectl describe endpoints nginx 
Name:         lemp
Namespace:    default
Labels:       app=nginx
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-08-30T15:41:43Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.32.204.60
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>

BUT the Pod IP 10.32.204.60 was not removed from the Service Endpoints:

kubectl describe svc nginx 
Name:                     nginx
Namespace:                default
Labels:                   app=nginx
Annotations:              <none>
Selector:                 app=nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.16.42.218
IPs:                      172.16.42.218
LoadBalancer Ingress:     172.18.0.9 (Proxy)
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  31693/TCP
Endpoints:                10.32.204.60:80
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>
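
One way to dig further here (a general debugging sketch, not something from the thread): kube-proxy programs Services from EndpointSlices rather than the legacy Endpoints object, so it is worth comparing the two for the same Service. The service name nginx is taken from the manifests above:

```shell
# kube-proxy consumes EndpointSlices, not the legacy Endpoints object,
# so compare the two views of the same Service ("nginx" as in this thread)
kubectl get endpoints nginx
kubectl get endpointslices -l kubernetes.io/service-name=nginx -o yaml
```

In the slice output, each endpoint carries a conditions.ready field; if it is false there while the Service still forwards traffic to the Pod, the problem is on the kube-proxy side rather than in the EndpointSlice controller.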

With kind 0.23 and kindest/node:1.30.2 everything is OK: the Pod IP is removed from the Service Endpoints when the ReadinessProbe fails.
AND with a K8s cluster of 3 VMs running 1.31.0, everything is OK too!

Is my English clear?

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadinessProbe failed Pod IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea (Contributor) commented Aug 31, 2024

Just to understand: this works in Kubernetes versions 1.30 and 1.31, and only fails with Node 1.31.0?

@bob2204 (Author) commented Aug 31, 2024

After further investigation, I found that whatever the Kubernetes version, the problem seems to be the VirtualBox environment.
I have two identical kind installations -- kind 0.24.0, kindest/node:1.31.0, Calico 3.28.0 -- one on a physical machine, one on a VirtualBox VM.

Any explanation?

@aojea (Contributor) commented Aug 31, 2024

Is kubectl the same version?

What difference does it make for kind to run on top of VirtualBox or a VM? It just uses Docker containers.

Are you doing anything out of the ordinary? Adding custom nodes or a different kind configuration?

@bob2204 (Author) commented Aug 31, 2024

kubectl is the same version.
The two installs are identical.
Both have the same Calico CNI version, 3.28.
Both installs use Docker.
The only difference is physical machine vs. virtual machine.

@BenTheElder (Member) commented

Do you observe this without Calico? We don't really provide support for third-party CNIs (installing one is supposed to be possible, but we're not tracking down bugs with all of them).

@bob2204 (Author) commented Sep 3, 2024

With Calico/Cilium/kindnet I see the same behavior.
With VirtualBox/VMware/KVM, the same.
With Killercoda everything is fine! For me it serves as a control.

I've simply tried

kind create cluster --config=config.yml

with one control plane and three workers.
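
For reference, a minimal config.yml for that topology might look like the following (a sketch; the reporter's actual file is not attached, so the cluster name and any extra options are unknown):

```shell
# Write a kind config with one control-plane node and three workers
# (a guess at the reporter's config.yml; not taken from the issue)
cat > config.yml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
```

The cluster is then created with kind create cluster --config=config.yml, as above.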

@aojea (Contributor) commented Sep 3, 2024

Can you upload a tarball with the logs of the cluster that has the issue (kind export logs) and indicate the name of the Service and the approximate time when the problem happens?

@bob2204 (Author) commented Sep 3, 2024

full-logs.tar.gz
Service name: nginx
UTC Time: 2024-09-03T18:43:56Z

@bob2204 (Author) commented Sep 3, 2024

Manifests used:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        ports:
          - name: http
            containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
          periodSeconds: 2
          failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: nginx

Then I create or delete /usr/share/nginx/html/healthz to make the ReadinessProbe pass or fail.
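
That toggle can be scripted; a sketch, assuming the Deployment above is running and kubectl points at the affected kind cluster:

```shell
# Make the probe fail by removing the file it serves
kubectl exec deploy/nginx -- rm -f /usr/share/nginx/html/healthz
# periodSeconds=2 * failureThreshold=2 => the Pod should be NotReady within ~4s
sleep 6
kubectl describe endpoints nginx   # the address should move to NotReadyAddresses
# Make the probe pass again
kubectl exec deploy/nginx -- sh -c 'echo ok > /usr/share/nginx/html/healthz'
```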

@aojea (Contributor) commented Sep 5, 2024

full-logs.tar.gz Service name: nginx UTC Time: 2024-09-03T18:43:56Z

That does not add up: the nginx container starts at 18:44

Sep 03 18:44:01 stage-worker2 containerd[185]: time="2024-09-03T18:44:01.642418279Z" level=info msg="StartContainer for "0f8fa2821ddca5ce36b9ee686d36e60cf6ffa18b665585c663fe9f4baef699d0" returns successfully"

and there are no more logs after that. You have period 2 and threshold 2, so it should start failing at 18:44:05, but there are no logs there.

I noticed that your environment has only 2 GB of RAM in the VM; it would not be surprising if the problem is that your VMs are constrained and everything is slower in that environment.

@bob2204 (Author) commented Sep 5, 2024

I'm sorry to have wasted your time, but the problem remains the same with 8 GB!
This is the new dump:
full-log-2.tar.gz

The time was around 11:40-11:50 UTC.

k describe ep,svc nginx 
Name:         nginx
Namespace:    default
Labels:       <none>
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-09-05T11:48:23Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.244.2.3       <<<< This shows that the IP is not Ready 
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>


Name:                     nginx
Namespace:                default
Labels:                   <none>
Annotations:              <none>
Selector:                 app=nginx
Type:                     ClusterIP
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.96.88.94
IPs:                      10.96.88.94
Port:                     <unset>  80/TCP
TargetPort:               http/TCP
Endpoints:                10.244.2.3:80         <<<< Should NOT be here because the IP is not Ready
Session Affinity:         None
Internal Traffic Policy:  Cluster
Events:                   <none>

@aojea (Contributor) commented Sep 5, 2024

@bob2204 it looks like the kubelet is continuously restarting... if you have the cluster running, can you verify that?
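
From the Docker host this can be checked without entering each node; a sketch (the node names are assumptions inferred from the stage-worker2 name in the logs above):

```shell
# Ask systemd inside each kind node container how often kubelet was restarted
# (node names assumed from the "stage" cluster seen in this thread's logs)
for node in stage-control-plane stage-worker stage-worker2 stage-worker3; do
  echo -n "$node: "
  docker exec "$node" systemctl show kubelet -p NRestarts
done
```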

@bob2204 (Author) commented Sep 6, 2024

None of the three kubelets is continuously restarting.
This is the output of systemctl status kubelet on one node; the others are the same:

root@stage-worker2:/# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf, 11-kind.conf
     Active: active (running) since Thu 2024-09-05 11:44:26 UTC; 14h ago
       Docs: http://kubernetes.io/docs/
    Process: 197 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then /kind/bin/create-kubelet-cgroup-v2.sh; fi (code=exited, status=0/SUCCESS)
    Process: 198 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
   Main PID: 199 (kubelet)
      Tasks: 12 (limit: 9425)
     Memory: 43.2M
        CPU: 7min 5.993s
     CGroup: /kubelet.slice/kubelet.service
             └─199 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.3 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.10 --provider-id=kind://docker/stage/stage-worker2 --runtime-cgroups=/system.slice/containerd.service
