Konnectivity server leaks memory and free sockets #255

Open
Tamas-Biro1 opened this issue Aug 10, 2021 · 26 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Tamas-Biro1
Contributor

Hello,

We are using ANP version 0.0.21 with patch #179 applied (I know it is already in 0.0.22, but upgrading is currently not an option for us).

We have a cluster with 1 ANP server and 4 ANP agents. Sometimes we see the server start consuming a lot of memory; once it reaches the limit it restarts, and the whole cycle starts again. The same is true for the number of free sockets. Here are the Prometheus graphs for memory and socket counts:

[screenshots: Prometheus graphs of memory usage and free socket count]

In the server logs I only see the "usual" messages that could also be seen in previous versions:

I0806 18:25:02.385941       1 server.go:736] "Close streaming" agentID="94a18c0b-efdf-476b-a8fa-10e0c0057345" connectionID=57
I0806 18:25:02.388576       1 server.go:723] "Received CLOSE_RSP" connectionID=85
E0806 18:25:02.388662       1 server.go:731] "CLOSE_RSP send to client stream error" err="tls: use of closed connection" connectionID=85
I0806 18:25:02.388764       1 server.go:283] "Remove frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc001ff1c00 connected:0xc0000afbc0 connectID:85 agentID:1c6f546a-3e02-46fc-a061-dcb014d7969e start:{wall:13851803065787238258 ext:26239485524435 loc:0x218a780} backend:0xc000952fa0} agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=85
I0806 18:25:02.388843       1 server.go:736] "Close streaming" agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=85

and

I0806 18:25:02.428722       1 server.go:680] "Received DIAL_RSP" random=2394930260326768039 agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=91
I0806 18:25:02.428840       1 server.go:262] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc000d33500 connected:0xc0002f16e0 connectID:91 agentID:1c6f546a-3e02-46fc-a061-dcb014d7969e start:{wall:13851803065793093486 ext:26239491382902 loc:0x218a780} backend:0xc000952fa0} agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=91
I0806 18:25:02.428960       1 server.go:709] "Received data from agent" bytes=840 agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=79
I0806 18:25:02.429150       1 server.go:718] "DATA sent to frontend"
I0806 18:25:02.429187       1 tunnel.go:121] "Starting proxy to host" host="172.17.57.44:9090"
...
I0806 18:25:02.447214       1 tunnel.go:161] "Stopping transfer to host" host="172.17.57.44:9090" agentID="94a18c0b-efdf-476b-a8fa-10e0c0057345" connectionID=61
I0806 18:25:02.450691       1 server.go:723] "Received CLOSE_RSP" connectionID=91
E0806 18:25:02.450768       1 server.go:731] "CLOSE_RSP send to client stream error" err="tls: use of closed connection" connectionID=91
I0806 18:25:02.450834       1 server.go:283] "Remove frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc000d33500 connected:0xc0002f16e0 connectID:91 agentID:1c6f546a-3e02-46fc-a061-dcb014d7969e start:{wall:13851803065793093486 ext:26239491382902 loc:0x218a780} backend:0xc000952fa0} agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=91
I0806 18:25:02.450866       1 server.go:736] "Close streaming" agentID="1c6f546a-3e02-46fc-a061-dcb014d7969e" connectionID=91

Have you faced this problem? Any idea what should be changed or investigated? Thanks!

@cheftako
Contributor

One issue I'm following up on right now has to do with losing connections during a dial timeout. The most obvious symptom would be lines in the server log containing the string connectionID=0. Do you see this? (I have a fix for gRPC checked in, but I think a bit more needs to be done for http-connect, which I believe is what you are using.)

@cheftako
Contributor

It would also be interesting to see what pprof says as you're ramping up on memory and sockets.
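For reference, a minimal sketch (not the proxy-server's own wiring; the listen address here is arbitrary) of how a Go process exposes pprof endpoints so heap and goroutine profiles can be captured while memory and socket counts are ramping up:

// Minimal sketch, not the actual proxy-server code: importing net/http/pprof
// registers /debug/pprof/* handlers on the default mux, which is what the
// profile collection below relies on.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // Once this is listening, profiles can be pulled with, e.g.:
    //   go tool pprof http://127.0.0.1:6060/debug/pprof/heap
    //   go tool pprof http://127.0.0.1:6060/debug/pprof/goroutine
    log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}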

@Tamas-Biro1
Contributor Author

@cheftako yes, we are using http-connect. I cannot see connectionID=0 in the server log at all. I can only access the logs from the server, but I will try to reproduce it in my account so I can run pprof, and I will get back to you with the results. Could you link the PR you are referring to that fixes this for gRPC? Thank you.

@cheftako
Contributor

#253. I think the handling of client.PacketType_DIAL_CLS should also close the HTTP connection, but I wanted to get more testing in before adding that change. However, if you're not seeing connectionID=0, then I think you are seeing a different issue.

@mihivagyok
Contributor

Hi!

It is possible that there was some networking issue between the konnectivity server and the agent while the traffic was high.
I see this PR: #270

It is possible we have hit this issue. @Tamas-Biro1 I think, based on this, we could try to simulate it somehow. What do you think?

Thank you!
Adam

@mihivagyok
Contributor

Okay, after some investigation, I see the following:

  • this happens when we have one agent, but it is not replying to anything (the agent is hung)
  • then, when thousands of requests go through the proxy, the konnectivity server opens new connections but never closes them, and eventually it reaches its memory limit

Configuration:

  • konnectivity-server
      - --mode=http-connect
      - --proxy-strategies=destHost,default
  • konnectivity-agent
    - --agent-identifiers=ipv4=10.211.145.18

How do I make the agent hang? I use my kubectl cp test described here: #261

How do I make requests through the tunnel? I execute kubectl logs for a container continuously (20000 times), but it fails because the only agent is hanging.
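To make the suspected failure mode concrete, here is a minimal sketch (illustration only; none of these names come from the proxy-server code): each CONNECT request registers a pending dial, and if the hung agent never answers with a DIAL_RSP, the entry and the frontend connection behind it are never released. A timeout-based eviction such as dialWithTimeout below is the kind of cleanup that would bound the growth.

// Illustration only, not the actual proxy-server implementation.
package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// pendingDial stands in for the per-request state kept between sending a
// DIAL_REQ to the agent and receiving its DIAL_RSP.
type pendingDial struct {
    resultCh chan error
}

type server struct {
    mu      sync.Mutex
    pending map[int64]*pendingDial
}

// registerPending mirrors the "Set pending(rand=...)" log line above: the
// entry stays in the map until something removes it.
func (s *server) registerPending(id int64) *pendingDial {
    p := &pendingDial{resultCh: make(chan error, 1)}
    s.mu.Lock()
    s.pending[id] = p
    s.mu.Unlock()
    return p
}

func (s *server) removePending(id int64) {
    s.mu.Lock()
    delete(s.pending, id)
    s.mu.Unlock()
}

// dialWithTimeout is the cleanup pattern that is missing when the leak
// happens: if no DIAL_RSP arrives within the timeout, drop the pending entry
// and fail the frontend request instead of keeping it forever.
func (s *server) dialWithTimeout(id int64, timeout time.Duration) error {
    p := s.registerPending(id)
    select {
    case err := <-p.resultCh:
        return err
    case <-time.After(timeout):
        s.removePending(id)
        return errors.New("dial to agent timed out")
    }
}

func main() {
    s := &server{pending: map[int64]*pendingDial{}}
    // With a hung agent nothing ever writes to resultCh, so this returns the
    // timeout error instead of leaking the pending entry.
    fmt.Println(s.dialWithTimeout(4691100396281938418, 2*time.Second))
}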

In the konnectivity server logs, the following is repeating:

I1020 15:25:09.353162       1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1020 15:25:09.353217       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1020 15:25:09.353558       1 tunnel.go:78] Set pending(rand=4691100396281938418) to &{0xc01b57e640 0xc01b345600 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc01b5f88c0 map[] true true} map[] false 0 -1 200 false false [] 0 [87 101 100 44 32 50 48 32 79 99 116 32 50 48 50 49 32 49 53 58 50 53 58 48 57 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc00f3cda40 0}
I1020 15:25:09.353631       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"
I1020 15:25:09.599683       1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1020 15:25:09.599731       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1020 15:25:09.600001       1 tunnel.go:78] Set pending(rand=429742530032762963) to &{0xc01aff4e60 0xc01abfb600 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc01b1fb7a0 map[] true true} map[] false 0 -1 200 false false [] 0 [87 101 100 44 32 50 48 32 79 99 116 32 50 48 50 49 32 49 53 58 50 53 58 48 57 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc00af681c0 0}
I1020 15:25:09.600094       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"
I1020 15:25:09.733114       1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1020 15:25:09.733165       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1020 15:25:09.733400       1 tunnel.go:78] Set pending(rand=136427893707070941) to &{0xc01adf8c80 0xc01b345800 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc01b5f8d20 map[] true true} map[] false 0 -1 200 false false [] 0 [87 101 100 44 32 50 48 32 79 99 116 32 50 48 50 49 32 49 53 58 50 53 58 48 57 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc00f806e00 0}
I1020 15:25:09.733458       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"

The leak is visible in Prometheus:
[screenshot: Prometheus graph showing the leak]

Memory is increasing similarly.

What do you think? @cheftako @jkh52 @caesarxuchao

Thank you!
Adam

@mihivagyok
Contributor

In the test I'm using a container built from the main branch. Thanks.

@mihivagyok
Contributor

In the agent, the log is still rolling:

I1020 16:01:29.876054       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:31.697604       1 client.go:433] "received DATA" connectionID=387
I1020 16:01:31.697847       1 client.go:508] "write to remote" connectionID=387 lastData=39
I1020 16:01:31.698703       1 client.go:433] "received DATA" connectionID=388
I1020 16:01:31.698812       1 client.go:508] "write to remote" connectionID=388 lastData=39
I1020 16:01:35.342321       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.342377       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.342406       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.342429       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.342523       1 client.go:508] "write to remote" connectionID=10 lastData=39
I1020 16:01:35.342548       1 client.go:508] "write to remote" connectionID=10 lastData=39
I1020 16:01:35.342587       1 client.go:508] "write to remote" connectionID=10 lastData=39
I1020 16:01:35.342615       1 client.go:508] "write to remote" connectionID=10 lastData=39
I1020 16:01:35.344165       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.344225       1 client.go:508] "write to remote" connectionID=10 lastData=39
I1020 16:01:35.378570       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.378620       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.378662       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:35.378697       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:35.395529       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.395610       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:35.396069       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.396251       1 client.go:433] "received DATA" connectionID=10
I1020 16:01:35.396360       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:35.396408       1 client.go:508] "write to remote" connectionID=10 lastData=35
I1020 16:01:38.156728       1 client.go:433] "received DATA" connectionID=389
I1020 16:01:38.156886       1 client.go:508] "write to remote" connectionID=389 lastData=46
I1020 16:01:38.235450       1 client.go:433] "received DATA" connectionID=389
I1020 16:01:38.235591       1 client.go:508] "write to remote" connectionID=389 lastData=35

Once the kubectl logs provocation has stopped, I see this in the server log:

I1020 15:59:27.810075       1 tunnel.go:42] "Received request for host" method="CONNECT" host="172.21.144.88:443" userAgent=""
I1020 15:59:27.810122       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1020 15:59:27.810268       1 tunnel.go:78] Set pending(rand=6869199264244998214) to &{0xc03759a000 0xc037758300 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc037726380 map[] true true} map[] false 0 -1 200 false false [] 0 [87 101 100 44 32 50 48 32 79 99 116 32 50 48 50 49 32 49 53 58 53 57 58 50 55 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc0358d1490 0}
I1020 15:59:27.810504       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="172.21.144.88"
I1020 15:59:27.844858       1 server.go:712] "Received DIAL_RSP" random=6869199264244998214 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.845011       1 server.go:269] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc008bd8a80 CloseHTTP:0x143b720 connected:0xc0372c29c0 connectID:389 agentID:f562723e-8bc8-42cf-8c48-8ace8f1d91a1 start:{wall:13858751534165112064 ext:5214151915214 loc:0x223cee0} backend:0xc0003dc750} agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.845076       1 tunnel.go:135] "Starting proxy to host" host="172.21.144.88:443"
I1020 15:59:27.845182       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=296 totalBytes=296 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.881415       1 server.go:742] "Received data from agent" bytes=2639 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.881578       1 server.go:751] "DATA sent to frontend"
I1020 15:59:27.888126       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=2749 totalBytes=3045 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.888191       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=86 totalBytes=3131 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.888234       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=438 totalBytes=3569 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.924067       1 server.go:742] "Received data from agent" bytes=99 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.924206       1 server.go:751] "DATA sent to frontend"
I1020 15:59:27.926807       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=31 totalBytes=3600 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.965093       1 server.go:742] "Received data from agent" bytes=102 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.965253       1 server.go:751] "DATA sent to frontend"
I1020 15:59:27.965283       1 server.go:742] "Received data from agent" bytes=176 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=389
I1020 15:59:27.965732       1 server.go:751] "DATA sent to frontend"
I1020 15:59:31.444195       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=4625 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=387
I1020 15:59:31.471590       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=3713 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=388
I1020 15:59:31.477779       1 server.go:742] "Received data from agent" bytes=39 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=387
I1020 15:59:31.477904       1 server.go:751] "DATA sent to frontend"
I1020 15:59:31.505632       1 server.go:742] "Received data from agent" bytes=39 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=388
I1020 15:59:31.505849       1 server.go:751] "DATA sent to frontend"
I1020 15:59:35.320630       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=83620 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.320696       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=83659 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.320724       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=83698 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.321008       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=83737 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.321311       1 tunnel.go:168] "Forwarding data on tunnel to agent" bytes=39 totalBytes=83776 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.356494       1 server.go:742] "Received data from agent" bytes=59 agentID="f562723e-8bc8-42cf-8c48-8ace8f1d91a1" connectionID=10
I1020 15:59:35.356633       1 server.go:751] "DATA sent to frontend"

This query is for the metrics server inside the cluster; it seems kubectl top is still working.

Thanks!

@mihivagyok
Contributor

So to summarize: it seems the agent can open connections and proxy to a cluster service, but not to the kubelet?
Reading the code, I don't see any possible reason for that.

Thank you!
Adam

@relyt0925
Contributor

Note this is still occurring on the latest master (although not as frequently). Here's an example of one pod that has leaked up to 12 GB on the server side, with logs:
ks-logs.txt.zip

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 27ceee3a2676bca3248e6d0dc5e0f4bce5cf46888f7778303d8856f80159a810
    cni.projectcalico.org/podIP: 172.30.94.13/32
    cni.projectcalico.org/podIPs: 172.30.94.13/32
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.94.13"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.94.13"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2021-11-08T19:03:51Z"
  generateName: konnectivity-server-6b9dc5df9b-
  labels:
    app: konnectivity-server
    hypershift.openshift.io/control-plane-component: konnectivity-server
    hypershift.openshift.io/hosted-control-plane: master-rgshc3-1-3
    pod-template-hash: 6b9dc5df9b
  name: konnectivity-server-6b9dc5df9b-25zfm
  namespace: master-rgshc3-1-3
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: konnectivity-server-6b9dc5df9b
    uid: bc238276-0488-467a-97e6-efa0780832e8
  resourceVersion: "27486240"
  uid: 15304030-ca9d-49cd-9e39-d76c583782b6
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: hypershift.openshift.io/control-plane
            operator: In
            values:
            - "true"
        weight: 50
      - preference:
          matchExpressions:
          - key: hypershift.openshift.io/cluster
            operator: In
            values:
            - master-rgshc3-1-3
        weight: 100
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              hypershift.openshift.io/hosted-control-plane: master-rgshc3-1-3
          topologyKey: kubernetes.io/hostname
        weight: 100
  automountServiceAccountToken: false
  containers:
  - args:
    - --logtostderr=true
    - --log-file-max-size=0
    - --cluster-cert
    - /etc/konnectivity/cluster/tls.crt
    - --cluster-key
    - /etc/konnectivity/cluster/tls.key
    - --server-cert
    - /etc/konnectivity/server/tls.crt
    - --server-key
    - /etc/konnectivity/server/tls.key
    - --server-ca-cert
    - /etc/konnectivity/server/ca.crt
    - --server-port
    - "8090"
    - --agent-port
    - "8091"
    - --health-port
    - "2041"
    - --admin-port=8093
    - --mode=http-connect
    - --proxy-strategies=destHost,defaultRoute
    command:
    - /usr/bin/proxy-server
    image: registry.ng.bluemix.net/armada-master/rh-apiserver-network-proxy:332d57fea23de99c3cbe58de36623866d8819bde
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: healthz
        port: 2041
        scheme: HTTP
      initialDelaySeconds: 120
      periodSeconds: 60
      successThreshold: 1
      timeoutSeconds: 30
    name: konnectivity-server
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1001060000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/konnectivity/cluster
      name: cluster-certs
    - mountPath: /etc/konnectivity/server
      name: server-certs
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-wdn9m
  - name: pull-secret
  nodeName: 10.93.93.190
  preemptionPolicy: PreemptLowerPriority
  priority: 100000000
  priorityClassName: hypershift-control-plane
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1001060000
    seLinuxOptions:
      level: s0:c33,c2
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: hypershift.openshift.io/control-plane
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: hypershift.openshift.io/cluster
    operator: Equal
    value: master-rgshc3-1-3
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: server-certs
    secret:
      defaultMode: 420
      secretName: konnectivity-server
  - name: cluster-certs
    secret:
      defaultMode: 420
      secretName: konnectivity-cluster
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-11-08T19:03:51Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-11-08T19:04:07Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-11-08T19:04:07Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-11-08T19:03:51Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://cb9ab8a0b1b9dccfa896724b2f391d9fea1545378788d54a74a6ec7da0d9e5e7
    image: registry.ng.bluemix.net/armada-master/rh-apiserver-network-proxy:332d57fea23de99c3cbe58de36623866d8819bde
    imageID: registry.ng.bluemix.net/armada-master/rh-apiserver-network-proxy@sha256:ad8a3787c5adef59e99346d82a3db7d235f3951f86ee3f3d0b2e47eb11ac2b1b
    lastState: {}
    name: konnectivity-server
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-11-08T19:04:06Z"
  hostIP: 10.93.93.190
  phase: Running
  podIP: 172.30.94.13
  podIPs:
  - ip: 172.30.94.13
  qosClass: Burstable
  startTime: "2021-11-08T19:03:51Z"

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 7, 2022
@andrewsykim
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Feb 7, 2022
@andrewsykim
Member

@relyt0925 which commit is your image based off of? It's hard to tell from the manifest you shared because it only shows the image SHA.

@andrewsykim
Member

Reposting from #276 (comment):

An interesting observation I noticed while trying to reproduce this issue is that restarting Konnectivity Server will not completely clear the open files and goroutines. So if I restart just proxy server and immediately check the goroutine and open files count, it's a bit lower but still very high. But if I restart kube-apiserver, I see the goroutines and open files reset across both components. So at this point I'm thinking this is mainly a leak in konnectivity-client.

I'm going to run some validation against #325 today, which has some potential fixes in konnectivity-client, but I would appreciate it if anyone else who is able to reproduce the issue could also test it. Note that the fix requires a rebuild of kube-apiserver, so it might be difficult to test.
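For the open-files side of that comparison, here is a quick sketch of counting a process's descriptors (an assumption-laden helper: it presumes Linux and access to /proc; goroutine counts would come from the pprof endpoint instead):

// Small helper sketch: print the number of open file descriptors for a PID.
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
        os.Exit(1)
    }
    pid := os.Args[1]
    // Each entry in /proc/<pid>/fd is one open descriptor (sockets included).
    entries, err := os.ReadDir("/proc/" + pid + "/fd")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("pid %s has %d open file descriptors\n", pid, len(entries))
}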

@andrewsykim
Member

@Tamas-Biro1 in your cluster, do you also see kube-apiserver memory spike in a similar pattern, or is it only the ANP server/agent?

@mihivagyok
Contributor

@andrewsykim We are using Konnectivity with the http-connect configuration, NOT gRPC mode. I think konnectivity-client is only used in gRPC mode, is that right?

Of course, we could check, but then please share the commands you are interested in so the results are comparable.

Thanks,
Adam

@andrewsykim
Member

@mihivagyok you're right, I got this mixed up with #276, which was using gRPC + UDS. They seem like separate issues with similar symptoms.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tamas-Biro1
Contributor Author

/reopen

@k8s-ci-robot reopened this on Jul 13, 2023
@k8s-ci-robot
Contributor

@Tamas-Biro1: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tamas-Biro1
Contributor Author

hey folks,

we still have this issue with Konnectivity version v0.1.2

container_memory_usage_bytes:
[screenshot: container_memory_usage_bytes graph]

leaking container_sockets:
[screenshot: container_sockets graph]

do you have any ideas or suggestions on where to start debugging?

@jkh52
Contributor

jkh52 commented Jul 17, 2023

What does your EgressSelectorConfiguration look like?

With 0.1.2, using gRPC mode, we do not see sawtooth-like leaks as in earlier versions. So there could be remaining leaks in http-connect-specific code paths. @ipochi have you seen this as well, with recent binaries?

do you have any ideas or suggestions on where to start debugging?

The logs above (from 2021) must be from an older version; do you have a log sample from 0.1.2? I would also check the metrics pending_backend_dials and established_connections. If those counters are stable and low, it will help rule out agent interactions and keep the focus on the apiserver <-> proxy-server segment (which is where I suspect the problem is, based on the socket count).
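In case it helps, here is a small sketch of pulling just those counters (an assumption: that the proxy-server serves Prometheus metrics at /metrics on its admin port, 8093 in the pod spec earlier in this thread; adjust the address if your setup differs):

// Fetch the metrics endpoint and print only the dial/connection counters.
package main

import (
    "bufio"
    "fmt"
    "net/http"
    "os"
    "strings"
)

func main() {
    resp, err := http.Get("http://127.0.0.1:8093/metrics")
    if err != nil {
        fmt.Fprintln(os.Stderr, "fetch metrics:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()

    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := sc.Text()
        if strings.Contains(line, "pending_backend_dials") ||
            strings.Contains(line, "established_connections") {
            fmt.Println(line)
        }
    }
}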

@jkh52 added the lifecycle/frozen label and removed the lifecycle/rotten label on Jul 17, 2023
@ipochi
Contributor

ipochi commented Aug 11, 2023

@Tamas-Biro1 @cheftako @jkh52

Recently I came across a scenario in which I noticed a similar memory spike in http-connect mode, with proxy-server v0.0.33 getting OOMKilled.

[screenshot: memory usage graph showing the spike]

The number of pending dials was high (500+, I don't have an exact number).

What's peculiar/interesting to me is that after the first OOMKill of the proxy-server, the new container was already occupying 75% of memory at the time of the OOMKill.
