[Bug]: Unable to use the load balancer in production during a rolling update #9902
4 comments · 13 replies
-
I forgot to attach the logs. These are the errors that appear on the consumer when the brokers are updated. The errors occur precisely while the second broker is being updated; as I described above, at that point two brokers turn out to be unavailable at once. Logs from Alibaba ACK with listener = load balancer:
-
I'm not sure I understand what the supposed issue is, as you seem to be mixing many different things together:
Each of these is a completely different solution, and only one of them is really part of Strimzi. If there is any problem, you will need to explain it in a way that is understandable to people who do not know your environment.
-
I ran into this today ... if your problem is, for whatever reason, as you describe it, have you considered / tried using something like this? https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.1/deploy/pod_readiness_gate/
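For reference, a minimal sketch of how that readiness-gate injection can be enabled, assuming a controller version that supports automatic injection via a namespace label (older releases instead require the readiness gate to be listed in the pod spec); the namespace name is a placeholder:

```yaml
# Sketch only: label the namespace that contains the Kafka pods so the
# aws-load-balancer-controller injects a target-health readiness gate into
# pods matched by a TargetGroupBinding. Namespace name "kafka" is an example.
apiVersion: v1
kind: Namespace
metadata:
  name: kafka
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```

With the gate in place, a restarted broker pod is not reported Ready until it is healthy in its NLB target group, so the operator waits before rolling the next broker.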
-
Thank you @scholzj and @Alexander-Volk. I have the same setup, and @Alexander-Volk your solution seems to be working: the Kafka broker pod does not become healthy until it is healthy in the AWS target group.
-
Bug Description
This is most likely a bug, since the solutions described in the original article do not work as expected and undermine one of the main features of Kafka: that it is a production-grade system in which any component can be updated without downtime or data loss. With the load balancer based setup, however, there is data loss and downtime for the Kafka brokers.
I will describe the setup: we have a private network in AWS and have deployed an EKS cluster on it. We use the CNI plugin, so the pods in the cluster are reachable by IP address from our private network. This lets us publish our applications using a load balancer in front of ClusterIP Kubernetes services. To link the load balancer and a service, we use a target group (with the required health checks) and a TargetGroupBinding resource that binds the Kubernetes service to the AWS target group. This approach lets a single NLB serve both the bootstrap service and the brokers, so we do not have to create unnecessary NLB instances. I would treat it as a more advanced variant of your documented approach that branches off into using a single NLB; the NLB also allows TLS termination, and our setup is automated without manual steps.
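For context, a minimal sketch of the kind of TargetGroupBinding used in such a setup; the names, port and target group ARN below are placeholders, not the actual values:

```yaml
# Sketch only: binds one broker's ClusterIP service to an existing NLB
# target group via the aws-load-balancer-controller.
# Service name, port and targetGroupARN are placeholders.
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: kafka-broker-0
  namespace: kafka
spec:
  serviceRef:
    name: my-cluster-kafka-0   # per-broker ClusterIP service (placeholder name)
    port: 9094
  targetType: ip
  targetGroupARN: arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/kafka-broker-0/0123456789abcdef
```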
The solution works and performs well until an uncontrolled rolling update comes into play. Our setup currently uses 3 brokers and a topic configured with replicas: 3 and min.insync.replicas: 2. After the Strimzi Kafka cluster is deployed, everything works until the brokers need to be updated. Whenever a configuration change affects the broker state, a rolling update is started, and it usually completes successfully. While the first broker restarts, the solution keeps working because the configuration tolerates losing one broker. The first broker then comes back and the second broker starts restarting almost immediately; that is when the errors begin. Right after the first broker is restored it is in fact not yet reachable through the load balancer, so when the second broker starts updating, two brokers are effectively unavailable at once, but Strimzi thinks otherwise, hence the errors. The Strimzi operator rolls the brokers very quickly, which prevents the load balancer health checks from passing before the second broker is restarted.
We also have a solution based on the LoadBalancer-type Kubernetes service, which creates an NLB for each broker and for the bootstrap service. This is essentially the solution described here, but in an Alibaba ACK cluster using an Alibaba NLB. The behaviour is exactly the same: when the first broker is updated everything is OK, but after the first broker is restored and the second broker starts updating, data loss occurs.
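For reference, a minimal sketch of a topic with the replication settings described above; the topic, namespace and cluster names and the partition count are placeholders:

```yaml
# Sketch only: a Strimzi KafkaTopic with replicas: 3 and
# min.insync.replicas: 2 as described above. Names and partition
# count are placeholders.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 3
  config:
    min.insync.replicas: 2
```

With min.insync.replicas: 2 on a 3-broker cluster, having two brokers unreachable at the same time makes partitions unavailable to producers using acks=all, which matches the errors seen during the roll.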
Steps to reproduce
Expected behavior
The rolling update works correctly and the brokers are updated with the load balancer health checks taken into account. There are no errors in the logs, no data loss, and the messages arrive in the correct order.
Strimzi version
0.39.0
Kubernetes version
v1.28.6-eks-508b6b3, v1.28.3-aliyun.1
Installation method
Helm chart + ArgoCD
Infrastructure
AWS EKS, Alibaba ACK
Configuration files and logs
For AWS EKS cluster:
Kafka cluster:
Kafka topic:
TargetGroupBinding - created for the bootstrap service and for each broker
For Alibaba ACK cluster:
Kafka Cluster:
The topic is the same as for AWS. TargetGroupBinding is not needed because the listener type is loadbalancer.
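For reference, a minimal sketch of what a loadbalancer-type listener looks like in the Kafka resource; the listener name, port and cluster name are placeholders, not the actual configuration used here:

```yaml
# Sketch only: an external listener of type loadbalancer. With this
# listener the operator itself creates one LoadBalancer service per
# broker plus one for the bootstrap address.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    listeners:
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
    # ... remaining broker, storage and entity operator configuration
```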
Additional context
It seems like this should work out of the box, since Kafka is meant to be a production-grade system. A possible solution could be an option to update pods manually, with an explicit indication of which pods to roll, or to delay the update of subsequent pods. Perhaps this function will help.