
frontend failed to add healthy backend in minikube k8s cluster #21404

Description

@huandzh

Steps to reproduce the behavior (Required)

  1. start a minikube cluster

  2. deploy with the demo.yaml provided in this issue report. It creates the deployments and services and adds the be to the fe.

demo.zip

kubectl apply -f demo.yaml
stateDiagram
  state MiniKubeCluster {
     fe_deployment --> fe_service
     be_deployment --> be_service
     be_service --> fe_service
   }
 fe_service --> ExternalApp
  3. try connecting to the fe, or check the error logs of the fe and be

Expected behavior (Required)

expect the frontend to:

  • allow the saved IP to differ from the backend's localhost, as long as the backend is reachable and healthy
  • or allow using hostnames to access backends
  • or provide a config option to skip address validation, since in k8s many IPs (Cluster IP, Node IP, External IP) can point to the same workload
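
The three options above could be combined in a single acceptance check. The following is only a minimal sketch of that idea: `AddressCheckOptions`, `skip_validation`, `accepted_aliases`, and `accept_fe_saved_address` are hypothetical names made up for illustration and are not existing StarRocks configuration or APIs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical options, invented for this sketch only.
struct AddressCheckOptions {
    // Corresponds to "provide a config option to skip address validation".
    bool skip_validation = false;
    // Other addresses known to reach the same workload,
    // e.g. a Cluster IP, Node IP, External IP, or a hostname.
    std::vector<std::string> accepted_aliases;
};

// Returns true if the address saved by the fe should be accepted by the be.
bool accept_fe_saved_address(const std::string& fe_saved,
                             const std::string& be_localhost,
                             const AddressCheckOptions& opts) {
    if (opts.skip_validation) return true;       // validation disabled
    if (fe_saved == be_localhost) return true;   // exact match, as today
    // Otherwise accept only addresses explicitly listed as aliases.
    return std::find(opts.accepted_aliases.begin(),
                     opts.accepted_aliases.end(),
                     fe_saved) != opts.accepted_aliases.end();
}
```

With the addresses from this report, `accept_fe_saved_address("10.107.157.178", "10.244.0.73", opts)` would return false with default options and true once `10.107.157.178` is listed in `accepted_aliases` or `skip_validation` is set.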

Real behavior (Required)

  1. the frontend correctly saves the Cluster IP when the backend is added by hostname, as expected
  2. but it validates the saved address against the backend's localhost, which is the Endpoint Node IP (generally assigned randomly by k8s unless an Endpoints or EndpointSlice object is created explicitly)
  3. so the backend can't be activated

The be is healthy, but cannot be added:

image

FE LOG (from the 8030 site):

2023-04-12 05:22:30,697 WARN (heartbeat mgr|31) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: FE saved address not match backend address
2023-04-12 05:22:32,524 WARN (tablet stat mgr|25) [TabletStatMgr.updateLocalTabletStat():129] task exec error. backend[10002]
org.apache.thrift.transport.TTransportException: Invalid port -1
	at org.apache.thrift.transport.TSocket.open(TSocket.java:213) ~[libthrift-0.13.0.jar:0.13.0]
	at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:131) ~[starrocks-fe.jar:?]
	at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:116) ~[starrocks-fe.jar:?]
	at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
	at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:88) ~[starrocks-fe.jar:?]
	at com.starrocks.catalog.TabletStatMgr.updateLocalTabletStat(TabletStatMgr.java:121) [starrocks-fe.jar:?]
	at com.starrocks.catalog.TabletStatMgr.runAfterCatalogReady(TabletStatMgr.java:71) [starrocks-fe.jar:?]
	at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) [starrocks-fe.jar:?]
	at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]

BE LOG (from be.WARNING in the be pod):

W0412 03:15:05.854683  1184 heartbeat_server.cpp:148] 10.107.157.178 not equal to to backend localhost 10.244.0.73

connection error (show backends; in sql):

SQL Error (1064): Backend not found. Check if any backend is down or not. backend: [10.107.157.178 alive: false inBlacklist: false]

further details

  • the localhost IP of the backend node is the endpoint IP:

image

ifconfig info:

image

  • the service in front of the node has a Cluster IP, which is also what the hostname points to:

image

image

code that may cause the issue:

if (master_info.__isset.backend_ip) {
    if (master_info.backend_ip != BackendOptions::get_localhost()) {
        LOG(WARNING) << master_info.backend_ip << " not equal to to backend localhost "
                     << BackendOptions::get_localhost();
        bool fe_saved_is_valid_ip = is_valid_ip(master_info.backend_ip);
        if (fe_saved_is_valid_ip && is_valid_ip(BackendOptions::get_localhost())) {
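
To make the effect of the excerpt concrete, here is a small self-contained model of its condition, applied to the two addresses from the BE log. It is a sketch under assumptions, not the StarRocks implementation: `is_valid_ip_sketch` stands in for `is_valid_ip` (assumed here to accept IPv4 or IPv6 literals), `hits_mismatch_branch` is a hypothetical helper, and the body of the innermost branch is not shown in the excerpt, so the sketch only reports whether that branch is reached.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string>

// Stand-in for is_valid_ip (assumption: true for IPv4 or IPv6 literals only).
bool is_valid_ip_sketch(const std::string& s) {
    unsigned char buf[sizeof(in6_addr)];
    return inet_pton(AF_INET, s.c_str(), buf) == 1 ||
           inet_pton(AF_INET6, s.c_str(), buf) == 1;
}

// True when the innermost branch of the excerpt would be reached:
// the two addresses differ and both are literal IPs.
bool hits_mismatch_branch(const std::string& fe_saved,
                          const std::string& be_localhost) {
    return fe_saved != be_localhost &&
           is_valid_ip_sketch(fe_saved) &&
           is_valid_ip_sketch(be_localhost);
}
```

For the values in the BE log, `hits_mismatch_branch("10.107.157.178", "10.244.0.73")` is true: the Cluster IP saved by the fe and the pod-local address are both valid IPs and differ, which is consistent with the "FE saved address not match backend address" heartbeat response in the FE log.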

attempted workarounds

  1. add an EndpointSlice object to predefine a static endpoint IP for the service
  2. add the backend with the static IP instead of the hostname

Result: the SQL client can now query the fe, with the backend added and activated. However, the stream load API still does not work when other services in the same minikube cluster call it (this could be a separate issue).

StarRocks version (Required)

2.5.4

minikube version

minikube version: v1.30.1 (default Windows installation)
