Druid pods taking more time to come up when security enabled #301

bharathkuppala · 2022-08-03T11:35:54Z

Hello All,
druid-operator: v0.0.7
druid: v0.22.1

We are trying to enable basic security for druid and we found out that it was taking Approx. 16 min for entire cluster to come up. Upon investigation on logs we found out that there were multiple retries against coordinator trying to fetch some users data.

Logs from middleManager:
{"timeMillis":1653992838211,"thread":"main","level":"WARN","loggerName":"org.apache.druid.java.util.common.RetryUtils","message":"Retrying (1 of 9) in 727ms.","thrown":{"commonElementCount":0,"localizedMessage":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","message":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","name":"com.fasterxml.jackson.databind.exc.MismatchedInputException","extendedStackTrace":[{"class":"com.fasterxml.jackson.databind.exc.MismatchedInputException","method":"from","file":"MismatchedInputException.java","line":59,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_initForReading","file":"ObjectMapper.java","line":4360,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_readMapAndClose","file":"ObjectMapper.java","line":4205,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"readValue","file":"ObjectMapper.java","line":3292,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"tryFetchUserMapsFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":400,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"lambda$fetchUserAndRoleMapFromCoordinator$4","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":330,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":129,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":81,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":163,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":153,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"fetchUserAndRoleMapFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":328,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"initUserMaps","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":457,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"start","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":116,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke0","file":"NativeMethodAccessorImpl.java","line":-2,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke","file":"NativeMethodAccessorImpl.java","line":62,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.DelegatingMethodAccessorImpl","method":"invoke","file":"DelegatingMethodAccessorImpl.java","line":43,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"java.lang.reflect.Method","method":"invoke","file":"Method.java","line":498,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler","method":"start","file":"Lifecycle.java","line":446,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle","method":"start","file":"Lifecycle.java","line":341,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.guice.LifecycleModule$2","method":"start","file":"LifecycleModule.java","line":143,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.GuiceRunnable","method":"initLifecycle","file":"GuiceRunnable.java","line":115,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.ServerRunnable","method":"run","file":"ServerRunnable.java","line":63,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.Main","method":"main","file":"Main.java","line":113,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}

In the above log its looking for some method which would fetch some information from coordinator and would retry for 9 times.

Wanted to know does druid-operator impose any restrictions upon druid installation like order its broker, historicals and middlemanager which would comeup first during scratch installation then comes coordinator and router?

If above is true then pods will fail looking for coordinator?
Any help on this?

The text was updated successfully, but these errors were encountered:

bharathkuppala · 2022-08-17T06:38:18Z

@AdheipSingh : Can you please provide any info on this ? if you know already

AdheipSingh · 2022-08-17T10:37:43Z

sorry missed this out,
the druid operator has nothing do with security enabled on druid or not. Druid operator supports securityContexts on the kubernetes pods it deploys and manages.
When rolling deploy is enabled to true the druid operator upgrades the nodes in the order as described here
https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L1765
When the operator installs the cluster all pods are deployed parallel.

bharathkuppala · 2022-09-08T01:46:33Z

Thanks for the rly @AdheipSingh and sry for not responding actually we were testing this and found out that when we give all node types as a statefulset like you said all pods are deployed in parallel, But let's say in my case we are having
Deployments- Broker, Coordinator and Router
Statefulsets - Historicals and MiddleManagers

Now when we install we see along with 2 statefulsets only one deployment is getting up hence stopping other deployments from coming up.

Can you please check this behavior if not already done? Or any suggestions would be of a great help!!

Stack trace
2022-09-08T05:40:35.675Z ERROR controller-runtime.manager.controller.druid Reconciler error {"reconciler group": "druid.apache.org", "reconciler kind": "Druid", "name": "tsdb-apache-druid", "namespace": "nom-backend", "error": "StatefulSet.apps \"druid-tsdb-apache-druid-brokers\" not found"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
After above stack trace it would stop near broker preventing other deployments to come up
Thanks!!

bharathkuppala · 2022-09-08T06:49:51Z

Intermediately i see this issue after doing scratch install multiple times. But i always see above stack trace no sure why :(

AdheipSingh · 2022-09-08T08:22:32Z

@bharathkuppala https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L182

Be deployment or statefulset in any combination, all pods should come up in parallel for the 1st installation.

Can you tell me the generation number in the CR metadata, if its > 1 and rollingDeploy is kept to true in your CR, upgrades and changes will happen incremental.

Use Kind: Deployemnt to switch to deployment by default its statefulset

bharathkuppala · 2022-09-08T11:44:29Z

@AdheipSingh : Yes like you said there is no problem while doing the upgrade with generation number > 1 it would follow the prescribed order. But somehow i see with podMangamentPolicy as parallel both Historical and MM are deployed next broker would comeup stopping other deployments

bharathkuppala · 2022-09-09T02:07:36Z

We are trying our druid cluster in open shift. Just wanted to know do we have to check something on open shift side to troubleshoot this behavior?
Because in operator we saw it often hitting this line and giving else error.

druid-operator/controllers/druid/handler.go

Line 680 in 5abef01

if apierrors.IsNotFound(err) {

AdheipSingh · 2022-09-09T07:29:40Z

@bharathkuppala IMHO it should not arise, i will try to re-create this issue. Just to double check, did you switch between sts and deployments in Kind ?

bharathkuppala · 2022-09-09T10:45:44Z

@AdheipSingh : Yes i did switch but still i see this issue

bharathkuppala · 2022-09-09T10:51:03Z

druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-historicals-1 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 druid-tsdb-apache-druid-middlemanagers-1 0/1 druid-tsdb-apache-druid-historicals-0 0/1 druid-tsdb-apache-druid-middlemanagers-0 0/1 druid-tsdb-apache-druid-historicals-0 1/1 druid-tsdb-apache-druid-middlemanagers-1 1/1 druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 1/1 druid-tsdb-apache-druid-brokers-86c4849b47-4598x 1/1 druid-tsdb-apache-druid-middlemanagers-0 1/1 druid-tsdb-apache-druid-historicals-1 1/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 1/1 druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 1/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
Pending 0 0s
Pending 0 0s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
Pending 0 1s
ContainerCreating 0 2s
ContainerCreating 0 2s
ContainerCreating 0 2s
Pending 0 2s
Pending 0 2s
Pending 0 2s
Pending 0 2s
Pending 0 2s
Pending 0 2s
ContainerCreating 0 2s
ContainerCreating 0 2s
ContainerCreating 0 3s
Pending 0 3s
ContainerCreating 0 3s
ContainerCreating 0 3s
ContainerCreating 0 3s
ContainerCreating 0 3s
ContainerCreating 0 6s
ContainerCreating 0 8s
ContainerCreating 0 8s
ContainerCreating 0 9s
Running 0 1m
Running 0 1m
Running 0 1m
Running 0 1m
Running 0 1m
Running 0 2m
Running 0 13m
Running 0 13m
Running 0 14m
Running 0 14m
Running 0 14m
Running 0 14m
Pending 0 0s
Pending 0 1s
Pending 0 1s
Pending 0 2s
Pending 0 2s
ContainerCreating 0 2s
Pending 0 2s
ContainerCreating 0 3s
ContainerCreating 0 3s
ContainerCreating 0 3s
Running 0 3s
Running 0 4s
Running 0 30s
Running 0 30s
Pending 0 0s
Pending 0 0s
Pending 0 0s
Pending 0 0s
Pending 0 1s
Pending 0 1s
ContainerCreating 0 1s
ContainerCreating 0 1s
ContainerCreating 0 2s
ContainerCreating 0 3s
Running 0 3s
Running 0 4s

This is what is happening in my current setup once historical, MM and Broker pods are up then only coordinator and routers are triggered

AdheipSingh · 2022-09-24T01:57:21Z

are you adding any node selectors ? are you coordinator/routers get scheduled on different nodes
if you are rollingDeploy is set to false, all pods should come in parallel, there is no conditional check to wait for druid nodes on the first installation.

bharathkuppala · 2022-09-26T04:08:20Z

@AdheipSingh : No we are not using any node selectors for scheduling the pods. And we are using rollingDeploy as true because we want our pods to have an upgrade based on prescribed order.

Yea i see on druid-operator code we are not checking generation so pods should come in parallel. I can share logs of druid-operator so we could have a complete trace.

AdheipSingh · 2022-10-02T09:03:14Z

@bharathkuppala whats your podmanagement policy ? try setting to Parallel

bharathkuppala · 2022-10-14T07:46:23Z

@AdheipSingh : Yes it is by default parallel right for sts?

bharathkuppala mentioned this issue Sep 8, 2022

How is stable is this project? #303

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Druid pods taking more time to come up when security enabled #301

Druid pods taking more time to come up when security enabled #301

bharathkuppala commented Aug 3, 2022

bharathkuppala commented Aug 17, 2022

AdheipSingh commented Aug 17, 2022 •

edited

Loading

bharathkuppala commented Sep 8, 2022 •

edited

Loading

bharathkuppala commented Sep 8, 2022

AdheipSingh commented Sep 8, 2022 •

edited

Loading

bharathkuppala commented Sep 8, 2022

bharathkuppala commented Sep 9, 2022

AdheipSingh commented Sep 9, 2022

bharathkuppala commented Sep 9, 2022

bharathkuppala commented Sep 9, 2022 •

edited

Loading

AdheipSingh commented Sep 24, 2022

bharathkuppala commented Sep 26, 2022

AdheipSingh commented Oct 2, 2022

bharathkuppala commented Oct 14, 2022

Druid pods taking more time to come up when security enabled #301

Druid pods taking more time to come up when security enabled #301

Comments

bharathkuppala commented Aug 3, 2022

bharathkuppala commented Aug 17, 2022

AdheipSingh commented Aug 17, 2022 • edited Loading

bharathkuppala commented Sep 8, 2022 • edited Loading

bharathkuppala commented Sep 8, 2022

AdheipSingh commented Sep 8, 2022 • edited Loading

bharathkuppala commented Sep 8, 2022

bharathkuppala commented Sep 9, 2022

AdheipSingh commented Sep 9, 2022

bharathkuppala commented Sep 9, 2022

bharathkuppala commented Sep 9, 2022 • edited Loading

AdheipSingh commented Sep 24, 2022

bharathkuppala commented Sep 26, 2022

AdheipSingh commented Oct 2, 2022

bharathkuppala commented Oct 14, 2022

AdheipSingh commented Aug 17, 2022 •

edited

Loading

bharathkuppala commented Sep 8, 2022 •

edited

Loading

AdheipSingh commented Sep 8, 2022 •

edited

Loading

bharathkuppala commented Sep 9, 2022 •

edited

Loading