Fix hostname generation issue preventing distributed operations #1833
+45
−23
Problem
The ClickHouse operator generates incorrect hostnames in the remote_servers.xml configuration, causing DNS resolution failures and breaking distributed operations in multi-node clustered deployments (sharded and/or replicated setups).
Issue Details:
- Hostname generated by the operator: chi-db-clickhouse-db-{shard}-{replica} (e.g., chi-db-clickhouse-db-0-0)
- Actual pod hostname: chi-db-clickhouse-db-{shard}-{replica}-{ordinal} (e.g., chi-db-clickhouse-db-0-0-0)
This mismatch causes all nodes to show is_local=0 in system.clusters, breaking distributed operations and ON CLUSTER commands.
Root Cause
The operator was not following Kubernetes StatefulSet DNS naming conventions. StatefulSets use a specific DNS pattern: <pod-name>.<headless-service-name>.<namespace>.svc.cluster.local
The createPodFQDN function was incorrectly using createPodHostname() (the service name) instead of createPodName() (the actual pod name, with the -0 ordinal suffix). While the service name would work for network connectivity, ClickHouse's is_local detection requires the hostname in remote_servers.xml to exactly match the pod's actual hostname for proper cluster node identification.
Solution: Fixed Both Hostname Generation Functions
Modified both the createPodHostname and createPodFQDN functions in the CHI and CHK namers:
1. Fixed createPodHostname()
   Before (broken): Returned the service name without the ordinal
   After (fixed): Returns the actual pod name with the ordinal
2. Fixed createPodFQDN()
   Before (broken): Used the service name in the FQDN
   After (fixed): Uses the proper StatefulSet DNS pattern
This ensures both functions return pod names that match actual StatefulSet pod hostnames, enabling proper is_local detection and DNS resolution.
Files Changed:
pkg/model/chi/namer/name.go - Implemented proper StatefulSet DNS pattern for CHI
pkg/model/chk/namer/name.go - Implemented proper StatefulSet DNS pattern for CHK
Compatibility with namespaceDomainPattern
This fix is fully compatible with the existing namespaceDomainPattern functionality. Whether or not users specify a custom domain pattern, the implementation properly handles both cases:
- Default: <pod-name>.<headless-service-name>.<namespace>.svc.cluster.local
- Custom pattern: <pod-name>.<headless-service-name>.<custom-domain-pattern>
The %s placeholder in namespaceDomainPattern gets replaced with the namespace name, maintaining full backward compatibility while fixing the underlying DNS resolution issues.
Impact
- ON CLUSTER operations: Distributed DDL now works in sharded configurations
- is_local=0 issue: All cluster nodes correctly identify themselves as local
- Manual extraConfig remote_servers overrides are no longer needed
- Compatible with namespaceDomainPattern overrides
Operator Log Messages Explained
This fix resolves the continuous operator log messages reporting that hosts are "outside" the cluster.
These occur because the operator's IsHostInCluster() function checks whether the host is reported as a local node. When hostnames mismatch, this check always returns 0 (no local node found), causing the operator to repeatedly log that hosts are "outside" the cluster even when they're functioning correctly.
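The effect of the mismatch can be illustrated with a deliberately simplified model. The sketch below is not the operator's IsHostInCluster() and not ClickHouse's is_local logic; it only assumes, as this description does, that a node counts as local when a configured hostname equals the pod's own hostname. The function name countLocal and the 4-node host lists are hypothetical.

```go
package main

import "fmt"

// countLocal returns how many configured hosts equal the pod's own hostname.
// Simplified model only: neither ClickHouse nor the operator is implemented this way.
func countLocal(configuredHosts []string, podHostname string) int {
	n := 0
	for _, h := range configuredHosts {
		if h == podHostname {
			n++
		}
	}
	return n
}

func main() {
	// Hostname of the pod running the check (StatefulSet name plus ordinal).
	podHostname := "chi-db-clickhouse-db-0-0-0"

	// Hostnames without the ordinal, as generated before the fix.
	before := []string{
		"chi-db-clickhouse-db-0-0", "chi-db-clickhouse-db-0-1",
		"chi-db-clickhouse-db-1-0", "chi-db-clickhouse-db-1-1",
	}
	// Hostnames with the ordinal, as generated after the fix.
	after := []string{
		"chi-db-clickhouse-db-0-0-0", "chi-db-clickhouse-db-0-1-0",
		"chi-db-clickhouse-db-1-0-0", "chi-db-clickhouse-db-1-1-0",
	}

	fmt.Println("local nodes before:", countLocal(before, podHostname)) // 0
	fmt.Println("local nodes after: ", countLocal(after, podHostname))  // 1
}
```

Under this model, the pre-fix list never contains the pod's own name, so the count stays at 0 on every node, which corresponds to the repeated "outside" messages.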
Testing & Validation
Production Tested:
- ON CLUSTER DDL operations on a 4-node cluster (2 shards, 2 replicas each)
- namespaceDomainPattern compatibility
Technical Details: StatefulSet DNS Pattern Implementation
The fix implements the standard Kubernetes StatefulSet DNS pattern by ensuring FQDNs follow: <pod-name>.<headless-service-name>.<namespace>.svc.cluster.local
Key Components:
- Pod name: chi-clickhouse-clickhouse-0-0-0 (includes the -0 ordinal)
- Headless service name: chi-clickhouse-clickhouse-0-0 (StatefulSet service)
- Domain: <namespace>.svc.cluster.local (or custom via namespaceDomainPattern)
This ensures proper DNS resolution for StatefulSet pods while maintaining compatibility with all existing cluster configurations and custom domain patterns.
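As a rough illustration of how these components combine, here is a minimal Go sketch. It is not the code from name.go: the function podFQDN, its parameters, the "default" namespace, and the "%s.svc.example.internal" pattern are all hypothetical, and it assumes the headless service shares the StatefulSet's name, as in the component list above.

```go
package main

import (
	"fmt"
	"strings"
)

// podFQDN builds <pod-name>.<headless-service-name>.<domain>.
// Illustrative only; this is not the operator's actual implementation.
func podFQDN(statefulSet string, ordinal int, namespace, namespaceDomainPattern string) string {
	// A StatefulSet pod is named <statefulset-name>-<ordinal>.
	podName := fmt.Sprintf("%s-%d", statefulSet, ordinal)

	// Assumption: the headless service has the same name as the StatefulSet.
	serviceName := statefulSet

	// Use the default domain unless a custom pattern is supplied.
	pattern := "%s.svc.cluster.local"
	if namespaceDomainPattern != "" {
		pattern = namespaceDomainPattern
	}
	// The %s placeholder is replaced with the namespace name.
	domain := strings.Replace(pattern, "%s", namespace, 1)

	return podName + "." + serviceName + "." + domain
}

func main() {
	// Default domain.
	fmt.Println(podFQDN("chi-clickhouse-clickhouse-0-0", 0, "default", ""))
	// Custom domain pattern (hypothetical value).
	fmt.Println(podFQDN("chi-clickhouse-clickhouse-0-0", 0, "default", "%s.svc.example.internal"))
}
```

With the default domain, the first call yields chi-clickhouse-clickhouse-0-0-0.chi-clickhouse-clickhouse-0-0.default.svc.cluster.local, which has the same shape as the corrected hostnames in this description.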
Verification
After this fix, users will no longer need manual extraConfig overrides. The operator automatically generates correct hostnames in remote_servers.xml that match actual StatefulSet pod names and DNS patterns.
Example of corrected hostname generation:
- Before: chi-db-clickhouse-db-0-0.namespace.svc.cluster.local
- After: chi-db-clickhouse-db-0-0-0.chi-db-clickhouse-db-0-0.namespace.svc.cluster.local
This change ensures ClickHouse can properly identify local replicas and enables all distributed operations to work correctly out of the box with proper StatefulSet DNS resolution.