If something isn't working as expected in this tutorial, try the following steps:
If kubectl, kind, or clusterctl return an error like the following:
Unable to connect to the server: net/http: TLS handshake timeout
It's likely that Docker has become unresponsive. Troubleshooting depends on your system. If you're using Docker Desktop, you may have to restart it and start the tutorial again. If you're using Linux, you may be able to troubleshoot by running systemctl status docker and reviewing the logs.
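On Linux, restarting the Docker daemon often resolves this; a minimal sketch, assuming a systemd-based system:
sudo systemctl restart docker
sudo journalctl -u docker --since "10 minutes ago" #review recent Docker daemon logs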
Run:
clusterctl describe cluster --show-conditions=all docker-cluster-one
The output of this command will show the specific steps that have failed for your cluster - e.g. if bootstrap failed, the answer might be found on the KubeadmConfig or the DockerMachine object.
If you've identified an object that is not being created or updated correctly, you can get more detailed information about it using kubectl describe $OBJECT_TYPE $OBJECT_NAME
To list objects of a given type across all namespaces run kubectl get -A $OBJECT_TYPE
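For example, to list the KubeadmConfig and DockerMachine objects mentioned above (assuming the standard Cluster API and CAPD resource names):
kubectl get -A kubeadmconfigs
kubectl get -A dockermachines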
For clusters run:
kubectl get -A clusters
For machines run:
kubectl get -A machines
To see the details of specific resources use:
For clusters:
kubectl describe cluster -n docker-cluster-namespace docker-cluster-one #use the namespace and name of the desired cluster
For machines:
kubectl describe machine -n docker-cluster-namespace docker-cluster-one-md-fjkd-309127 #use the namespace and name of the desired machine
The list of conditions on objects, and the messages associated with them, may help point to which controller is responsible for the issue.
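One way to print just the conditions of a machine is kubectl's jsonpath output; the machine name below is illustrative:
kubectl get machine -n docker-cluster-namespace docker-cluster-one-md-fjkd-309127 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'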
This tutorial uses Cluster API Provider Docker for infrastructure - that means that each node is running in a Docker container.
To see these containers run:
docker ps
With a single cluster docker-cluster-one with one control plane node and one worker node, you should see output like the following:
CONTAINER ID   IMAGE                                COMMAND                  CREATED              STATUS              PORTS                                    NAMES
0c7669d2b090 kindest/node:v1.24.6 "/usr/local/bin/entr…" 30 seconds ago Up 29 seconds docker-cluster-one-md-0-x2qmh-5596d8b774-72nsr
69c412fd86d9 kindest/node:v1.24.6 "/usr/local/bin/entr…" About a minute ago Up About a minute 45305/tcp, 127.0.0.1:45305->6443/tcp docker-cluster-one-w2vpl-7bldj
48444f6c42f5 kindest/haproxy:v20210715-a6da3463 "haproxy -sf 7 -W -d…" About a minute ago Up About a minute 33877/tcp, 0.0.0.0:33877->6443/tcp docker-cluster-one-lb
fb9dcfb50740 kindest/node:v1.25.2 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes 127.0.0.1:33369->6443/tcp kubecon-na-22-capi-lab-control-plane
The first container in the list is the worker node. Its name - in the NAMES column at the far right - is docker-cluster-one-md-0 followed by a suffix; md-0 is the name of the MachineDeployment the worker is a part of.
The second container is the control plane node.
The third container is a load balancer used for the control plane.
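If connections to the workload cluster's API server are failing, the load balancer's logs may contain clues. Using the container name from the output above:
docker logs docker-cluster-one-lb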
You may be able to find useful information about your error in the logs of the Cluster API components. Searching for error strings in this troubleshooting guide, or in the Cluster API book, may point to a solution for your problem.
For the CAPI (Core) controller manager:
kubectl logs -n capi-system deployments/capi-controller-manager
For the CAPD (Docker infrastructure) controller manager:
kubectl logs -n capd-system deployments/capd-controller-manager
For the CAPBK (Kubeadm bootstrap) controller manager:
kubectl logs -n capi-kubeadm-bootstrap-system deployments/capi-kubeadm-bootstrap-controller-manager
For the CAPI Kubeadm control plane controller manager:
kubectl logs -n capi-kubeadm-control-plane-system deployments/capi-kubeadm-control-plane-controller-manager
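To narrow a long log down to likely problems, you can filter it with standard shell tools - e.g. for the CAPD controller manager:
kubectl logs -n capd-system deployments/capd-controller-manager | grep -i error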
During the course of the tutorial you might end up in an unstable state. If you can't solve an issue and you'd like to start the tutorial over again, see our cleaning up guide.
Docker may run out of space for the volumes and images required by this tutorial. This can especially be the case if you're doing the tutorial in a pre-existing Docker environment that may have dangling volumes and images. To clean up space run:
docker system prune
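To see how much space Docker is using before and after pruning, you can run:
docker system df
Note that docker system prune removes all unused containers, networks, and dangling images on your system, not just those created by this tutorial.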
You may need superuser permissions (sudo) for the commands in this section.
When running this tutorial on a Fedora installation there is an issue that can prevent clusters from advancing past the Provisioning stage.
This problem can cause system instability. The root cause is a resource leak when Docker runs haproxy with the default Fedora configuration.
To fix this issue you need to find the Docker systemd unit file. On most systems it should be at /usr/lib/systemd/system/docker.service. To find where it is on your system run:
systemctl status docker | grep loaded
Output:
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Open this file with a text editor, e.g.:
nano /usr/lib/systemd/system/docker.service
The file should have a section that looks like:
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
Modify it by adding the --default-ulimit nofile=65883:65883 argument to the ExecStart line. The line should look similar to the below:
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=65883:65883
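Alternatively, instead of editing the packaged unit file directly, you can create a systemd drop-in override, which survives package upgrades; a sketch, assuming you open the override with sudo systemctl edit docker and add:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=65883:65883
The empty ExecStart= line is required to clear the original command before redefining it.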
Save and exit the file, then reload the systemd config and restart the Docker service using:
systemctl daemon-reload
systemctl restart docker
You can check the status of the Docker service using:
systemctl status docker
The output on a successful reconfiguration should look like the below:
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-10-19 14:31:51 BST; 13s ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 13857 (dockerd)
Tasks: 49
Memory: 40.9M
CPU: 785ms
CGroup: /system.slice/docker.service
├─ 13857 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=65883:65883 #ulimit nofile limit is set
├─ 14040 /usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 41825 -container-ip 172.19.0.2 -container-port 6443
├─ 14066 /usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 38625 -container-ip 172.19.0.3 -container-port 6443
└─ 14143 /usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 5000 -container-ip 172.19.0.4 -container-port 5000
Once this change is applied, a Fedora system should be able to run the load balancer container correctly.
When getting logs from CAPI or Kubernetes components you may see log lines containing the string "too many open files". On Linux you can resolve this issue by setting inotify limits on your system.
To do so run:
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
Verify these values by running:
sysctl -a | grep fs.inotify.max_user
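These sysctl settings are not persistent across reboots. To make them permanent on most Linux distributions you can add them to a file under /etc/sysctl.d (the filename below is illustrative):
sudo tee /etc/sysctl.d/99-inotify.conf <<EOF
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system #reload settings from all sysctl configuration files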
If this error occurs on macOS, you may need to update the version of Docker Desktop you're using.
On Windows, using Docker Desktop 4.10.1 with WSL 2, you may see this error in the logs of containers in the Docker Desktop UI.
From a PowerShell prompt, enter the Docker Desktop WSL 2 environment using:
wsl
To change the values, run:
sysctl fs.inotify.max_user_watches=524288
sysctl fs.inotify.max_user_instances=512
Note that these settings may not persist when the WSL VM restarts, so you may need to re-apply them.
Verify the changes by running:
sysctl -a | grep fs.inotify.max_user
If the CAPD logs contain the line "failed to pre-load images into the DockerMachine", run the prepull images script to ensure you have the images stored locally.
On macOS and Linux run:
./scripts/prepull-images.sh
Or on Windows:
.\scripts\prepull-images.ps1
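After the script completes, you can verify that the node images are present locally with, e.g.:
docker images | grep kindest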
The following resources may help you to identify problems: