OpenShift upgrades happen as follows:
Consider this small example cluster:
master-0
master-1
master-2
worker-10
worker-11
worker-12
worker-13
loadbalancer-14
loadbalancer-15
In the above example cluster, there are three machine config pools: masters, workers, and loadbalancers. This is only an example configuration; a cluster may define more machine config pools based on functionality, for example 10 MCPs if needed.
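As an illustration, a custom machine config pool such as the loadbalancers pool above might be defined roughly as follows; the pool name, node label, and selector values here are assumptions for this example, not a prescription:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: loadbalancers
spec:
  # Render machine configs for both the base worker role and this custom role
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, loadbalancers]
  # Nodes carrying this label are managed by this pool
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/loadbalancers: ""
  # Reboot at most one load balancer node at a time during updates
  maxUnavailable: 1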
When the cluster is upgraded, the API server and etcd are updated first, so the master config pool is handled first. OpenShift incrementally reboots master-0, master-1, and master-2 to bring them to the new Kubernetes version. After the masters are updated, it moves on to the other two machine config pools, working through each pool's nodes in turn. OpenShift consults the maxUnavailable setting in each machine config pool spec and reboots only as many nodes as maxUnavailable allows.
In a cluster as small as the above, maxUnavailable would be set to 1, so OpenShift would reboot loadbalancer-14 and worker-10 simultaneously, since they are in different machine config pools. OpenShift will wait until worker-10 is ready before proceeding to worker-11, and in parallel will wait for loadbalancer-14 to become available again before rebooting loadbalancer-15.
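To observe this in practice during an upgrade, you can watch the pools and nodes from the command line, for example:
$ oc get machineconfigpools
$ oc get nodes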
In clusters larger than the example cluster, the maxUnavailable for the worker pool may be set to a larger number to reboot multiple nodes in parallel and speed up rollout of the new version of OpenShift. This number should take into account the workloads on the cluster, so that sufficient resources remain to maintain application availability.
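For example, allowing three workers to reboot at once could be done with a one-line patch such as the following; the value 3 is purely illustrative and should be chosen based on the cluster's spare capacity:
$ oc patch machineconfigpool worker --type=merge -p '{"spec":{"maxUnavailable":3}}'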
For an application to stay healthy during this process, if it is stateful at all, it should be deployed as a StatefulSet or ReplicaSet; Kubernetes will by default attempt to schedule the set members across multiple nodes for additional resiliency. To prevent Kubernetes from taking too many nodes out from under an application, an application that needs a minimum number of pods running must specify a pod disruption budget. A pod disruption budget tells Kubernetes that N pods of a given microservice must be alive at any given time. For example, a small stateful database may need two out of three pods available at any given time, so that application should set a pod disruption budget with minAvailable set to 2. The scheduler then knows that it must not take a second pod of the set of three down at any point during the series of node reboots.
Note
Do NOT set your pod disruption budget so strictly that no pod can ever be evicted, for example by setting minAvailable equal to the total number of replicas or maxUnavailable to 0. Such a budget blocks node drains and can stall the upgrade indefinitely.
A corollary to the pod disruption budget is a strong readiness and health check. A well-implemented readiness check is key to surviving these upgrades: a pod should not report itself ready to Kubernetes until it is actually ready to take over the load from another pod in the set. A poor implementation would be a pod that reports itself ready while it is not yet in sync with the other DB pods from the example above. Kubernetes could see that all three pods are "ready", destroy a second pod, and disrupt the DB, leading to failure of the application served by that DB.
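The pod disruption budget for the example database described above could look like the following; the name and label are illustrative: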
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: db-pod-disruption-budget
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: db
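The readiness side of this is implemented on the pods themselves. A minimal sketch of a readiness probe for the example database, assuming a hypothetical /healthz endpoint on port 8080 that only returns success once the pod is in sync with its peers, might look like this inside the StatefulSet's container spec:
containers:
  - name: db
    image: registry.example.com/db:latest   # illustrative image
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical endpoint; must report ready only when in sync
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 3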
By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.
Run:
$ oc edit machineconfigpool worker
Set maxUnavailable to the desired value.
spec:
  maxUnavailable: <node_count>
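After saving, you can confirm that the new value was applied, for example:
$ oc get machineconfigpool worker -o jsonpath='{.spec.maxUnavailable}'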