keps/sig-node/127-user-namespaces/README.md
###### How can a rollout or rollback fail? Can it impact already running workloads?

The rollout is just a feature flag on the kubelet and the kube-apiserver.

If one API server is upgraded while others aren't, that API server might accept pods that use the
pod.spec.HostUsers field while the others reject them.

On a rollback, pods created while the feature was active (i.e. created with user namespaces) have to
be restarted to be re-created without user namespaces. Simply re-creating the pods does the trick.

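As a rough sketch of what that looks like in practice (the feature gate name UserNamespacesSupport and the workload name are assumptions for illustration; use the gate name documented for your Kubernetes version):

```sh
# Rollout sketch: enable the feature gate on both components.
#
# kube-apiserver flag:
#   --feature-gates=UserNamespacesSupport=true
# kubelet, via its KubeletConfiguration file:
#   featureGates:
#     UserNamespacesSupport: true
sudo systemctl restart kubelet

# Rollback sketch: disable the gate again on both components, then re-create the
# pods that were using user namespaces, e.g. for a Deployment-managed workload:
kubectl rollout restart deployment/<name>    # <name> is a placeholder
```
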
<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->
###### What specific metrics should inform a rollback?

On the Kubernetes side, the kubelet should start correctly.

On the container runtime side, a pod created with pod.spec.HostUsers=false should run fine if all
node requirements are met.

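Concretely, both signals can be checked from the API with standard commands (the pod name is a placeholder):

```sh
# The kubelet came up and the node is Ready:
kubectl get nodes

# A pod created with hostUsers: false reached the Running phase:
kubectl get pod <pod-name> -o jsonpath='{.status.phase}'
```
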
<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes.

We tested enabling the feature flag, creating a deployment with pod.spec.HostUsers=false, and then
disabling the feature flag and restarting the kubelet and kube-apiserver.

After that, we deleted the deployment's pods; they were re-created without user namespaces just
fine, without any modification needed to the deployment YAML.

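A sketch of that manual sequence (names and the image used here are illustrative, not the exact ones from the test):

```sh
# 1. With the feature gate enabled, create a workload that opts into user namespaces.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: userns-test            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: userns-test}
  template:
    metadata:
      labels: {app: userns-test}
    spec:
      hostUsers: false         # request a user namespace for the pod
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
EOF

# 2. Disable the feature gate on the kubelet and kube-apiserver and restart them
#    (out of scope for this snippet).

# 3. Delete the pods; the Deployment re-creates them without user namespaces,
#    with no change to the manifest needed.
kubectl delete pod -l app=userns-test
kubectl get pods -l app=userns-test
```
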
<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

<!--
Even if applying deprecation policies, they may still surprise some users.
-->
###### How can an operator determine if the feature is in use by workloads?

Check if any pod has the pod.spec.HostUsers field set to false.

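For example, a query along these lines lists all pods that opt into user namespaces (a sketch; the JSONPath filter may need adjusting for your kubectl version):

```sh
# Pods whose spec sets hostUsers to false, i.e. pods running in a user namespace:
kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostUsers==false)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
```
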
<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
###### How can someone using this feature know that it is working for their instance?

Check if any pod has the pod.spec.HostUsers field set to false and is running correctly on a node
that meets all the requirements.

There are step-by-step examples in the Kubernetes documentation too.

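One hands-on check (an illustrative sketch; the pod name is a placeholder and the container image must ship `cat`) is to look at the UID map inside the container, which is not the host identity mapping when a user namespace is in use:

```sh
# Inside a pod with hostUsers: false the mapping is a 64Ki-long, non-identity range:
kubectl exec <pod-name> -- cat /proc/self/uid_map
#   0     <host-start-uid>     65536        <- user namespace in use
# Without user namespaces you would see the identity mapping instead:
#   0     0     4294967295
```
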
<!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
-->
- [ ] API .status
  - Condition name:
  - Other field:
- [x] Other (treat as last resort)
  - Details: check pods with the pod.spec.HostUsers field set to false and verify they are running
    fine.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

If a node meets all the requirements, there should be no change to existing SLOs/SLIs.

If a container runtime chooses to also support old kernels, though, that can have a performance
impact. For more details, see the question:
"Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?"

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
-->
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

No new SLIs are needed for this feature.

<!--
Pick one more of these and delete the rest.
-->
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

This feature just uses one more namespace when creating a pod. If pod creation fails (due to an
error in the kubelet or one returned by the container runtime), a clear error is returned to the
user.

A metric like "errors returned in pods with user namespaces enabled" can be very noisy, as the error
can be completely unrelated (image pull secret errors, a referenced ConfigMap that is not defined,
any other container runtime error, etc.). We don't see any metric that would be helpful, as the user
already gets very direct feedback.

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
###### Does this feature depend on any specific services running in the cluster?

Yes:

- [CRI version]
  - Usage description: the CRI changes introduced in Kubernetes 1.27 are needed.
  - Impact of its outage on the feature: minimal, the feature will be ignored by runtimes using an
    older version.
  - Impact of its degraded performance or high-error rates on the feature: N/A.

- [Linux kernel]
  - Usage description: Linux 6.3 or higher is needed.
  - Impact of its outage on the feature: pod creation will return an error.
  - Impact of its degraded performance or high-error rates on the feature: N/A.

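As a rough sketch, assuming shell access to a node and crictl installed, the two requirements above can be checked like this:

```sh
# Kernel: idmapped-mount support for the relevant filesystems needs Linux 6.3+.
uname -r

# Container runtime: it must implement the CRI user namespace fields added in
# Kubernetes 1.27; check the reported runtime version against its documentation.
crictl version
```
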
<!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed.
-->
The kubelet is splitting the host UID/GID space for different pods, to use for
their user namespace mapping. The design allows for 65k pods per node (each pod
gets a block of 64Ki UIDs out of the 32-bit UID space, i.e. 2^32 / 2^16 = 65536
pods), and the resource is limited to maxPods per node (currently maxPods
defaults to 110, so it is unlikely we will reach 65k).

For container runtimes, they might use more disk space or inodes to chown the
rootfs. This is if they choose to support this feature without relying on new
kernel features (idmapped mounts).

###### How does this feature react if the API server and/or etcd is unavailable?

No changes to current kubelet behavior. The feature only uses kubelet-local information.

###### What are other known failure modes?

- Some filesystem used by the pod doesn't support idmap mounts on the kernel used.
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

    Just look at the pod events (see the example commands after this list); it fails with:

        Warning  Failed  2s (x2 over 4s)  kubelet, 127.0.0.1  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/f037a704-742c-40fe-8dbf-17ed9225c4df/volumes/kubernetes.io~empty-dir/hugepage: invalid argument (maybe the source filesystem doesn't support idmap mounts on this kernel?): unknown

    Note the "maybe the source filesystem doesn't support idmap mounts on this kernel?" part.

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

    Remove the pod.spec.HostUsers field or disable the feature gate.

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

    TODO: rata.

- Error getting the userns ID range configuration
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

    Pod errors.

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

    Disable the feature flag.

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

    TODO

  - Testing: Are there any tests for failure mode? If not, describe why.

    TODO

- Error saving/reading pod mappings
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

- Other errors
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

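For reference, the pod events mentioned above can be inspected from the API, without node access (pod and namespace names are placeholders):

```sh
# Events for a specific pod, including the runtime error shown above:
kubectl describe pod <pod-name> -n <namespace>

# Or query the events directly:
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
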
<!--
For each of them, fill in the following information by copying the below template:
-->