keps/sig-node/127-user-namespaces/README.md
###### How can a rollout or rollback fail? Can it impact already running workloads?

The rollout is just a feature flag on the kubelet and the kube-apiserver.

If one API server is upgraded while others aren't, that API server might accept pods that use the
pod.spec.HostUsers field while the others reject them.

On a rollback, pods created while the feature was active (i.e. created with user namespaces) have to
be restarted to be re-created without user namespaces. Simply re-creating the pods does the trick.

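As a rough sketch of what that looks like in practice (the feature gate name UserNamespacesSupport and the workload name are assumptions for illustration; use the gate name documented for your Kubernetes version):

```sh
# Rollout sketch: enable the feature gate on both components.
#
# kube-apiserver flag:
#   --feature-gates=UserNamespacesSupport=true
# kubelet, via its KubeletConfiguration file:
#   featureGates:
#     UserNamespacesSupport: true
sudo systemctl restart kubelet

# Rollback sketch: disable the gate again on both components, then re-create the
# pods that were using user namespaces, e.g. for a Deployment-managed workload:
kubectl rollout restart deployment/<name>    # <name> is a placeholder
```
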
<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->
###### What specific metrics should inform a rollback?

On the Kubernetes side, the kubelet should start correctly.

On the container runtime side, a pod created with pod.spec.HostUsers=false should run fine if all
node requirements are met.

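Concretely, both signals can be checked from the API with standard commands (the pod name is a placeholder):

```sh
# The kubelet came up and the node is Ready:
kubectl get nodes

# A pod created with hostUsers: false reached the Running phase:
kubectl get pod <pod-name> -o jsonpath='{.status.phase}'
```
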
<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes.

We tested enabling the feature flag, creating a deployment with pod.spec.HostUsers=false, and then
disabling the feature flag and restarting the kubelet and kube-apiserver.

After that, we deleted the deployment's pods; they were re-created without user namespaces just
fine, without any modification needed to the deployment YAML.

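A sketch of that manual sequence (names and the image used here are illustrative, not the exact ones from the test):

```sh
# 1. With the feature gate enabled, create a workload that opts into user namespaces.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: userns-test            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: userns-test}
  template:
    metadata:
      labels: {app: userns-test}
    spec:
      hostUsers: false         # request a user namespace for the pod
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
EOF

# 2. Disable the feature gate on the kubelet and kube-apiserver and restart them
#    (out of scope for this snippet).

# 3. Delete the pods; the Deployment re-creates them without user namespaces,
#    with no change to the manifest needed.
kubectl delete pod -l app=userns-test
kubectl get pods -l app=userns-test
```
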
<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

<!--
Even if applying deprecation policies, they may still surprise some users.
-->
###### How can an operator determine if the feature is in use by workloads?

Check if any pod has the pod.spec.HostUsers field set to false.

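For example, a query along these lines lists all pods that opt into user namespaces (a sketch; the JSONPath filter may need adjusting for your kubectl version):

```sh
# Pods whose spec sets hostUsers to false, i.e. pods running in a user namespace:
kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostUsers==false)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
```
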
<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
###### How can someone using this feature know that it is working for their instance?

Check if any pod has the pod.spec.HostUsers field set to false and is running correctly on a node
that meets all the requirements.

There are step-by-step examples in the Kubernetes documentation too.

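One hands-on check (an illustrative sketch; the pod name is a placeholder and the container image must ship `cat`) is to look at the UID map inside the container, which is not the host identity mapping when a user namespace is in use:

```sh
# Inside a pod with hostUsers: false the mapping is a 64Ki-long, non-identity range:
kubectl exec <pod-name> -- cat /proc/self/uid_map
#   0     <host-start-uid>     65536        <- user namespace in use
# Without user namespaces you would see the identity mapping instead:
#   0     0     4294967295
```
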
<!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
-->
- [ ] API .status
  - Condition name:
  - Other field:
- [x] Other (treat as last resort)
  - Details: check pods with the pod.spec.HostUsers field set to false and verify they are running
    fine.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

If a node meets all the requirements, there should be no change to existing SLOs/SLIs.

If a container runtime chooses to also support old kernels, though, that can have a performance
impact. For more details, see the question:
"Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?"

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
-->
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

No new SLIs are needed for this feature.

<!--
Pick one more of these and delete the rest.
-->
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

This feature just uses one more namespace when creating a pod. If pod creation fails (due to an
error in the kubelet or one returned by the container runtime), a clear error is returned to the
user.

A metric like "errors returned in pods with user namespaces enabled" can be very noisy, as the error
can be completely unrelated (image pull secret errors, a referenced ConfigMap that is not defined,
any other container runtime error, etc.). We don't see any metric that would be helpful, as the user
already gets very direct feedback.

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
###### Does this feature depend on any specific services running in the cluster?

Yes:

- [CRI version]
  - Usage description: the CRI changes introduced in Kubernetes 1.27 are needed.
  - Impact of its outage on the feature: minimal, the feature will be ignored by runtimes using an
    older version.
  - Impact of its degraded performance or high-error rates on the feature: N/A.

- [Linux kernel]
  - Usage description: Linux 6.3 or higher is needed.
  - Impact of its outage on the feature: pod creation will return an error.
  - Impact of its degraded performance or high-error rates on the feature: N/A.

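As a rough sketch, assuming shell access to a node and crictl installed, the two requirements above can be checked like this:

```sh
# Kernel: idmapped-mount support for the relevant filesystems needs Linux 6.3+.
uname -r

# Container runtime: it must implement the CRI user namespace fields added in
# Kubernetes 1.27; check the reported runtime version against its documentation.
crictl version
```
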
<!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed.
-->
The kubelet is splitting the host UID/GID space for different pods, to use for
their user namespace mapping. The design allows for 65k pods per node (each pod
gets a block of 64Ki UIDs out of the 32-bit UID space, i.e. 2^32 / 2^16 = 65536
pods), and the resource is limited to maxPods per node (currently maxPods
defaults to 110, so it is unlikely we will reach 65k).

For container runtimes, they might use more disk space or inodes to chown the
rootfs. This is if they choose to support this feature without relying on new
kernel features (idmapped mounts).

###### How does this feature react if the API server and/or etcd is unavailable?

No changes to current kubelet behavior. The feature only uses kubelet-local information.

###### What are other known failure modes?

- Some filesystem used by the pod doesn't support idmap mounts on the kernel used.
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

    Just look at the pod events (see the example commands after this list); it fails with:

        Warning  Failed  2s (x2 over 4s)  kubelet, 127.0.0.1  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/f037a704-742c-40fe-8dbf-17ed9225c4df/volumes/kubernetes.io~empty-dir/hugepage: invalid argument (maybe the source filesystem doesn't support idmap mounts on this kernel?): unknown

    Note the "maybe the source filesystem doesn't support idmap mounts on this kernel?" part.

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

    Remove the pod.spec.HostUsers field or disable the feature gate.

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

    TODO: rata.

- Error getting the userns ID range configuration
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

    Pod errors.

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

    Disable the feature flag.

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

    TODO

  - Testing: Are there any tests for failure mode? If not, describe why.

    TODO

- Error saving/reading pod mappings
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

- Other errors
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?

  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?

  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?

    Not required until feature graduated to beta.

  - Testing: Are there any tests for failure mode? If not, describe why.

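For reference, the pod events mentioned above can be inspected from the API, without node access (pod and namespace names are placeholders):

```sh
# Events for a specific pod, including the runtime error shown above:
kubectl describe pod <pod-name> -n <namespace>

# Or query the events directly:
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
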
<!--
For each of them, fill in the following information by copying the below template:
-->