[WIP][DNM] Add support for updating oom_score_adj #4669
base: main
Signed-off-by: karthik-k-n <[email protected]> rebase
ea63098 to e18a918
I'm afraid there is an issue with adjusting the OOM score for a running container.
When we start a new container, we set oom_score_adj for the container init (pid 1) process, which is responsible for running all of the container's other processes. Those other processes are all children of init and naturally inherit init's oom score.
When you set oom_score_adj for a running container, you set it for the container init only, and possibly for processes that will be started later. This means the processes that are already running won't get the new oom score value.
@kolyshkin Thanks for the suggestion. |
The issue here is that the container itself (i.e. a process in it, like systemd) may also change the oom_score_adj of some of its children. This means that when you override it from the outside, any in-container configuration is lost. So I think we should treat oom_score_adj as "set on container start only". If you disagree, please describe your use case in detail. (I'd like to have a per-cgroup analog of oom_score_adj, but there's no such thing. To me, this should be controlled indirectly, by setting higher memory limits for containers which you don't want to be OOM killed.) |
Sure, thanks. I can provide more context behind this use case. We are proposing a new KEP, Node Resource Hot Plug, in which we are adding a feature to hotplug compute resources such as memory and CPU to a node. Currently in Kubernetes, the oom_score_adj for a container is calculated using the formula below:
Consider a scenario where a node has a few running pods and we hotplug resources to add memory to it. Example scenario:
In the above scenario, Pod 3 and Pod 5 are more prone to be killed when there is memory pressure. However, if we recalculate the oom_score_adj after the resize for the pods that existed before:
Pod 3 is likely to be killed first in case of memory pressure. After a resize without recalculating oom_score_adj, newly created pods are more prone to eviction due to their higher OOM score. Please let me know if there is a better way to handle this scenario. |
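The effect described above can be sketched in Go. This assumes the formula has the form `1000 - (1000*containerMemoryRequest)/memoryCapacity`, the same shape as the one quoted elsewhere in this thread. The function name `oomScoreAdj` and all sizes are hypothetical illustration values, not the numbers from the original scenario.

```go
package main

import "fmt"

// oomScoreAdj applies the assumed formula
// 1000 - (1000*memoryRequest)/memoryCapacity.
// Both arguments are in bytes; integer division truncates.
func oomScoreAdj(memoryRequest, memoryCapacity int64) int64 {
	return 1000 - (1000*memoryRequest)/memoryCapacity
}

func main() {
	const GiB = int64(1) << 30

	request := 1 * GiB // hypothetical pod memory request
	before := 8 * GiB  // hypothetical node capacity before hotplug
	after := 16 * GiB  // hypothetical node capacity after hotplug

	// A pod started before the resize keeps the score computed at start.
	fmt.Println("started before resize:", oomScoreAdj(request, before)) // 875
	// A pod with the same request started after the resize scores higher.
	fmt.Println("started after resize: ", oomScoreAdj(request, after)) // 938
}
```

With these numbers, two pods with identical requests end up with different scores purely because of when they were started, which is the imbalance the comment describes.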
Does setting |
To add another use case: for in-place pod resize (https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources), we allow Kubernetes memory requests to be adjusted without restart, and the oom_score_adj is calculated from memory requests, so we'd like to be able to update it. To the comment above, I'd be fine with saying we only support it with cgroups v2 when the group oom-killer behavior is set (the default on k8s with cgroups v2). |
Since the
The only thing you may need to add is a lower bound to the formula so that oom_score_adj stays at or above 0 (keeping the negative values to protect important system tasks, like ssh, that should not be killed). So, here's the new formula: oom_score := max(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity) With that, here's what we have before the resize:
And after the resize:
Now, what if instead of pods 4 and 5 we add a pod with 8 GB of RAM?
|
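The formula proposed in the comment above, computed against the initial capacity and bounded below at 0, can be sketched in Go as follows. The function name `clampedOOMScoreAdj` and the sizes used are hypothetical; the 8 GiB case only illustrates a request larger than the initial capacity and does not reproduce the numbers from the original example.

```go
package main

import "fmt"

// clampedOOMScoreAdj computes
// 1000 - (1000*memoryRequest)/initialMemoryCapacity
// and never returns less than 0, so negative values stay
// reserved for protected system tasks. (Hypothetical helper.)
func clampedOOMScoreAdj(memoryRequest, initialMemoryCapacity int64) int64 {
	score := 1000 - (1000*memoryRequest)/initialMemoryCapacity
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	const GiB = int64(1) << 30

	initial := 4 * GiB // hypothetical capacity when the node started
	for _, request := range []int64{1 * GiB, 2 * GiB, 8 * GiB} {
		fmt.Printf("request %d GiB -> oom_score_adj %d\n",
			request/GiB, clampedOOMScoreAdj(request, initial))
	}
}
```

Because the denominator is fixed at the initial capacity, a pod's score does not depend on whether it started before or after a hotplug. A request larger than the initial capacity would go negative without the bound, so it is held at 0.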
Thank you for the great idea. Somehow it didn't cross our minds, as we were only exploring the option of recalculating and updating the oom_score_adj. We will also check with sig-node for their opinion on this. |
Depending on the use case, it may or may not make sense; this is why it's a per-cgroup knob. |
Add support for updating oom_score_adj