
[WIP][DNM] Add support for updating oom_score_adj #4669

Draft
wants to merge 1 commit into base: main

Conversation


@Karthik-K-N commented Mar 10, 2025

Add support for updating oom_score_adj

Signed-off-by: karthik-k-n <[email protected]>

rebase
Contributor

@kolyshkin left a comment


I'm afraid there is an issue adjusting OOM score for a running container.

When we start a new container, we set oom_score_adj for the container init (pid 1) process, which is responsible for running all of the container's other processes. Those other processes are all children of init and naturally inherit init's oom score.

When you set oom_score_adj for a running container, you set it for container init only, and possibly for processes which will be started later. This means the processes that are already running won't get the new oom score value.
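
For reference, here is a minimal Go sketch (not runc's actual code; the PID and value below are hypothetical) of what "setting oom_score_adj for a process" means at the kernel interface level: writing the value to /proc/&lt;pid&gt;/oom_score_adj. Children forked after the write inherit the new value; children that are already running keep whatever they inherited at fork time.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// setOOMScoreAdj writes the given value to /proc/<pid>/oom_score_adj.
// The value applies to that process and to children forked afterwards,
// but not to children that are already running.
func setOOMScoreAdj(pid, value int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(value)), 0o600)
}

func main() {
	// Hypothetical example: adjust the score of a container init process.
	if err := setOOMScoreAdj(12345, 500); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set oom_score_adj:", err)
		os.Exit(1)
	}
}
```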

@Karthik-K-N
Author

@kolyshkin Thanks for the suggestion.
Given that we have an init process and a few child processes: if we want to update the child processes' oom_score_adj along with the parent's, any pointers on how we can enumerate the child processes today so that we can update them too?

@kolyshkin
Contributor

> @kolyshkin Thanks for the suggestion. Given that we have an init process and a few child processes: if we want to update the child processes' oom_score_adj along with the parent's, any pointers on how we can enumerate the child processes today so that we can update them too?

The issue here is that the container itself (i.e. a process in it, such as systemd) may also change the oom_score_adj of some of its children. Meaning, when you change it from the outside on top of that, any in-container configuration is lost.

So, I think, we should treat oom_score_adj as "set on container start only".

If you disagree, please describe your use case in details.

(I'd like to have a per-cgroup analog of oom_score_adj, but there is no such thing. To me, this should be controlled indirectly, by setting higher memory limits for containers which you don't want to be OOM-killed.)

@Karthik-K-N
Author

Sure, thanks. I can provide more context behind this use case.

We are proposing a new KEP, Node Resource Hot Plug, in which we are adding a feature to hotplug compute resources such as memory and CPU to a node.

Currently, in Kubernetes, a container's oom_score_adj is calculated using the formula below:

`oom_score_adj = 1000 - (1000 * containerMemReq) / memoryCapacity`

where:
- containerMemReq: the container's memory request
- memoryCapacity: the node's total memory capacity
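
For illustration, here is a minimal Go sketch of that calculation (names are ours, not kubelet's; integer division is used, so a value such as 937.5 in the tables below rounds to 938 here):

```go
package main

import "fmt"

// containerOOMScoreAdj mirrors the formula quoted above:
// oom_score_adj = 1000 - (1000 * containerMemReq) / memoryCapacity
func containerOOMScoreAdj(containerMemReqGB, memoryCapacityGB int64) int64 {
	return 1000 - (1000*containerMemReqGB)/memoryCapacityGB
}

func main() {
	// Node capacity before (8 GB) and after (16 GB) the memory hotplug.
	for _, capacityGB := range []int64{8, 16} {
		for _, reqGB := range []int64{4, 2, 1} {
			fmt.Printf("capacity=%2dGB request=%dGB oom_score_adj=%d\n",
				capacityGB, reqGB, containerOOMScoreAdj(reqGB, capacityGB))
		}
	}
}
```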

Consider a scenario where a node has a few running pods and we hotplug memory into it.
The previously calculated oom_score_adj values should be updated, since the node's memoryCapacity has changed. (The idea is to make an updated node behave as closely as possible to a node that was originally spun up with that capacity.)

Example scenario

1. Before resize, where the node has 8 GB of memory:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 1   | 8 GB        | 4 GB              | 500           |
| 2   | 8 GB        | 2 GB              | 750           |
| 3   | 8 GB        | 1 GB              | 875           |

2. Post resize to 16 GB, with a new set of pods with similar requests:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 4   | 16 GB       | 4 GB              | 750           |
| 5   | 16 GB       | 2 GB              | 875           |

In the above scenario, Pod 3 and Pod 5 are the most prone to being killed when there is memory pressure.

However, if we recalculate the oom_score_adj post resize for the pods that existed before:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 1   | 16 GB       | 4 GB              | 750           |
| 2   | 16 GB       | 2 GB              | 875           |
| 3   | 16 GB       | 1 GB              | 937.5         |

Pod 3 is likely to be killed first in case of memory pressure.

Post resize, without recalculating oom_score_adj, newly created pods are more prone to being OOM-killed than existing pods with the same request, due to their higher OOM scores.

Please let me know if there is a better way to handle this scenario.

@tallclair

> I'm afraid there is an issue adjusting OOM score for a running container.
>
> When we start a new container, we set oom_score_adj for the container init (pid 1) process, which is responsible for running all of the container's other processes. Those other processes are all children of init and naturally inherit init's oom score.
>
> When you set oom_score_adj for a running container, you set it for container init only, and possibly for processes which will be started later. This means the processes that are already running won't get the new oom score value.

Does setting memory.oom.group change this behavior? Do you know which oom_score_adj is used in this case?

@tallclair

To add another use case: for in-place pod resize (https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources), we allow Kubernetes memory requests to be adjusted without restart, and the oom_score_adj is calculated from memory requests, so we'd like to be able to update it.

To the comment above, I'd be fine with saying we only support it with cgroups v2 when the group oom-killer behavior is set (the default on k8s with cgroups v2).

@kolyshkin
Contributor

kolyshkin commented Mar 13, 2025

> Please let me know if there is a better way to handle this scenario.

Since the oom_score_adj values only make sense in proportion to each other, and they can't be easily changed when the pod is running, I think the best strategy to handle it is to NOT use the current RAM value in calculation, but rather use something constant. That constant can actually be the initial RAM size (or a value derived from it, such as a 2xRAM or 0.5xRAM).

The only thing you may need to add is a lower bound in the formula so that oom_score_adj stays at or above 0 (keeping negative values reserved for protecting important system tasks, like ssh, that should not be killed).

So, here's the new formula:

`oom_score_adj := max(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`

With that, here's what we have before the resize:

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |

And after the resize:

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |
| 4   | 4          | 500           |
| 5   | 2          | 750           |

Now, what if instead of pods 4 and 5 we add a pod with 8 GB of RAM?

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |
| 4   | 8          | 0             |
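
For illustration, a minimal Go sketch of this variant (function and variable names are ours, not from runc or kubelet): compute oom_score_adj against the node's initial memory capacity and clamp the result at 0, which reproduces the values in the tables above regardless of whether the node has since been resized.

```go
package main

import "fmt"

// oomScoreAdj uses the node's *initial* capacity as the constant denominator
// and clamps the result at 0 so negative values remain reserved for
// important system processes.
func oomScoreAdj(containerMemReqGB, initialCapacityGB int64) int64 {
	score := 1000 - (1000*containerMemReqGB)/initialCapacityGB
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	const initialCapacityGB = 8 // capacity at node bring-up; unchanged by hotplug
	for _, reqGB := range []int64{4, 2, 1, 8} {
		fmt.Printf("request=%dGB oom_score_adj=%d\n", reqGB, oomScoreAdj(reqGB, initialCapacityGB))
	}
	// Prints 500, 750, 875, and 0, matching the tables above.
}
```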

@Karthik-K-N
Author

> > Please let me know if there is a better way to handle this scenario.
>
> Since the oom_score_adj values only make sense in proportion to each other, and they can't be easily changed when the pod is running, I think the best strategy to handle it is to NOT use the current RAM value in calculation, but rather use something constant. That constant can actually be the initial RAM size (or a value derived from it, such as a 2xRAM or 0.5xRAM).
>
> The only thing you may need to add is a lower bound in the formula so that oom_score_adj stays at or above 0 (keeping negative values reserved for protecting important system tasks, like ssh, that should not be killed).
>
> So, here's the new formula:
>
> `oom_score_adj := max(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`
>
> With that, here's what we have before the resize:
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
>
> And after the resize:
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
> | 4   | 4          | 500           |
> | 5   | 2          | 750           |
>
> Now, what if instead of pods 4 and 5 we add a pod with 8 GB of RAM?
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
> | 4   | 8          | 0             |

Thank you for the great idea. Somehow it didn't cross our minds, as we were only exploring recalculating and updating the oom_score_adj.

We will also check with sig-node for their opinion on this.

@kolyshkin
Contributor

> Does setting memory.oom.group change this behavior? Do you know which oom_score_adj is used in this case?

memory.oom.group is kind of orthogonal. The OOM killer always operates on the process level (not on the container/cgroup/pod level) when choosing the process to kill. But if memory.oom.group is set for the cgroup that the chosen process is in, then the kernel kills all the processes in that cgroup.

Depending on the use case, this may or may not make sense, which is why it's a per-cgroup knob.
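
For reference, a minimal sketch (assuming cgroup v2 mounted at /sys/fs/cgroup; the cgroup path below is hypothetical) of enabling that per-cgroup knob by writing "1" to the cgroup's memory.oom.group file:

```go
package main

import (
	"os"
	"path/filepath"
)

// enableGroupOOMKill turns on group OOM-kill behavior for a cgroup v2 cgroup.
// The OOM killer still picks a victim process via oom_score/oom_score_adj,
// but with memory.oom.group set to 1 the kernel then kills every process
// in that cgroup together.
func enableGroupOOMKill(cgroupPath string) error {
	return os.WriteFile(filepath.Join(cgroupPath, "memory.oom.group"), []byte("1"), 0o600)
}

func main() {
	if err := enableGroupOOMKill("/sys/fs/cgroup/mypod"); err != nil {
		panic(err)
	}
}
```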
