
[WIP][DNM] Add support for updating oom_score_adj #4669

Draft
wants to merge 1 commit into base: main

Conversation


@Karthik-K-N commented Mar 10, 2025

Add support for updating oom_score_adj

Signed-off-by: karthik-k-n <[email protected]>

rebase
Contributor

@kolyshkin left a comment


I'm afraid there is an issue adjusting OOM score for a running container.

When we start a new container, we set oom_score_adj for the container init (pid 1) process, which is responsible for running all of the container's other processes. Those other processes are all children of init and naturally inherit init's oom score.

When you set oom_score_adj for a running container, you set it for container init only, and possibly for processes which will be started later. This means the processes that are already running won't get the new oom score value.
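
For reference, here is a minimal Go sketch (not runc's actual code; the PID and value below are hypothetical) of what "setting oom_score_adj for a process" means at the kernel interface level: writing the value to /proc/&lt;pid&gt;/oom_score_adj. Children forked after the write inherit the new value; children that are already running keep whatever they inherited at fork time.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// setOOMScoreAdj writes the given value to /proc/<pid>/oom_score_adj.
// The value applies to that process and to children forked afterwards,
// but not to children that are already running.
func setOOMScoreAdj(pid, value int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(value)), 0o600)
}

func main() {
	// Hypothetical example: adjust the score of a container init process.
	if err := setOOMScoreAdj(12345, 500); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set oom_score_adj:", err)
		os.Exit(1)
	}
}
```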

@Karthik-K-N
Author

@kolyshkin Thanks for the suggestion.
Given that we have an init process and a few child processes: if we want to update the child processes' oom_score_adj along with the parent's, any pointers on how we can enumerate the child processes today so that we can update them too?

@kolyshkin
Contributor

> @kolyshkin Thanks for the suggestion. Given that we have an init process and a few child processes: if we want to update the child processes' oom_score_adj along with the parent's, any pointers on how we can enumerate the child processes today so that we can update them too?

The issue here is that the container itself (i.e. a process in it, such as systemd) may also change the oom_score_adj of some of its children. Meaning, when you change it from the outside on top of that, any in-container configuration is lost.

So, I think, we should treat oom_score_adj as "set on container start only".

If you disagree, please describe your use case in details.

(I'd like to have a per-cgroup analog of oom_score_adj, but there is no such thing. To me, this should be controlled indirectly, by setting higher memory limits for containers which you don't want to be OOM-killed.)

@Karthik-K-N
Author

Sure, thanks. I can provide more context behind this use case.

We are proposing a new KEP, Node Resource Hot Plug, in which we are adding a feature to hotplug compute resources such as memory and CPU to a node.

Currently, in Kubernetes, a container's oom_score_adj is calculated using the formula below:

`oom_score_adj = 1000 - (1000 * containerMemReq) / memoryCapacity`

where:
- containerMemReq: the container's memory request
- memoryCapacity: the node's total memory capacity
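
For illustration, here is a minimal Go sketch of that calculation (names are ours, not kubelet's; integer division is used, so a value such as 937.5 in the tables below rounds to 938 here):

```go
package main

import "fmt"

// containerOOMScoreAdj mirrors the formula quoted above:
// oom_score_adj = 1000 - (1000 * containerMemReq) / memoryCapacity
func containerOOMScoreAdj(containerMemReqGB, memoryCapacityGB int64) int64 {
	return 1000 - (1000*containerMemReqGB)/memoryCapacityGB
}

func main() {
	// Node capacity before (8 GB) and after (16 GB) the memory hotplug.
	for _, capacityGB := range []int64{8, 16} {
		for _, reqGB := range []int64{4, 2, 1} {
			fmt.Printf("capacity=%2dGB request=%dGB oom_score_adj=%d\n",
				capacityGB, reqGB, containerOOMScoreAdj(reqGB, capacityGB))
		}
	}
}
```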

Consider a scenario where a node has a few running pods and we hotplug memory into it.
The previously calculated oom_score_adj values should be updated, since the node's memoryCapacity has changed. (The idea is to make an updated node behave as closely as possible to a node that was originally spun up with that capacity.)

Example scenario

1. Before resize, where the node has 8 GB of memory:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 1   | 8 GB        | 4 GB              | 500           |
| 2   | 8 GB        | 2 GB              | 750           |
| 3   | 8 GB        | 1 GB              | 875           |

2. Post resize to 16 GB, with a new set of pods with similar requests:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 4   | 16 GB       | 4 GB              | 750           |
| 5   | 16 GB       | 2 GB              | 875           |

In the above scenario, Pod 3 and Pod 5 are the most prone to being killed when there is memory pressure.

However, if we recalculate the oom_score_adj post resize for the pods that existed before:

| Pod | Node memory | Container request | oom_score_adj |
|-----|-------------|-------------------|---------------|
| 1   | 16 GB       | 4 GB              | 750           |
| 2   | 16 GB       | 2 GB              | 875           |
| 3   | 16 GB       | 1 GB              | 937.5         |

Pod 3 is likely to be killed first in case of memory pressure.

Post resize, without recalculating oom_score_adj, newly created pods are more prone to being OOM-killed than existing pods with the same request, due to their higher OOM scores.

Please let me know if there is a better way to handle this scenario.

@tallclair

> I'm afraid there is an issue adjusting OOM score for a running container.
>
> When we start a new container, we set oom_score_adj for the container init (pid 1) process, which is responsible for running all of the container's other processes. Those other processes are all children of init and naturally inherit init's oom score.
>
> When you set oom_score_adj for a running container, you set it for container init only, and possibly for processes which will be started later. This means the processes that are already running won't get the new oom score value.

Does setting memory.oom.group change this behavior? Do you know which oom_score_adj is used in this case?

@tallclair

To add another use case: for in-place pod resize (https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources), we allow Kubernetes memory requests to be adjusted without restart, and the oom_score_adj is calculated from memory requests, so we'd like to be able to update it.

To the comment above, I'd be fine with saying we only support it with cgroups v2 when the group oom-killer behavior is set (the default on k8s with cgroups v2).

@kolyshkin
Contributor

kolyshkin commented Mar 13, 2025

> Please let me know if there is a better way to handle this scenario.

Since the oom_score_adj values only make sense in proportion to each other, and they can't be easily changed when the pod is running, I think the best strategy to handle it is to NOT use the current RAM value in calculation, but rather use something constant. That constant can actually be the initial RAM size (or a value derived from it, such as a 2xRAM or 0.5xRAM).

The only thing you may need to add is a lower bound in the formula so that oom_score_adj stays at or above 0 (keeping negative values reserved for protecting important system tasks, like ssh, that should not be killed).

So, here's the new formula:

`oom_score_adj := max(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`

With that, here's what we have before the resize:

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |

And after the resize:

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |
| 4   | 4          | 500           |
| 5   | 2          | 750           |

Now, what if instead of pods 4 and 5 we add a pod with 8 GB of RAM?

| POD | Memory, GB | OOM score adj |
|-----|------------|---------------|
| 1   | 4          | 500           |
| 2   | 2          | 750           |
| 3   | 1          | 875           |
| 4   | 8          | 0             |
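
For illustration, a minimal Go sketch of this variant (function and variable names are ours, not from runc or kubelet): compute oom_score_adj against the node's initial memory capacity and clamp the result at 0, which reproduces the values in the tables above regardless of whether the node has since been resized.

```go
package main

import "fmt"

// oomScoreAdj uses the node's *initial* capacity as the constant denominator
// and clamps the result at 0 so negative values remain reserved for
// important system processes.
func oomScoreAdj(containerMemReqGB, initialCapacityGB int64) int64 {
	score := 1000 - (1000*containerMemReqGB)/initialCapacityGB
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	const initialCapacityGB = 8 // capacity at node bring-up; unchanged by hotplug
	for _, reqGB := range []int64{4, 2, 1, 8} {
		fmt.Printf("request=%dGB oom_score_adj=%d\n", reqGB, oomScoreAdj(reqGB, initialCapacityGB))
	}
	// Prints 500, 750, 875, and 0, matching the tables above.
}
```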

@Karthik-K-N
Author

> > Please let me know if there is a better way to handle this scenario.
>
> Since the oom_score_adj values only make sense in proportion to each other, and they can't be easily changed when the pod is running, I think the best strategy to handle it is to NOT use the current RAM value in calculation, but rather use something constant. That constant can actually be the initial RAM size (or a value derived from it, such as a 2xRAM or 0.5xRAM).
>
> The only thing you may need to add is a lower bound in the formula so that oom_score_adj stays at or above 0 (keeping negative values reserved for protecting important system tasks, like ssh, that should not be killed).
>
> So, here's the new formula:
>
> `oom_score_adj := max(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`
>
> With that, here's what we have before the resize:
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
>
> And after the resize:
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
> | 4   | 4          | 500           |
> | 5   | 2          | 750           |
>
> Now, what if instead of pods 4 and 5 we add a pod with 8 GB of RAM?
>
> | POD | Memory, GB | OOM score adj |
> |-----|------------|---------------|
> | 1   | 4          | 500           |
> | 2   | 2          | 750           |
> | 3   | 1          | 875           |
> | 4   | 8          | 0             |

Thank you for the great idea. Somehow it didn't cross our minds, as we were only exploring recalculating and updating the oom_score_adj.

We will also check with sig-node for their opinion on this.

@kolyshkin
Contributor

> Does setting memory.oom.group change this behavior? Do you know which oom_score_adj is used in this case?

memory.oom.group is kind of orthogonal. The OOM killer always operates on the process level (not on the container/cgroup/pod level) when choosing the process to kill. But if memory.oom.group is set for the cgroup that the chosen process is in, then the kernel kills all the processes in that cgroup.

Depending on the use case, this may or may not make sense, which is why it's a per-cgroup knob.
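
For reference, a minimal sketch (assuming cgroup v2 mounted at /sys/fs/cgroup; the cgroup path below is hypothetical) of enabling that per-cgroup knob by writing "1" to the cgroup's memory.oom.group file:

```go
package main

import (
	"os"
	"path/filepath"
)

// enableGroupOOMKill turns on group OOM-kill behavior for a cgroup v2 cgroup.
// The OOM killer still picks a victim process via oom_score/oom_score_adj,
// but with memory.oom.group set to 1 the kernel then kills every process
// in that cgroup together.
func enableGroupOOMKill(cgroupPath string) error {
	return os.WriteFile(filepath.Join(cgroupPath, "memory.oom.group"), []byte("1"), 0o600)
}

func main() {
	if err := enableGroupOOMKill("/sys/fs/cgroup/mypod"); err != nil {
		panic(err)
	}
}
```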
