
Fix CPU and memory affinity under external resource management #3012

Open
wants to merge 1 commit into main

Conversation

askervin

  • Fixes CPU affinity when running inference on CPU and the CPUs are externally managed, for instance with taskset, numactl, cgroups, the Kubernetes CPU manager, or NRI resource policy plugins.

  • Detects external CPU management and trusts the external CPU manager completely. The external manager is more likely to have the big picture of all the other tasks running on the system, their QoS, hardware characteristics, etc.

  • In particular, memory affinity is not modified either: the external manager may know better which NUMA node has the fastest memory, or which NUMA nodes have enough free memory for this inference.

Fixes #3011


Signed-off-by: Antti Kervinen <[email protected]>
@askervin
Author

@OlivierDehaene, @Narsil, @sywangyi, would you have time to check out this PR and related issue, please?

@@ -102,6 +102,39 @@ def get_sliding_windows() -> int:


def init_cpu_threads_env(rank_id: int, world_size: int):
    import psutil
    allowed_cpus = psutil.Process().cpu_affinity()
    if len(allowed_cpus) < psutil.cpu_count(logical=True):
Contributor

Are the allowed CPUs logical CPU numbers? Why are physical CPU numbers not used?

Author

First, we want to detect whether CPU affinity is already enforced on this process. In other words, we want to know if somebody (an external entity) has already chosen the CPUs that this process can use.

To check this, on line 106 we fetch the currently allowed CPUs from psutil.Process().cpu_affinity(). It returns every logical CPU that this process is allowed to use.

On line 107 we compare this number to the number of logical CPUs in the system. (Comparing to physical CPU cores would not make any sense.) If fewer CPUs are allowed for this process than there are logical CPUs in the system, we conclude that an external CPU manager takes care of CPU affinity, and we use exactly those CPUs it has chosen. On the other hand, if there is no CPU manager, we assume we should choose CPUs ourselves and continue with the logic that was implemented in your patch, @sywangyi.

The new logic is: if somebody has already chosen CPUs for this process, then this process uses all of them. We trust that the external entity knows the best way to run this process on the hardware at hand. If a user wants to use every hyperthread of certain CPU cores (for instance with taskset -c ...), then we adapt to that. And if the user wants to run this process on only one hyperthread of every CPU core, then the user can choose the logical CPUs correspondingly.

The reason for choosing this logic: if we do not use all allowed CPUs, or if we keep setting memory affinity only to the node that holds our allowed CPUs, we prevent this process from running optimally on various platforms. There are cases where it's best to use hyperthreads, and there are cases where it's best to use memory from a CPU-less NUMA node. This process must not pretend it knows better how to run on every platform unless it really does.
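For illustration, here is a minimal sketch of the detection logic described above, simplified from the actual diff; the helper name is illustrative, not part of the PR:

import psutil

def detect_external_cpu_manager():
    # Logical CPUs this process is currently allowed to run on.
    allowed_cpus = psutil.Process().cpu_affinity()
    # If fewer CPUs are allowed than the system has, somebody (taskset, cgroups,
    # Kubernetes, an NRI plugin, ...) has already chosen CPUs for us: use all of them.
    if len(allowed_cpus) < psutil.cpu_count(logical=True):
        return sorted(allowed_cpus)
    # No external manager detected: fall back to choosing CPUs ourselves.
    return None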

Collaborator

This only works with --cpuset and doesn't seem to work with other forms of CPU control (like kube requests/limits), right?

For reference, cuda/rocm have a unified env variable used across the ecosystem to ask for visibility on the devices. Is there potentially such a variable that we could reuse here?

In general, both functions should probably be merged together. It seems we're only changing how we resolve the node_id here, right? (So without the subset it could resolve to some range(n).)

Sidenote: are there any reasons to use anything else as a config than the max number of threads per CPU (eventually reduced to some subset of the actual machine's physical CPUs)? Are there benefits to using fewer threads than the max allowed, or more?

Author

Kube requests/limits

This PR works with Kubernetes. In these setups, Kubernetes requests/limits affect the number of allowed CPUs. Exactly how many CPUs, and which ones, depends on the CPU manager or NRI policy configuration, the hardware topology, and the other workloads running on the same server. All these policies eventually take effect through the cpuset.cpus controller in cgroups.
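As an illustration of that last point (assuming cgroup v2 mounted at /sys/fs/cgroup), the restriction is visible both in the cgroup files and in the scheduler affinity:

import os

def effective_cpuset(path="/sys/fs/cgroup/cpuset.cpus.effective"):
    # Returns e.g. "0-3,8-11", or None if the file is not present in this environment.
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

print("cgroup cpuset.cpus.effective:", effective_cpuset())
print("sched_getaffinity:", sorted(os.sched_getaffinity(0)))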

For reference, cuda/rocm have a unified env variable used across the ecosystem to ask for visibility on the devices. Is there potentially such a variable that we could reuse here?

I’d separate two levels here:

  1. Get the allowed CPUs from the platform. On Linux, the most reliable method is sched_getaffinity(), which beats reading /sys/fs/cgroup or even /proc/pid/status; psutil uses it behind the scenes. This works the same way in all cases: taskset, numactl, docker --cpuset-cpus, and the above-mentioned Kubernetes resource managers. There is no need for environment variables.
  2. Introduce workload-specific variables that tell how the allowed resources must be used. These are non-standard: there is no common practice for communicating which CPUs and memory must/should/could be used. vLLM, for instance, uses the LLM_CPU_OMP_THREADS_BIND environment variable, with which the user can specify, for instance, “run shard-0 on CPUs 2-7 and shard-1 on CPUs 8-13” with the syntax 2-7|8-13 (see the sketch after this list).
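For concreteness, a hypothetical parser for that kind of '|'-separated binding string could look like this; the variable name follows the comment above and is only an example:

import os

def parse_threads_bind(spec):
    # "2-7|8-13" -> [[2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13]]
    shards = []
    for field in spec.split("|"):
        cpus = []
        for part in field.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.extend(range(int(lo), int(hi) + 1))
            else:
                cpus.append(int(part))
        shards.append(cpus)
    return shards

spec = os.environ.get("LLM_CPU_OMP_THREADS_BIND", "2-7|8-13")
print(parse_threads_bind(spec))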

I agree that going to Level 2 makes sense at some point. In fact, I’m interested in prototyping text-generation-server performance where we tell the kernel to spread the model’s memory pages across several memory controllers. This may improve performance thanks to more available memory bandwidth. For that I’ll have to introduce an environment variable in any case, because there is no common way (syscalls and cgroups) to enforce memory policy from the outside like there is for CPU affinity.

If you like it, @Narsil, I could see if I could fit support for LLM_CPU_OMP_THREADS_BIND nicely, too. What do you think?

To keep changes comprehensible, focused and well justified, I’d prefer having this PR as a minimal bug fix to prevent resource underutilization and allow assigning multiple inferences on the same server with good performance. I can write another PR with environment variables.

In general, both functions should probably be merged together. It seems we're only changing how we resolve the node_id here, right? (So without the subset it could resolve to some range(n).)

There is a fundamental difference between the two functions, and in the two cases in general:

  1. [New] Someone who knows the platform and the other workloads better than us has chosen a set of CPUs and memories for us. Use everything without questioning the choice. For instance, use both hyperthreads of every physical CPU core, or use 24 CPU cores from one socket and 12 from another, if that is what was given.
  2. [Old] Assume we have the whole system for running this single inference. Nobody knows better how to use the resources here. Feel free to apply our own tricks, like using only CPUs from a single socket.

I'm afraid that handling these quite different cases in the same function would be a bit messy. At least it got messy when I first tried implementing this fix that way: there were unnecessarily many control paths with not that many common elements after all.

Furthermore, think of future fixes, for instance supporting Xeons with SNC enabled. (SNC exposes every socket as multiple NUMA nodes; however, memory access across sub-NUMA nodes is much faster when they are on the same socket than when they are on different sockets.) If we want optimal automatic performance in this case, the [Old] function needs to be modified so that it detects that CPUs and memories are attached to the same socket even though the OS reports them as separate NUMA nodes. Currently the [Old] function does not do that, so it uses only a fraction of a socket’s CPUs and memory bandwidth when SNC is enabled. My point is: it will be cleaner to implement this kind of fix in the Case 2 [Old] function and keep Case 1 [New] untouched. The Case 2 function will absorb most, if not all, of the internal understanding of various platforms and how to run fast on them.
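To sketch what that detection could look like (standard Linux sysfs paths, error handling omitted, not part of this PR): map each NUMA node to the physical package of its CPUs, so that sub-NUMA nodes of the same socket can be grouped together.

import glob
import os

def numa_nodes_per_socket():
    sockets = {}
    for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node = int(os.path.basename(node_dir)[len("node"):])
        for cpu_link in glob.glob(os.path.join(node_dir, "cpu[0-9]*")):
            pkg_file = os.path.join(cpu_link, "topology/physical_package_id")
            if os.path.exists(pkg_file):
                with open(pkg_file) as f:
                    socket = int(f.read())
                sockets.setdefault(socket, set()).add(node)
                break  # one CPU is enough to identify the node's socket
    return sockets  # e.g. {0: {0, 1}, 1: {2, 3}} when SNC splits each socket in two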

Sidenote: are there any reasons to use anything else as a config than the max number of threads per CPU (eventually reduced to some subset of the actual machine's physical CPUs)? Are there benefits to using fewer threads than the max allowed, or more?

The easiest case first: more threads than (logical) CPUs? I cannot imagine where this would make sense. I’ve seen horrible performance (futex storms with little progress) when there has been more than one thread per logical CPU. That’s one reason for this PR to exist.

On the other hand, using fewer threads than there are allowed CPUs is quite possible. Inferences are not the only things running on servers with hundreds of CPUs and terabytes of memory, and CPUs share caches. For instance, consider a case where every block of 4 CPUs shares a common L2 cache. It might be optimal to run something memory/cache intensive, like text-generation-server, on only one CPU per block and put compute-intensive tasks on the remaining CPUs of each block. With this split the tasks would not interfere much with each other, every CPU running inference would get the maximal L2 cache, and still no CPUs would be wasted. There are even CPU designs where a block of 8 CPUs shares a single FPU.
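Purely as an illustration of that scenario (again not part of this PR), one could group logical CPUs by the L2 cache they share via sysfs and pick one CPU per group for the cache-hungry threads:

import glob
import re

def cpus_grouped_by_l2():
    groups = {}
    for index_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index*"):
        with open(index_dir + "/level") as f:
            if f.read().strip() != "2":
                continue
        with open(index_dir + "/shared_cpu_list") as f:
            shared = f.read().strip()  # e.g. "0-3" for a 4-CPU L2 block
        cpu = int(re.search(r"/cpu(\d+)/cache/", index_dir).group(1))
        groups.setdefault(shared, set()).add(cpu)
    return groups

# One cache-hungry inference thread per L2 block, leaving the rest for other tasks.
one_cpu_per_l2 = sorted(min(cpus) for cpus in cpus_grouped_by_l2().values())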

Therefore, let’s try to end up with a design where users have enough power to specify the alignment they wish. Your point on environment variables (LLM_CPU_OMP_THREADS_BIND, maybe something on memory policy, too) is definitely valid for that.

Development

Successfully merging this pull request may close these issues.

Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off
3 participants