Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off #3011
- Fixes CPU affinity when running inference on CPU, and when CPUs are externally managed using taskset, numactl, cgroups, the Kubernetes CPU manager, or NRI resource policy plugins, for instance.
- Detect external CPU management and trust the external CPU manager completely. It is more likely that the external manager has the big picture of all the other tasks running on the system, their QoS, hardware characteristics, etc.
- For instance, do not modify even memory affinity, because the external manager may know better which NUMA node has the fastest memory, or which NUMA nodes have enough free memory for this inference.

Fixes: huggingface#3011
Signed-off-by: Antti Kervinen <[email protected]>
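For illustration, here is a minimal Python sketch of the detection idea from the commit message above (assuming Linux and contiguous CPU numbering; a simplified heuristic, not the actual patch):

```python
import os

def externally_managed() -> bool:
    """Heuristic: if the CPUs this process may run on are already a strict
    subset of all online CPUs, an external manager (taskset, cgroups cpuset,
    Kubernetes CPU manager, ...) has pinned the process."""
    allowed = os.sched_getaffinity(0)          # kernel-enforced CPU set
    online = set(range(os.cpu_count() or 1))   # assumes CPUs numbered 0..N-1
    return allowed != online

if externally_managed():
    # Trust the external manager: leave CPU and memory affinity untouched.
    pass
else:
    # No external pinning detected: NUMA-aware affinity may be applied.
    pass
```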
The indicated commit is HEAD from 4 days ago, not the commit that regressed things in v2.3.0?
Thanks @eero-t, edited the first line to include the commit where this implementation was introduced!
Do you have numbers on how significant the perf slowdown can be compared to proper affinity control? If yes, maybe that info could go right in the title, as motivation: "50% slowdown: CPU affinity ..."?
From a token-throughput point of view, on a two-socket system this issue prevents gaining the 2x throughput (4x on a four-socket system) that would be achieved by running multiple inference instances with proper affinity. Considering the overall performance of all the workloads running on the same server, this issue wastes resources that other workloads could use, because the implementation expects it can take all CPUs of a single socket. However, it is realistic to get > 90 % of the token throughput with only 20 % of the CPUs in a socket (depending on core count, memory bandwidth, etc.). What performance gain the other workloads can produce with 80 % of the CPUs in one socket available depends on those workloads. Editing the title on this basis...
System Info
The CPU affinity implementation (introduced in commit 59922f9, first released in v2.3.0, still present in current HEAD 4b8cda6) ignores already existing CPU pinning for the process.
Reproduction
Run an inference instance on CPU so that CPU and memory affinity are set to some custom value (for instance, 8 CPUs from socket 1). CPU and memory affinity can be externally managed using, for instance, taskset, numactl, the cgroups cpuset controller, Docker, the Kubernetes CPU manager, or NRI resource policies.
Despite the affinity you choose, the implementation tries to use all CPUs in one NUMA node, even if the OS does not allow that, and even if you wanted to run the process on the CPUs of another NUMA node.
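One way to observe the kernel-enforced mask that the implementation overrides is a small Python check run under taskset (the CPU range 8-15 is only an example):

```python
# check_affinity.py -- run as: taskset -c 8-15 python3 check_affinity.py
import os

# The kernel-enforced set of CPUs this process is allowed to run on.
allowed = sorted(os.sched_getaffinity(0))
print(f"allowed CPUs: {allowed}")

# Compare this against the CPUs the inference threads actually use, e.g.
# with `taskset -p <pid>` or the Cpus_allowed_list field in
# /proc/<pid>/status while the server is running.
```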
This issue prevents running several inference instances on the same system, because they cannot be assigned to separate sockets, (sub-)NUMA nodes, or any other disjoint CPU sets. However, running several instances on multi-socket systems significantly increases total token throughput per system, and several instances even on a single-socket system allow serving more customers with reasonable latency, provided that the instances run on disjoint CPU sets.
This issue also prevents platform-specific optimization of a single inference instance, as the implicit assumptions are not optimal on every platform. For instance, the implementation prevents using the high-bandwidth memory (HBM) built directly into some CPU chips (Xeon MAX). This happens because HBM is visible as a CPU-less NUMA node, while the current implementation sets memory affinity only to the NUMA node that includes CPUs from the first socket. Also, if sub-NUMA clustering (SNC) is on, the implementation uses only a fraction of the CPUs available in a socket.
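For reference, CPU-less (memory-only) NUMA nodes such as HBM on Xeon MAX can be spotted in sysfs; a short illustrative Python scan (standard Linux sysfs paths, not code from this repository):

```python
import glob
import os

# List NUMA nodes and flag CPU-less ones: an HBM-only node has an empty
# cpulist, yet it is exactly the node a memory-affinity policy may want.
for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_dir)
    with open(os.path.join(node_dir, "cpulist")) as f:
        cpulist = f.read().strip()
    print(f"{node}: {'CPU-less (memory-only)' if not cpulist else 'CPUs ' + cpulist}")
```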
Expected behavior
If CPU and memory affinity have already been set, the inference instance must detect and respect those restrictions and adjust its own behavior accordingly. It cannot use CPUs or memories other than the allowed ones anyway.
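A minimal Python sketch of this expected behavior (`pick_cpus` and the node-0 CPU set are hypothetical, for illustration only):

```python
import os

def pick_cpus(preferred: set[int]) -> set[int]:
    """Intersect the CPUs the server would prefer (e.g. one NUMA node)
    with the set the OS actually allows; if nothing remains, fall back
    to the allowed set instead of overriding the external manager."""
    allowed = os.sched_getaffinity(0)
    return (preferred & allowed) or allowed

# Hypothetical preferred set: the CPUs of NUMA node 0.
preferred_node0 = {0, 1, 2, 3, 4, 5, 6, 7}
os.sched_setaffinity(0, pick_cpus(preferred_node0))
```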