Fix CPU and memory affinity under external resource management #3012
Open · askervin wants to merge 1 commit into huggingface:main from askervin:5ZQ_detect_external_cpu_affinity
Are the allowed CPUs logical CPU numbers? Why aren't physical CPU numbers used?
First, we want to detect whether CPU affinity is already enforced on this process. In other words, we want to know if somebody (an external entity) has already chosen the CPUs that this process may use.
To check this, on line 106 we fetch the currently allowed CPUs from psutil.Process().cpu_affinity(). It returns every logical CPU that this process is allowed to use.
On line 107 we compare that number to the number of all logical CPUs in the system. (Comparing to physical CPU cores would not make sense.) If fewer CPUs are allowed to this process than there are logical CPUs in the system, we conclude that an external CPU manager takes care of CPU affinity, and we use exactly the CPUs it has chosen. On the other hand, if there is no CPU manager, we assume that we should choose CPUs ourselves and continue with the logic implemented in your patch, @sywangyi.
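A minimal sketch of that check, assuming psutil is available (the helper name is illustrative, not the PR's actual code):

```python
import psutil

def externally_managed_cpus():
    # Logical CPUs this process is currently allowed to use; cgroups cpusets,
    # taskset and Kubernetes CPU managers are all reflected here.
    allowed = psutil.Process().cpu_affinity()
    # Every logical CPU present in the system.
    all_logical = psutil.cpu_count(logical=True)
    if len(allowed) < all_logical:
        # Somebody has already chosen our CPUs: use exactly those.
        return sorted(allowed)
    # No external CPU manager detected: fall back to choosing CPUs ourselves.
    return None
```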
The new logic is: if somebody has already chosen CPUs for this process, then this process uses all of them. We trust that the external entity knows the best way to run this process on each piece of hardware. If a user wants to use every hyperthread of certain CPU cores (for instance via taskset -c ...), then we adapt to that. And if the user wants to run this process on only one hyperthread of every CPU core, then the user can choose logical CPUs accordingly. The reason for choosing this logic is: if we do not use all allowed CPUs, or if we keep setting memory affinity only on the node with our allowed CPUs, this process prevents itself from running optimally on various platforms. There are cases where it's best to use hyperthreads, and there are cases where it's best to use memory from a CPU-less NUMA node. This process must not pretend it knows better how to run on every platform unless it really does.
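For example (a hedged sketch, not the PR's code): under taskset -c 0,2,4,6 the process sees exactly those CPUs and can size its OpenMP thread pool to match:

```python
import os
import psutil

allowed = psutil.Process().cpu_affinity()  # e.g. [0, 2, 4, 6] under `taskset -c 0,2,4,6`
# Use every CPU the external entity granted us: one OpenMP worker per allowed CPU.
# This must happen before the OpenMP runtime is initialized.
os.environ.setdefault("OMP_NUM_THREADS", str(len(allowed)))
```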
This only works with --cpuset and doesn't seem to work with other forms of CPU control (like kube requests/limits), right? For reference, cuda/rocm have a unified env variable used across the ecosystem to ask for visibility on the devices. Is there potentially such a variable that we could reuse here?
In general, both functions should probably be merged together. It seems we're only changing how we resolve the node_id here, right? (So without the subset it could resolve to some range(n).)
Sidenote: are there any reasons to use anything other than the max number of threads per CPU (eventually reduced to some subset of the actual machine's physical CPUs) as the config? Are there benefits to using fewer threads than the max allowed, or more?
This PR works with Kubernetes: when the kubelet's CPU manager or an NRI resource policy restricts a container's CPUs, Kubernetes requests/limits affect the number of allowed CPUs. Exactly how many CPUs, and which ones, depends on the CPU manager or NRI policy configuration, the hardware topology, and the other workloads running on the same server. All of these policies eventually take effect through the cpuset.cpus controller in cgroups.
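For illustration, a hedged sketch (assuming cgroup v2 on Linux; not part of the PR) of where such a restriction becomes visible inside the container:

```python
from pathlib import Path

import psutil

# Effective cpuset written by the kubelet / container runtime (cgroup v2).
cgroup_cpus = Path("/sys/fs/cgroup/cpuset.cpus.effective").read_text().strip()
print("cpuset.cpus.effective:", cgroup_cpus)              # e.g. "2-7"

# The same restriction, seen through the process's CPU affinity mask.
print("cpu_affinity:", psutil.Process().cpu_affinity())   # e.g. [2, 3, 4, 5, 6, 7]
```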
I’d separate two levels here: Level 1 is respecting a CPU affinity that has already been enforced on the process (cgroups cpusets, taskset, Kubernetes CPU managers), which is what this PR does. Level 2 is letting the user explicitly configure the binding, for instance through an environment variable with a value like 2-7|8-13. I agree that going to Level 2 makes sense at some point. In fact, I’m interested in prototyping text-generation-server performance so that we would tell the kernel to spread the model’s memory pages across several memory controllers. This may improve performance thanks to more available memory bandwidth. For that, I’ll have to introduce an environment variable in any case, because there is no common way (syscall or cgroups) to enforce memory policy like there is for CPU affinity.
If you like, @Narsil, I could see whether I can fit in support for LLM_CPU_OMP_THREADS_BIND nicely, too. What do you think?
To keep the changes comprehensible, focused and well justified, I’d prefer to keep this PR a minimal bug fix that prevents resource underutilization and allows assigning multiple inference instances to the same server with good performance. I can write another PR for the environment variables.
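If the env-variable route is taken, here is a hedged sketch of parsing a value like 2-7|8-13 (assuming the format is one CPU list per instance/rank, separated by |; the parser below is illustrative, not existing code):

```python
import os

def parse_threads_bind(value):
    """Parse e.g. "2-7|8-13" into [[2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13]]."""
    groups = []
    for group in value.split("|"):
        cpus = []
        for part in group.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.extend(range(int(lo), int(hi) + 1))
            else:
                cpus.append(int(part))
        groups.append(cpus)
    return groups

binding = os.environ.get("LLM_CPU_OMP_THREADS_BIND", "")
if binding:
    print(parse_threads_bind(binding))
```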
There is a fundamental difference between the two functions, and between the two cases in general: in Case 1 [New] an external entity has already set the CPU affinity and we simply use what it allows, while in Case 2 [Old] nothing is enforced and we have to pick the CPUs and the NUMA node ourselves.
I'm afraid that handling these quite different cases in the same function would be a bit messy. At least it got messy when I first tried implementing this fix that way: there were unnecessarily many control paths through the function, without that many common elements after all.
Furthermore, think of future fixes, for instance supporting Xeons with SNC enabled. (SNC exposes every socket as multiple NUMA nodes. However, memory access across sub-NUMA nodes is much faster when they are on the same socket than when they are on different sockets.) If we want optimal automatic performance in this case, the [Old] function needs to be modified so that it detects when CPUs and memory are attached to the same socket even though the OS reports them as separate NUMA nodes. Currently the [Old] function does not do that, so it uses only a fraction of a socket's CPUs and memory bandwidth when SNC is enabled. My point is: it will be cleaner to implement these kinds of fixes in the Case 2 [Old] function and keep the Case 1 [New] function untouched. The Case 2 function will absorb most, if not all, of the internal understanding of the various platforms and how to run fast on them.
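As an illustration of what such a detection could look at (a hedged sketch based on Linux sysfs paths; not part of this PR), sub-NUMA nodes can be grouped back to their physical socket:

```python
import glob
import os

def numa_nodes_by_socket():
    """Group NUMA nodes by physical socket, e.g. {0: {0, 1}, 1: {2, 3}} with SNC-2."""
    sockets = {}
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        with open(os.path.join(cpu_dir, "topology/physical_package_id")) as f:
            socket = int(f.read())
        # Each cpuN directory contains a nodeM symlink pointing at its NUMA node.
        node_link = glob.glob(os.path.join(cpu_dir, "node[0-9]*"))[0]
        node = int(os.path.basename(node_link)[len("node"):])
        sockets.setdefault(socket, set()).add(node)
    return sockets

print(numa_nodes_by_socket())
```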
The easiest case first: more threads than (logical) CPUs? I cannot imagine where this would make sense. I’ve seen horrible performance (futex storms with little progress) whenever there has been more than one thread per logical CPU. That’s one reason for this PR to exist.
On the other hand, using fewer threads than there are allowed CPUs is quite possible. Inference is not the only thing running on servers with hundreds of CPUs and terabytes of memory. CPUs share caches. For instance, consider a case where every block of 4 CPUs shares a common L2 cache. It might be optimal to use only one CPU per block to run something memory/cache intensive, like text-generation-server, and put only compute-intensive tasks on the remaining CPUs of each block. With this split the tasks would not interfere much with each other, every CPU running inference would have the maximal L2 cache, and yet no CPUs would be wasted. There are even CPU designs where a block of 8 CPUs shares a single FPU.
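For instance (a hedged sysfs-based sketch, not proposed code), one could keep a single allowed CPU per shared L2 cache:

```python
import psutil

def one_cpu_per_l2(allowed_cpus):
    # On most x86 machines cache/index2 is the L2 cache; shared_cpu_list tells
    # which logical CPUs share it. Keep only the first allowed CPU per group.
    chosen, seen_groups = [], set()
    for cpu in sorted(allowed_cpus):
        path = f"/sys/devices/system/cpu/cpu{cpu}/cache/index2/shared_cpu_list"
        with open(path) as f:
            group = f.read().strip()
        if group not in seen_groups:
            seen_groups.add(group)
            chosen.append(cpu)
    return chosen

print(one_cpu_per_l2(psutil.Process().cpu_affinity()))
```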
Therefore, let’s aim for a design where users have enough power to specify the alignment they wish. Your point about environment variables (LLM_CPU_OMP_THREADS_BIND, and maybe something for memory policy, too) is definitely valid for that.