Cpu memory optimization #3845 (Closed)
Commits (17, all by cehongwang):

- 754743b  Enabled Qwen MoE with 1 layer. Rewrote index_put converter
- 2140c49  fixed the perf issue in the lowering pass
- a016bc0  Optimized index converter
- 6ea89ae  Fixed a typo in the converter. Covered the discontinuous tests
- c286767  Supported bool mask indicies
- 2540824  Delete one copy
- c7f8b12  Added an example that can compile on A40 with this PR but cannot unde…
- 711446c  Commented out for NVBug people to debug
- 35d5861  Reduced memory usage of use_python_runtime=True with the new API
- 503f320  ready for review
- 6b1950c  Revised according to comments
- 1e2e669  Cleared 2x+ dangling memory after compilation
- 33ca588  Added testcases and try catch
- d99f183  Revert back to support lazy init while reducing the memory consumption
- 66b40bd  Added a potential solution for windows
- 880b639  Revert windows solution. Not working
- fddc075  Added engine back-to-back break for CPU memory optimization
The first hunk adds the `psutil` import to the partitioner module:

```diff
@@ -1,6 +1,7 @@
 import logging
 from typing import Collection, Dict, List, Optional, Tuple

+import psutil
 import torch
 import torch.fx.passes.operator_support as ops
 from torch.fx.node import Target
```
The second hunk, in `partition_graph`, inserts the break logic and adds two helper methods:

```python
# @@ -225,13 +226,80 @@ def partition_graph(self) -> torch.fx.GraphModule:
        # Remove segments smaller than the block size (with exceptions)
        subgraphs = self.remove_small_acc_subgraphs(subgraphs)

        num_of_break = self.calculate_num_of_break(subgraphs)
        subgraphs = self.break_subgraphs(subgraphs, num_of_break=num_of_break)

        # Set the number of TRT engines to be generated
        self.num_trt_accelerated_subgraphs = len([s for s in subgraphs if s.is_acc])

        # Tag the accelerated nodes and split the graph accordingly
        self.tag(subgraphs)
        return self.split()

    def calculate_num_of_break(self, subgraphs: List[Subgraph]) -> int:
        """
        Calculates how many times each TRT subgraph should be broken,
        based on current memory pressure (process RSS vs. available memory).
        """
        rss = psutil.Process().memory_info().rss
        available_rss = psutil.virtual_memory().available
        num_of_graphs = len(subgraphs)
        if rss < available_rss * 0.3:
            num_of_graphs = 1
        elif rss < available_rss * 0.5:
            num_of_graphs = 2
        elif rss < available_rss:
            num_of_graphs = 4
        elif rss < available_rss * 1.5:
            num_of_graphs = 8
        elif rss < available_rss * 2:
            num_of_graphs = 16
        else:
            num_of_graphs = 32

        # If there are already graph breaks, break each TRT subgraph a few times.
        return max(1, num_of_graphs // ((len(subgraphs) + 1) // 2))

    def break_subgraphs(
        self, subgraphs: List[Subgraph], num_of_break: int = 1
    ) -> List[Subgraph]:
        """
        Breaks the subgraphs into smaller subgraphs at the specified
        frequency to save CPU memory.
        """
        num_of_sdpa_node = len(
            [node for node in self.acc_nodes if "scaled_dot" in str(node.target)]
        )
        break_period = num_of_sdpa_node // num_of_break + 1
        current_break_idx = 0
        current_num_break = 0
        new_subgraphs = []
        for subgraph in subgraphs:
            if subgraph.is_acc:
                for i, node in enumerate(subgraph.nodes):
                    if "scaled_dot" in str(node.target):
```
Reviewer comment (on the `scaled_dot` check): It's fine if we do this for testing, but we should really take a much more generic approach rather than assuming that SDPA is the only viable break point.
```python
                        current_num_break += 1
                        if current_num_break % break_period != 0:
                            continue
                        new_subgraphs.append(
                            Subgraph(
                                is_acc=True,
                                nodes=subgraph.nodes[current_break_idx : i + 1],
                                device_ordinal=subgraph.device_ordinal,
                            )
                        )
                        current_break_idx = i + 1
                new_subgraphs.append(
                    Subgraph(
                        is_acc=True,
                        nodes=subgraph.nodes[current_break_idx:],
                        device_ordinal=subgraph.device_ordinal,
                    )
                )
            else:
                new_subgraphs.append(subgraph)
        return new_subgraphs

    def starter_nodes(self) -> Tuple[NodeSet, NodeSet]:
        """Generates starter nodes for partitioning + segmentation"""
        # Starter accelerated nodes are all callable accelerated ops
```
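The period-based splitting in `break_subgraphs` can be simulated on a plain list: count the marker nodes, derive a break period, and cut the list after every `break_period`-th marker. A self-contained sketch, with strings standing in for fx nodes and a hypothetical helper name:

```python
from typing import List

def split_by_period(nodes: List[str], marker: str, num_of_break: int) -> List[List[str]]:
    """Mimic the PR's break logic on a flat list (hypothetical standalone
    version): cut after every break_period-th node containing `marker`."""
    num_markers = sum(1 for n in nodes if marker in n)
    break_period = num_markers // num_of_break + 1
    chunks: List[List[str]] = []
    start = 0
    seen = 0
    for i, n in enumerate(nodes):
        if marker in n:
            seen += 1
            if seen % break_period != 0:
                continue
            chunks.append(nodes[start : i + 1])  # close the current chunk here
            start = i + 1
    chunks.append(nodes[start:])  # remainder becomes the final chunk
    return chunks

nodes = ["linear", "sdpa", "add", "sdpa", "mul", "sdpa", "sdpa", "out"]
chunks = split_by_period(nodes, "sdpa", num_of_break=2)
```

With four `sdpa` nodes and `num_of_break=2`, the break period is `4 // 2 + 1 = 3`, so the list is cut once, after the third `sdpa`.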
Reviewer comment: I feel like this is too much of a heuristic-based system. A better approach, in my opinion, is to calculate a graph-size budget based on available memory (or, eventually, this could be user-specified). Then, for each TRT block, we estimate its size and decide how many subgraphs it should be split into to meet the budget.
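The budget-based alternative the reviewer describes could be sketched roughly as below. This is a hypothetical illustration of the proposal, not code from the PR; the function name, the block-size estimate, and the 25% budget fraction are all assumptions:

```python
import math

def num_splits_for_budget(block_size_bytes: int, available_bytes: int,
                          budget_fraction: float = 0.25) -> int:
    """Split a TRT block into enough pieces that each piece's estimated
    size fits within a memory budget (sketch of the reviewer's proposal)."""
    # Budget: a fixed fraction of available memory (could be user-specified).
    budget = max(1, int(available_bytes * budget_fraction))
    # Number of pieces needed so each piece fits in the budget.
    return max(1, math.ceil(block_size_bytes / budget))

# e.g. a 3 GiB block with 4 GiB available and a 25% budget (1 GiB) needs 3 splits
splits = num_splits_for_budget(3 * 2**30, 4 * 2**30)
```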