cpu memory optimization rebased to main #3868
base: main
Conversation
Force-pushed from b9b6aeb to 51f64f0
Make sure to add a link to the resource_management page in index.rst
> This shifts one model copy from GPU to CPU memory. As a result, peak GPU memory usage decreases to about **1×** the model size, while CPU memory usage increases by roughly **1×**.
This is a bit confusing; can we say it increases to roughly **2×** model size?
We cannot say that, because it depends on which other CPU memory optimization options users choose. Can we say "one more copy of the model will occupy CPU memory" to make it clearer?
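For context, a minimal usage sketch of the behavior the docs describe. The flag name `offload_module_to_cpu` is taken from this PR's discussion and is an assumption here; the exact kwarg in the merged API may differ.

```python
# Minimal sketch, assuming `offload_module_to_cpu` is the compile-time
# flag this PR introduces; the exact name/placement may differ.
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).eval().cuda()

# With offloading enabled, one copy of the module is moved to host
# memory during compilation: peak GPU usage stays near 1x the model
# size, while CPU memory holds one extra copy of the weights.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch.randn(1, 3, 224, 224, device="cuda")],
    offload_module_to_cpu=True,  # assumed flag name from this PR
)
```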
@needs_refit  # type: ignore[misc]
-def _insert_engine_to_cache(self, hash_val: str, serialized_engine: bytes) -> None:
+def _insert_engine_to_cache(self, hash_val: str, engine: trt.ICudaEngine) -> None:
@zewenli98 when do these calls run? Will this conflict with the goal of keeping memory usage under 3×?
Should we do caching in a post-processing step?
For example, we could return the cache entry as one of the InterpreterResult fields.
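To make the suggestion concrete, a rough sketch of the "cache entry as a result field" shape. The dataclass and field names below are stand-ins for illustration, not the actual torch_tensorrt types:

```python
# Sketch of the post-processing idea: the interpreter hands back what
# should be cached instead of inserting into the cache itself.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchInterpreterResult:
    serialized_engine: bytes
    # Hypothetical extra field carrying (hash_val, blob) to cache later.
    cache_entry: Optional[tuple[str, bytes]] = None


def post_process(result: SketchInterpreterResult, cache: dict[str, bytes]) -> None:
    # Cache insertion happens after interpretation has finished,
    # outside the interpreter, so it cannot add to build-time peak memory.
    if result.cache_entry is not None:
        hash_val, blob = result.cache_entry
        cache[hash_val] = blob


result = SketchInterpreterResult(
    serialized_engine=b"...", cache_entry=("abc123", b"...")
)
cache: dict[str, bytes] = {}
post_process(result, cache)
```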
        
          
py/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py (outdated; resolved)
        
settings: CompilationSettings = CompilationSettings(),
arg_inputs: Optional[Sequence[Input]] = None,
kwarg_inputs: Optional[dict[str, Any]] = None,
engine_cache: Optional[BaseEngineCache] = None,
We need to move the cache insert here
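One possible shape for moving the insert, sketched with toy stand-ins; none of the names below are the real torch_tensorrt internals:

```python
# Illustrative sketch of "move the cache insert here": the conversion
# entry point, not the interpreter, writes to the engine cache.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeInterpreterResult:
    serialized_engine: bytes


class DictEngineCache:
    """Toy cache keyed by graph hash; real code would use BaseEngineCache."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def insert(self, hash_val: str, blob: bytes) -> None:
        self._store[hash_val] = blob


def convert_module(
    hash_val: str,
    engine_cache: Optional[DictEngineCache] = None,
) -> FakeInterpreterResult:
    # Stand-in for interpretation producing a serialized engine.
    result = FakeInterpreterResult(serialized_engine=b"engine-bytes")
    # The insert now lives here, after interpretation has finished and
    # the TRT builder's workspace has been released.
    if engine_cache is not None:
        engine_cache.insert(hash_val, result.serialized_engine)
    return result


cache = DictEngineCache()
convert_module("abc123", engine_cache=cache)
```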
Force-pushed from 51f64f0 to f77df5a
    
Description
Fixes # (issue)
Type of change
Checklist: