Bug Description
CPU offloading is enabled by default in MTTM (MutableTorchTensorRTModule), which causes device-mismatch issues for embedding layers.
For example, here is the code for the VLM component of the GR00T model: https://github.com/NVIDIA/Isaac-GR00T/blob/main/gr00t/model/backbone/eagle2_hg_model/modeling_eagle2_5_vl.py#L235
Once the language model is compiled with MTTM, it is moved to the CPU, so this operation fails: the input_ids tensor is on the GPU while the embedding layer (self.embed_tokens) is on the CPU.
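A minimal, self-contained sketch of the failure mode, independent of Torch-TensorRT (the module, shapes, and tensor names are made up; only the CPU/GPU placement mirrors the scenario above):

import torch
import torch.nn as nn

# Hypothetical stand-in for the language model's embedding layer after the
# original module has been offloaded: its weights live on the CPU.
embed_tokens = nn.Embedding(num_embeddings=32000, embedding_dim=64)  # stays on CPU

# Inputs produced by the rest of the GPU-resident pipeline.
input_ids = torch.randint(0, 32000, (1, 16), device="cuda")

# Raises a device-mismatch RuntimeError because the embedding weight is on the
# CPU while input_ids is on cuda:0.
hidden_states = embed_tokens(input_ids)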
offload_module_to_cpu isn't supported in MTTM, so adding that support will fix this issue. The following works:
if self.additional_settings.get("offload_module_to_cpu", False):
    deallocate_module(self.original_model, delete_module=False)
But deallocate_module is used in multiple places, and those call sites need to be investigated.
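For reference, a sketch of how a caller might opt out of CPU offloading once the setting is plumbed through MTTM. The offload_module_to_cpu keyword and the way it is forwarded into additional_settings are assumptions based on the snippet above, not the current API:

import torch
import torch.nn as nn
import torch_tensorrt

# Hypothetical stand-in for the language model that would be compiled with MTTM.
language_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU()).cuda()

# Assumption: extra keyword arguments such as offload_module_to_cpu would be
# forwarded into MTTM's additional_settings once support for the flag is added.
trt_language_model = torch_tensorrt.MutableTorchTensorRTModule(
    language_model,
    enabled_precisions={torch.float16},
    offload_module_to_cpu=False,  # hypothetical flag: keep the original module on the GPU
)

out = trt_language_model(torch.randn(8, 64, device="cuda"))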
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- Torch-TensorRT Version (e.g. 1.0.0):
- PyTorch Version (e.g. 1.0):
- CPU Architecture:
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, libtorch, source):
- Build command you used (if compiling from source):
- Are you using local sources or building from archives:
- Python version:
- CUDA version:
- GPU models and configuration:
- Any other relevant information: