How to build TensorRT-LLM engine on host and deploy to Jetson Orin Nano Super? #3149
Comments
@Sesameisgod The TensorRT workflow requires an ahead-of-time (AoT) tuning phase to select the best combination of kernels, so although it is technically possible to build the TensorRT engine on another GPU with a similar hardware architecture, it is not the recommended approach. June
Thank you for your response! I'd like to follow up and ask: is there any recommended approach for building a TensorRT engine for Qwen2-VL-2B-Instruct directly on a Jetson Orin Nano Super (8GB RAM)? I've tested running the model via Hugging Face Transformers on the Nano and it works successfully, which suggests the model itself can run on the device. However, the OOM issue occurs during the TensorRT engine building phase. Are there any strategies (e.g., using swap) to make engine building feasible directly on the Nano?
@Sesameisgod I am not aware that the TensorRT engine building process can use swap memory during offline engine building. As an alternative, you can try running the Qwen2-VL model in the newly introduced PyTorch workflow:
The TensorRT-LLM PyTorch workflow has been available since the 0.17 release. Based on our internal performance evaluation on popular models such as LLaMA/Mistral/Mixtral, its performance is on par with (or even faster than) the TensorRT workflow: the customized kernels are reused in both workflows (as plugins in the TensorRT workflow and as torch custom ops in the PyTorch workflow), the existing C++ runtime building blocks (BatchManager, KV CacheManager, etc.) are shared by both, and more optimizations are being added to the PyTorch workflow. We are also shifting more attention to enhancing the PyTorch workflow; for example, the recently announced DeepSeek R1 performance numbers are all based on it. What we cannot commit to right now, due to bandwidth limitations, is official support for the Jetson platform, so you will need to try running TensorRT-LLM on Jetson yourself and observe the behavior. Thanks
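For orientation, here is a minimal sketch of what trying the PyTorch workflow with the high-level LLM API could look like. The PyTorch-backend import path and the SamplingParams arguments are assumptions that vary across releases (in 0.17 the backend was exposed under tensorrt_llm._torch), and Qwen2-VL's image inputs would need additional multimodal handling that this text-only sketch does not cover:

```python
# Sketch only: trying Qwen2-VL through the TensorRT-LLM PyTorch workflow.
# The import path below matches the 0.17-era PyTorch backend and may differ
# in other releases; the model ID and sampling settings are examples.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM  # assumption: 0.17-style PyTorch-backend import


def main():
    # Text-only prompting; Qwen2-VL image inputs need extra multimodal plumbing
    # that is not shown here.
    llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    prompts = ["Describe the Jetson Orin Nano in one sentence."]
    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```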
Got it, I'll try running Qwen2-VL with the PyTorch workflow on the Jetson Orin Nano Super and see how it performs.
@Sesameisgod To make sure you are aware of it, here is the Qwen2.5-VL effort from @yechank-nvidia: https://github.com/NVIDIA/TensorRT-LLM/pull/3156/files Thanks
Hi @Sesameisgod,
Thanks,
@juney-nvidia @sunnyqgg Also, as you mentioned, TensorRT-LLM v0.12 doesn't support Qwen2-VL, so I'm currently using the
If there's any progress later on, I'll be happy to share it here. Thanks again for your help!
Hi @Sesameisgod, I am doing the same thing as you, but I've run into a big problem. My board is a Jetson NX (8GB). I ran VILA-3B (MLC), which is quite fast, but its understanding ability is a bit poor, so now I am trying to run Qwen2.5-VL and Qwen2-VL. TensorRT-LLM only provides a v0.12 release for Jetson, and setting up that environment is very troublesome for me. Unfortunately, the TensorRT-LLM Docker image you mentioned cannot be pulled from inside China.
Hi @xiaohuihuige,
Hi @Sesameisgod, if I pull it on my Jetson Orin Nano 8GB, can I convert Qwen2-VL-2B to a TensorRT engine and run inference with it?
Hi @Sesameisgod, can you tell me how you installed the latest version of TensorRT-LLM on your Jetson Orin Nano?
Hi @garvitpathak, I didn't install the latest version of TensorRT-LLM directly on the Jetson Orin Nano. Instead, I used the ARM64 image provided by Trystan on Docker Hub (https://hub.docker.com/r/trystan/tensorrt_llm/tags). You can start by pulling the image with:

docker pull trystan/tensorrt_llm:aarch64-0.17.0.post1_90

Then I used jetson-containers to run the container, which saves the trouble of manually setting a lot of parameters. You can do it like this:

# Navigate to the jetson-containers repo
cd jetson-containers
# Use run.sh to start the container
./run.sh trystan/tensorrt_llm:aarch64-0.17.0.post1_90

Once you're inside the container, try running:

python3 -c "import tensorrt_llm"

If everything is set up correctly, it should print the version number of TensorRT-LLM (which is 0.17.0 in this container). The official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) also includes instructions on how to build the image from source, but I haven't tried that on the Jetson Orin Nano yet. Unfortunately, as I mentioned earlier, I ran into an OOM issue and couldn't successfully convert the original-precision Qwen2-VL model to an .engine file. I also haven't tested whether inference actually works; so far, I've only confirmed that importing tensorrt_llm succeeds. Maybe you can try building an INT4 model as suggested by sunnyqgg. That's my current progress. Hope it helps!
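For a slightly more informative sanity check than the bare import, something like the following should also work inside the container (just a quick sketch, not taken from the official docs): it prints the reported TensorRT-LLM version and whether PyTorch can see the Orin GPU.

```python
# Quick sanity check inside the container: report the TensorRT-LLM version
# and confirm that PyTorch can see the Orin GPU.
import tensorrt_llm
import torch

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```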
Hi @Sesameisgod, can you help me with this on my Jetson Orin Nano 8GB? An INT4 version is not available for Qwen2-VL-2B; if you know how to do the conversion, could you provide a link?
Hi @xiaohuihuige,
Hi @garvitpathak, BTW, I've successfully deployed a Docker image with the latest version of TensorRT-LLM on the Jetson Orin Nano by following the official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
Hi @Sesameisgod, can you tell me how you created the swap memory? Step-by-step instructions would be appreciated.
Hi, I’m currently working with TensorRT-LLM and trying to deploy a model (e.g., Qwen2-VL-2B-Instruct) on a Jetson Orin Nano Super. However, due to limited memory on the Nano, I’m unable to build the TensorRT engine directly on the device.
Is there any official or recommended approach to build the TensorRT-LLM engine on a more powerful host machine (with sufficient memory and GPU), and then transfer the generated engine file to the Jetson Orin Nano Super for inference?
If so, are there any considerations or compatibility issues I should be aware of when cross-building the engine on x86 and deploying it on Jetson (aarch64)?
Thanks in advance!