How to build TensorRT-LLM engine on host and deploy to Jetson Orin Nano Super? #3149

Sesameisgod opened this issue Mar 29, 2025 · 17 comments

@Sesameisgod

Hi, I’m currently working with TensorRT-LLM and trying to deploy a model (e.g., Qwen2-VL-2B-Instruct) on a Jetson Orin Nano Super. However, due to limited memory on the Nano, I’m unable to build the TensorRT engine directly on the device.

Is there any official or recommended approach to build the TensorRT-LLM engine on a more powerful host machine (with sufficient memory and GPU), and then transfer the generated engine file to the Jetson Orin Nano Super for inference?

If so, are there any considerations or compatibility issues I should be aware of when cross-building the engine on x86 and deploying it on Jetson (aarch64)?

Thanks in advance!

@juney-nvidia
Collaborator

@Sesameisgod
Hi, TensorRT-LLM has two backends now: one based on TensorRT (the first workflow supported in TensorRT-LLM) and the other based on PyTorch (the new workflow supported since the 0.17 release).

For the TensorRT workflow, engine building requires an AoT (ahead-of-time) tuning phase to select the best combination of kernels, so although it is technically possible to build the TensorRT engine on another GPU with a similar hardware architecture, it is not the recommended way.
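
For context, that offline build is typically a two-step flow; a rough sketch only (the convert script location under examples/ and the exact flags vary by model family and TensorRT-LLM release):

# Step 1: convert the Hugging Face checkpoint into TensorRT-LLM checkpoint format
python3 convert_checkpoint.py --model_dir ./Qwen2-VL-2B-Instruct \
    --output_dir ./tllm_checkpoint --dtype float16

# Step 2: build the engine; this is the AoT tuning step, so it should run on the
# same (or a very similar) GPU architecture as the deployment target
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./qwen2vl_engine --gemm_plugin float16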

June

@juney-nvidia self-assigned this Mar 29, 2025
@juney-nvidia added the question and triaged labels Mar 29, 2025
@juney-nvidia
Collaborator

@sunnyqgg for visibility, in case she may have more input on this question.

June

@Sesameisgod
Author

Thank you for your response!

I’d like to follow up and ask — is there any recommended approach for building a TensorRT engine for Qwen2-VL-2B-Instruct directly on a Jetson Orin Nano Super (8GB RAM)?

I’ve tested running the model via Hugging Face Transformers on the Nano and it works successfully, which suggests that the model can run on the device.

However, the OOM issue occurs during the TensorRT engine building phase. Are there any strategies (e.g., using swap) to make engine building feasible directly on the Nano?

@juney-nvidia
Collaborator

juney-nvidia commented Mar 30, 2025

@Sesameisgod I am not aware that the TensorRT engine building process can use swap memory during offline engine building.

An alternative is to try running the Qwen2-VL model in the newly introduced PyTorch workflow:

The TensorRT-LLM PyTorch workflow has been available since the 0.17 release. Based on our internal performance evaluation on popular models like LLaMA/Mistral/Mixtral, the PyTorch workflow's performance is on par with (or even faster than) the TensorRT workflow. The customized kernels are reused in both workflows (as plugins in the TensorRT workflow and as torch custom ops in the PyTorch workflow), the existing C++ runtime building blocks (BatchManager, KV CacheManager, etc.) are shared between them as well, and more optimizations are being added to the PyTorch workflow.

We are also shifting more of our attention to enhancing the PyTorch workflow; for example, the recently announced DeepSeek R1 performance numbers are all based on it.

One thing we cannot commit to right now, due to bandwidth limitations, is official support for the Jetson platform. So you will need to try running TensorRT-LLM on Jetson yourself and observe the behavior.
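
If you want to experiment with it, here is a minimal text-only sketch using the high-level LLM API. Treat it as a sketch only: the import path has moved between releases (in 0.17-era builds the PyTorch-backend LLM class lives under tensorrt_llm._torch), and Qwen2-VL image inputs additionally require the model-specific multimodal example scripts.

# Minimal text-only sketch of the PyTorch workflow via the high-level LLM API.
# No separate engine-build step; the Hugging Face checkpoint is loaded directly.
# NOTE: on 0.17-era builds the import may instead be: from tensorrt_llm._torch import LLM
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")   # HF model id or a local path
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["Describe the Jetson Orin Nano Super in one sentence."],
                       sampling_params=params)
for out in outputs:
    print(out.outputs[0].text)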

Thanks
June

@Sesameisgod
Author

Got it, I’ll try running Qwen2-VL with the PyTorch workflow on the Jetson Orin Nano Super and see how it performs.
Really appreciate your help!

@juney-nvidia
Collaborator

@Sesameisgod to ensure you are aware of this Qwen2.5-VL effort from @yechank-nvidia

https://github.com/NVIDIA/TensorRT-LLM/pull/3156/files

Thanks
June

@sunnyqgg
Collaborator

Hi @Sesameisgod

  1. You can use swap memory during the engine building process, but in my experience the system locks up if you allocate more than 8GB. TRT engine generation requires roughly 4 times as much memory as the model size.
  2. You can try building a W4A16 (INT4) engine (a rough sketch follows this list).
  3. As discussed, you can use the same TRT and TRT-LLM versions on a Jetson Orin (64GB) to build the engine and then run it on the Jetson Orin Nano Super.
  4. To save memory during the inference phase, you can use -mmap; please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md#3-reference-memory-usage
  5. We actually have a branch for Jetson devices: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson, but unfortunately it doesn't support Qwen2-VL.
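
For item 2, a rough sketch of the weight-only INT4 conversion, assuming the convert_checkpoint.py for your model family accepts the standard weight-only flags (the exact script location and flag support vary by release):

# Same two-step flow as usual, but with weight-only INT4 enabled at conversion time
python3 convert_checkpoint.py --model_dir ./Qwen2-VL-2B-Instruct \
    --output_dir ./tllm_ckpt_int4 --dtype float16 \
    --use_weight_only --weight_only_precision int4
# ...then build the engine from ./tllm_ckpt_int4 with trtllm-build as usual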

Thanks,
Sunny.

@Sesameisgod
Author

@juney-nvidia
That's great to hear! It's really nice to see support for the latest models coming so quickly. Qwen is truly an impressive VLM series.

@sunnyqgg
Understood, thank you. For now, we're not considering quantization yet. We're planning to explore building the engine on a Jetson device with 64GB of memory — so far, it seems like the Jetson AGX Orin is the only one that meets this requirement, and we plan to purchase one for testing.

Also, as you mentioned, TensorRT-LLM v0.12 doesn't support Qwen2-VL, so I'm currently using the aarch64 Docker image from this repository. After running the container and executing the import tensorrt_llm command, the version appears correctly, so it seems to be running, though further testing is needed to confirm whether everything works properly.

If there’s any progress later on, I’ll be happy to share it here. Thanks again for your help!

@xiaohuihuige

Hi @Sesameisgod, I am doing the same thing as you, but I have run into a big problem. My board is a Jetson NX (8GB), and I ran VILA-3B (MLC), which is quite fast, but its understanding ability is a bit poor. Now I am trying to run Qwen2.5-VL and Qwen2-VL. TensorRT-LLM only provides the v0.12 version for Jetson, and setting up the environment is very troublesome for me. Unfortunately, the TensorRT-LLM Docker image you mentioned cannot be pulled from within China.

@Sesameisgod
Author

Hi @xiaohuihuige ,
Have you tried using a VPN? Or maybe I can share the Docker image with you.
However, I'm not quite sure what the best way to transfer it would be.

@garvitpathak

Hi @Sesameisgod, if I pull it on my Jetson Orin Nano 8GB, can I convert Qwen2-VL-2B to TensorRT and run inference with it?

@garvitpathak

Hi @Sesameisgod, can you tell me how you installed the latest version of TensorRT-LLM on your Jetson Orin Nano?

@Sesameisgod
Author

Hi @garvitpathak,

I didn’t install the latest version of TensorRT-LLM directly on the Jetson Orin Nano. Instead, I used the ARM64 image provided by Trystan on Docker Hub (https://hub.docker.com/r/trystan/tensorrt_llm/tags). You can start by pulling the image with:

docker pull trystan/tensorrt_llm:aarch64-0.17.0.post1_90

Then, I used jetson-containers to run the container, which saves the trouble of manually setting a lot of parameters. You can do it like this:

# Navigate to the jetson-containers repo
cd jetson-containers
# Use run.sh to start the container
./run.sh trystan/tensorrt_llm:aarch64-0.17.0.post1_90

Once you're inside the container, try running:

python3 -c "import tensorrt_llm"

If everything is set up correctly, it should print the version number of TensorRT-LLM (which is 0.17.0 in this container).

The official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) also includes instructions on how to build the image from source, but I haven’t tried that on the Jetson Orin Nano yet.

Unfortunately, as I mentioned earlier, I ran into an OOM issue and couldn't successfully convert the original-precision Qwen2-VL model to an .engine file. I also haven't tested whether inference actually works — so far, I've only confirmed that importing tensorrt_llm succeeds. Maybe you can try building an INT4 model as suggested by sunnyqgg.

That’s my current progress — hope it helps!

@garvitpathak

garvitpathak commented Apr 9, 2025

Hi @Sesameisgod, can you help me with this on the Jetson Orin Nano 8GB? An INT4 version doesn't seem to be available for Qwen2-VL-2B; if there is one, can you provide me a link for the conversion?

@garvitpathak

garvitpathak commented Apr 11, 2025

Hi @xiaohuihuige ,
I am trying to run VILA-3B, but I am facing a llama/llava KeyError while converting to TensorRT for quantization. Can you help me with that?

@Sesameisgod
Author

Hi @garvitpathak,
The official team has released an INT4 version of the model using GPTQ on Hugging Face (https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4), but I haven’t personally tried this model yet.

BTW, I’ve successfully deployed a Docker image with the latest version of TensorRT-LLM on the Jetson Orin Nano using the official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
For reference, I’m using a 1TB SSD as the system drive along with 64GB of SWAP.

@garvitpathak

Hi @Sesameisgod, can you tell me how you created the swap memory? Step-by-step instructions would be appreciated.
