Commit b885b51

Add Multi-GPU inference note in deployment apps

2 parents 2d79280 + f8d5164
File tree

4 files changed: +29 −5 lines changed

docs/webapp/applications/apps_embed_model_deployment.md

Lines changed: 7 additions & 1 deletion

@@ -92,7 +92,13 @@ values from the file, which can be modified before launching the app instance
 * **Instance name** - Name for the Embedding Model Deployment instance. This will appear in the instance list
 * **Service Project** - ClearML Project where your Embedding Model Deployment app instance will be stored
 * **Queue** - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the Embedding Model
-  Deployment app instance task will be enqueued (make sure an agent is assigned to it)
+  Deployment app instance task will be enqueued. Make sure an agent is assigned to that queue.
+
+  :::tip Multi-GPU inference
+  To run multi-GPU inference, ensure the queue's pod specification (from the base template and/or `templateOverrides`) defines multiple GPUs. See [GPU Queues with Shared Memory](../../clearml_agent/clearml_agent_custom_workload.md#example-gpu-queues-with-shared-memory)
+  for an example configuration of a queue that allocates multiple GPUs and shared memory.
+  :::
+
 * **AI Gateway Route** - Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
 * **Model Configuration**
   * Model - A ClearML Model ID or a Hugging Face model name (e.g. `openai-community/gpt2`)

docs/webapp/applications/apps_llama_deployment.md

Lines changed: 8 additions & 2 deletions

@@ -88,8 +88,14 @@ values from the file, which can be modified before launching the app instance
 * **Service Project (Access Control)**: The ClearML project where the app instance is created. Access is determined by
   project-level permissions (i.e. users with read access can use the app instance).
 * **Queue**: The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the
-  llama.cpp Model Deployment app instance task will be enqueued (make sure an agent is assigned to it)
-  **AI Gateway Route**: Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
+  llama.cpp Model Deployment app instance task will be enqueued. Make sure an agent is assigned to that queue.
+
+  :::tip Multi-GPU inference
+  To run multi-GPU inference, ensure the queue's pod specification (from the base template and/or `templateOverrides`) defines multiple GPUs. See [GPU Queues with Shared Memory](../../clearml_agent/clearml_agent_custom_workload.md#example-gpu-queues-with-shared-memory)
+  for an example configuration of a queue that allocates multiple GPUs and shared memory.
+  :::
+
+* **AI Gateway Route**: Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
 * **Model Configuration**: Configure the behavior and performance of the model serving engine.
   * CLI: Llama.cpp CLI arguments. If set, these arguments will be passed to Llama.cpp and all following entries will be
     ignored, except for the `Model` field.

docs/webapp/applications/apps_model_deployment.md

Lines changed: 7 additions & 1 deletion

@@ -91,7 +91,13 @@ values from the file, which can be modified before launching the app instance
 * **Service Project (Access Control)**: The ClearML project where the app instance is created. Access is determined by
   project-level permissions (i.e. users with read access can use the app).
 * **Queue**: The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the vLLM Model Deployment app
-  instance task will be enqueued (make sure an agent is assigned to that queue)
+  instance task will be enqueued. Make sure an agent is assigned to that queue.
+
+  :::tip Multi-GPU inference
+  To run multi-GPU inference, ensure the queue's pod specification (from the base template and/or `templateOverrides`) defines multiple GPUs. See [GPU Queues with Shared Memory](../../clearml_agent/clearml_agent_custom_workload.md#example-gpu-queues-with-shared-memory)
+  for an example configuration of a queue that allocates multiple GPUs and shared memory.
+  :::
+
 * **AI Gateway Route**: Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
 * **Model Configuration**: Configure the behavior and performance of the model engine.
   * Trust Remote Code: Select to set Hugging Face [`trust_remote_code`](https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#trustremotecode)

docs/webapp/applications/apps_sglang.md

Lines changed: 7 additions & 1 deletion

@@ -90,7 +90,13 @@ values from the file, which can be modified before launching the app instance
 * **Service Project - Access Control** - The ClearML project where the app instance is created. Access is determined by
   project-level permissions (i.e. users with read access can use the app).
 * **Queue** - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the SGLang Model Deployment app
-  instance task will be enqueued (make sure an agent is assigned to that queue)
+  instance task will be enqueued. Make sure an agent is assigned to that queue.
+
+  :::tip Multi-GPU inference
+  To run multi-GPU inference, ensure the queue's pod specification (from the base template and/or `templateOverrides`) defines multiple GPUs. See [GPU Queues with Shared Memory](../../clearml_agent/clearml_agent_custom_workload.md#example-gpu-queues-with-shared-memory)
+  for an example configuration of a queue that allocates multiple GPUs and shared memory.
+  :::
+
 * **AI Gateway Route** - Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
 * **Model** - A ClearML Model ID or a HuggingFace model name (e.g. `openai-community/gpt2`)
 * **Model Configuration**: Configure the behavior and performance of the language model engine. This allows you to
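The tip added across all four files points at a queue whose Kubernetes pod template requests multiple GPUs plus an enlarged shared-memory volume. A minimal sketch of what such a queue definition might look like in the ClearML Agent Helm values is shown below; the queue name `multi-gpu` and the resource sizes are illustrative assumptions, not values taken from the linked page:

```yaml
# Hypothetical ClearML Agent (Kubernetes) Helm values fragment.
# Queue name and sizes are examples only; adjust to your cluster.
agentk8sglue:
  createQueues: true
  queues:
    multi-gpu:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2        # pod is allocated two GPUs for multi-GPU inference
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm      # enlarged shared memory for inter-GPU tensor exchange
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory         # RAM-backed volume replacing the default 64Mi /dev/shm
              sizeLimit: 16Gi
```

Enqueuing a deployment app instance to such a queue gives the serving engine both GPUs in one pod, which is what tensor-parallel inference backends typically require.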
