
Commit 9f6abaf

[#9640][feat] Migrate model registry to v2.0 format with composable configs (#9836)
Signed-off-by: Tal Cherckez <[email protected]>
1 parent 7b51e3c commit 9f6abaf

17 files changed: 472 additions, 0 deletions

examples/auto_deploy/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ benchmark_results.json
 *.yaml
 !nano_v3.yaml
 !nemotron_flash.yaml
+!model_registry/configs/*.yaml
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# AutoDeploy Model Registry

The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.

## Format

**Version: 2.0** (Flat format with composable configurations)

### Structure

```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]

- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]

- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```

### Key Concepts

- **Flat list**: Models are in a single flat list (not grouped)
- **Composable configs**: Each model references YAML config files via `yaml_extra`
- **Deep merging**: Config files are merged in order (later files override earlier ones)
- **No inline args**: All configuration is in YAML files for reusability

## Configuration Files

Config files are stored in the `configs/` subdirectory and define runtime parameters:

### Core Configs

| File | Purpose | Example Use |
|------|---------|-------------|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Defines `tensor_parallel_size` |

### Runtime Configs

| File | Purpose |
|------|---------|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |

### Model-Specific Configs

| File | Purpose |
|------|---------|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (reduced sequence length) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |

## Adding a New Model

### Simple Model (Standard Config)

```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```

### Model with Special Requirements

```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```

### Model with Custom Config

1. Create `configs/my_model.yaml`:

```yaml
# Custom settings for my model
max_batch_size: 2048
kv_cache_free_gpu_memory_fraction: 0.95
cuda_graph_config:
  enable_padding: true
```

2. Reference it in `models.yaml`:

```yaml
- name: organization/my-custom-model
  yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
```

## Config Merging

Configs are merged in order. Example:

```yaml
yaml_extra:
  - dashboard_default.yaml   # baseline: runtime=trtllm, benchmark_enabled=true
  - world_size_2.yaml        # adds: tensor_parallel_size=2
  - openelm.yaml             # overrides: tokenizer=llama-2, benchmark_enabled=false
```

**Result**: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`

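Conceptually, this is a recursive ("deep") dictionary merge in which later files win for scalar values and nested sections are merged key by key. A minimal sketch of that composition, assuming hypothetical helper names and PyYAML (the authoritative logic lives in the dashboard's `prepare_model_coverage_v2.py` and may differ in detail):

```python
from pathlib import Path

import yaml  # PyYAML


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; values from `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # later file overrides earlier value
    return merged


def resolve_model_config(yaml_extra: list, configs_dir: Path) -> dict:
    """Compose a model's runtime config by merging its yaml_extra files in order."""
    resolved = {}
    for filename in yaml_extra:
        loaded = yaml.safe_load((configs_dir / filename).read_text()) or {}
        resolved = deep_merge(resolved, loaded)
    return resolved


# Reproduces the example above, assuming the documented file contents:
cfg = resolve_model_config(
    ["dashboard_default.yaml", "world_size_2.yaml", "openelm.yaml"],
    Path("examples/auto_deploy/model_registry/configs"),
)
```
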
## World Size Guidelines

| World Size | Model Size Range | Example Models |
|------------|------------------|----------------|
| 1 | < 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |

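The same guidance in code form, as a hypothetical convenience helper (not part of the repository; the thresholds simply mirror the table above and boundary cases remain a judgment call):

```python
def suggest_world_size(params_in_billions: float) -> int:
    """Suggest which world_size_N.yaml to use, per the guideline table."""
    if params_in_billions < 2:
        return 1
    if params_in_billions <= 15:
        return 2
    if params_in_billions <= 80:
        return 4
    return 8


# e.g. suggest_world_size(8) -> 2 (Llama 3.1 8B), suggest_world_size(70) -> 4
```
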
## Model Coverage

The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.

**To verify current model counts and coverage:**

```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/model_coverage.yaml

# View summary
grep -E "^- name:|yaml_extra:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml | wc -l
```

When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.

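For a quick local summary without the dashboard tooling, a short script along these lines works as a rough check (a sketch assuming the flat v2.0 structure shown above and that each model lists exactly one `world_size_N.yaml`; `prepare_model_coverage_v2.py` remains the authoritative tool):

```python
import re
from collections import Counter

import yaml  # PyYAML

# Count registered models and how they spread across world sizes.
with open("examples/auto_deploy/model_registry/models.yaml") as f:
    registry = yaml.safe_load(f)

counts = Counter()
for model in registry["models"]:
    for extra in model.get("yaml_extra", []):
        match = re.fullmatch(r"world_size_(\d+)\.yaml", extra)
        if match:
            counts[int(match.group(1))] += 1

print(f"{len(registry['models'])} models total")
for world_size in sorted(counts):
    print(f"  world_size={world_size}: {counts[world_size]} models")
```
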
## Best Practices

1. **Always include `dashboard_default.yaml` first** - it provides baseline settings
1. **Always include a `world_size_N.yaml`** - defines GPU count
1. **Add special configs after world_size** - they override defaults
1. **Create reusable configs** - if 3+ models need the same settings, make a config file
1. **Use model-specific configs sparingly** - only for unique requirements
1. **Test before committing** - verify with `prepare_model_coverage_v2.py` (a quick local check is sketched after this list)

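As a lightweight pre-commit sanity check, a sketch like the following verifies that every file referenced in `yaml_extra` actually exists under `configs/` (hypothetical script, assuming the registry layout shown above; it does not replace `prepare_model_coverage_v2.py`):

```python
from pathlib import Path

import yaml  # PyYAML

registry_dir = Path("examples/auto_deploy/model_registry")
registry = yaml.safe_load((registry_dir / "models.yaml").read_text())

# Collect (model, config) pairs whose referenced config file is missing.
missing = [
    (model["name"], extra)
    for model in registry["models"]
    for extra in model.get("yaml_extra", [])
    if not (registry_dir / "configs" / extra).is_file()
]

for name, extra in missing:
    print(f"{name}: missing config {extra}")
raise SystemExit(1 if missing else 0)
```
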
## Testing Changes

```bash
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/test_workload.yaml

# Verify output
cat /tmp/test_workload.yaml
```

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Default configuration for all AutoDeploy dashboard tests
# These are baseline settings that apply to all models unless overridden

runtime: trtllm
attn_backend: flashinfer
compile_backend: torch-compile
model_factory: AutoModelForCausalLM
skip_loading_weights: false
max_seq_len: 512
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Configuration for DeepSeek V3 and R1 with reduced layers
# Full models are too large, so we test with limited layers
model_kwargs:
  num_hidden_layers: 10
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Configuration for DemoLLM runtime with Triton backend
# Used for experimental or specific model requirements
runtime: demollm
attn_backend: triton
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Configuration for Gemma 3 1B model
# Specific sequence length requirement due to small attention window
max_seq_len: 511
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Configuration for Llama 3.3 70B
# AutoDeploy-specific settings for large Llama models

max_batch_size: 1024
max_num_tokens: 2048
free_mem_ratio: 0.9
trust_remote_code: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
kv_cache_config:
  dtype: fp8
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Configuration for Llama 4 Maverick with reduced layers
# Full model is too large for testing
model_kwargs:
  text_config:
    num_hidden_layers: 5
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Configuration for Llama 4 Scout (VLM)
# AutoDeploy-specific settings for Llama 4 Scout MoE vision model

max_batch_size: 1024
max_num_tokens: 2048
free_mem_ratio: 0.9
trust_remote_code: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
kv_cache_config:
  dtype: fp8
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Configuration for multimodal (vision + text) models
model_factory: AutoModelForImageTextToText

0 commit comments
