
Commit 9f6abaf

[#9640][feat] Migrate model registry to v2.0 format with composable configs (#9836)
Signed-off-by: Tal Cherckez <[email protected]>
1 parent 7b51e3c commit 9f6abaf

17 files changed: 472 additions, 0 deletions

examples/auto_deploy/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ benchmark_results.json
 *.yaml
 !nano_v3.yaml
 !nemotron_flash.yaml
+!model_registry/configs/*.yaml
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# AutoDeploy Model Registry

The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.

## Format

**Version: 2.0** (Flat format with composable configurations)

### Structure

```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]

- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]

- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```

### Key Concepts

- **Flat list**: Models are in a single flat list (not grouped)
- **Composable configs**: Each model references YAML config files via `yaml_extra`
- **Deep merging**: Config files are merged in order (later files override earlier ones)
- **No inline args**: All configuration is in YAML files for reusability

## Configuration Files

Config files are stored in the `configs/` subdirectory and define runtime parameters:

### Core Configs

| File | Purpose | Example Use |
|------|---------|-------------|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Defines `tensor_parallel_size` |

### Runtime Configs

| File | Purpose |
|------|---------|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |

### Model-Specific Configs

| File | Purpose |
|------|---------|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (reduced sequence length) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |

## Adding a New Model

### Simple Model (Standard Config)

```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```

### Model with Special Requirements

```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```

### Model with Custom Config

1. Create `configs/my_model.yaml`:

```yaml
# Custom settings for my model
max_batch_size: 2048
kv_cache_free_gpu_memory_fraction: 0.95
cuda_graph_config:
  enable_padding: true
```

2. Reference it in `models.yaml`:

```yaml
- name: organization/my-custom-model
  yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
```

## Config Merging

Configs are merged in order. Example:

```yaml
yaml_extra:
  - dashboard_default.yaml   # baseline: runtime=trtllm, benchmark_enabled=true
  - world_size_2.yaml        # adds: tensor_parallel_size=2
  - openelm.yaml             # overrides: tokenizer=llama-2, benchmark_enabled=false
```

**Result**: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`

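Conceptually, this is a recursive ("deep") dictionary merge in which later files win for scalar values and nested sections are merged key by key. A minimal sketch of that composition, assuming hypothetical helper names and PyYAML (the authoritative logic lives in the dashboard's `prepare_model_coverage_v2.py` and may differ in detail):

```python
from pathlib import Path

import yaml  # PyYAML


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; values from `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # later file overrides earlier value
    return merged


def resolve_model_config(yaml_extra: list, configs_dir: Path) -> dict:
    """Compose a model's runtime config by merging its yaml_extra files in order."""
    resolved = {}
    for filename in yaml_extra:
        loaded = yaml.safe_load((configs_dir / filename).read_text()) or {}
        resolved = deep_merge(resolved, loaded)
    return resolved


# Reproduces the example above, assuming the documented file contents:
cfg = resolve_model_config(
    ["dashboard_default.yaml", "world_size_2.yaml", "openelm.yaml"],
    Path("examples/auto_deploy/model_registry/configs"),
)
```
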
## World Size Guidelines

| World Size | Model Size Range | Example Models |
|------------|------------------|----------------|
| 1 | < 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |

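The same guidance in code form, as a hypothetical convenience helper (not part of the repository; the thresholds simply mirror the table above and boundary cases remain a judgment call):

```python
def suggest_world_size(params_in_billions: float) -> int:
    """Suggest which world_size_N.yaml to use, per the guideline table."""
    if params_in_billions < 2:
        return 1
    if params_in_billions <= 15:
        return 2
    if params_in_billions <= 80:
        return 4
    return 8


# e.g. suggest_world_size(8) -> 2 (Llama 3.1 8B), suggest_world_size(70) -> 4
```
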
## Model Coverage

The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.

**To verify current model counts and coverage:**

```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/model_coverage.yaml

# View summary
grep -E "^- name:|yaml_extra:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml | wc -l
```

When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.

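For a quick local summary without the dashboard tooling, a short script along these lines works as a rough check (a sketch assuming the flat v2.0 structure shown above and that each model lists exactly one `world_size_N.yaml`; `prepare_model_coverage_v2.py` remains the authoritative tool):

```python
import re
from collections import Counter

import yaml  # PyYAML

# Count registered models and how they spread across world sizes.
with open("examples/auto_deploy/model_registry/models.yaml") as f:
    registry = yaml.safe_load(f)

counts = Counter()
for model in registry["models"]:
    for extra in model.get("yaml_extra", []):
        match = re.fullmatch(r"world_size_(\d+)\.yaml", extra)
        if match:
            counts[int(match.group(1))] += 1

print(f"{len(registry['models'])} models total")
for world_size in sorted(counts):
    print(f"  world_size={world_size}: {counts[world_size]} models")
```
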
## Best Practices

1. **Always include `dashboard_default.yaml` first** - it provides baseline settings
1. **Always include a `world_size_N.yaml`** - defines GPU count
1. **Add special configs after world_size** - they override defaults
1. **Create reusable configs** - if 3+ models need the same settings, make a config file
1. **Use model-specific configs sparingly** - only for unique requirements
1. **Test before committing** - verify with `prepare_model_coverage_v2.py` (a quick local check is sketched after this list)

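As a lightweight pre-commit sanity check, a sketch like the following verifies that every file referenced in `yaml_extra` actually exists under `configs/` (hypothetical script, assuming the registry layout shown above; it does not replace `prepare_model_coverage_v2.py`):

```python
from pathlib import Path

import yaml  # PyYAML

registry_dir = Path("examples/auto_deploy/model_registry")
registry = yaml.safe_load((registry_dir / "models.yaml").read_text())

# Collect (model, config) pairs whose referenced config file is missing.
missing = [
    (model["name"], extra)
    for model in registry["models"]
    for extra in model.get("yaml_extra", [])
    if not (registry_dir / "configs" / extra).is_file()
]

for name, extra in missing:
    print(f"{name}: missing config {extra}")
raise SystemExit(1 if missing else 0)
```
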
## Testing Changes

```bash
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/test_workload.yaml

# Verify output
cat /tmp/test_workload.yaml
```

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Default configuration for all AutoDeploy dashboard tests
# These are baseline settings that apply to all models unless overridden

runtime: trtllm
attn_backend: flashinfer
compile_backend: torch-compile
model_factory: AutoModelForCausalLM
skip_loading_weights: false
max_seq_len: 512
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Configuration for DeepSeek V3 and R1 with reduced layers
# Full models are too large, so we test with limited layers
model_kwargs:
  num_hidden_layers: 10
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Configuration for DemoLLM runtime with Triton backend
# Used for experimental or specific model requirements
runtime: demollm
attn_backend: triton
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Configuration for Gemma 3 1B model
# Specific sequence length requirement due to small attention window
max_seq_len: 511
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Configuration for Llama 3.3 70B
# AutoDeploy-specific settings for large Llama models

max_batch_size: 1024
max_num_tokens: 2048
free_mem_ratio: 0.9
trust_remote_code: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
kv_cache_config:
  dtype: fp8
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Configuration for Llama 4 Maverick with reduced layers
# Full model is too large for testing
model_kwargs:
  text_config:
    num_hidden_layers: 5
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Configuration for Llama 4 Scout (VLM)
# AutoDeploy-specific settings for Llama 4 Scout MoE vision model

max_batch_size: 1024
max_num_tokens: 2048
free_mem_ratio: 0.9
trust_remote_code: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
kv_cache_config:
  dtype: fp8
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Configuration for multimodal (vision + text) models
model_factory: AutoModelForImageTextToText

0 commit comments
