---
layout: post
title: "AMD × vLLM Semantic Router: Building the System Intelligence Together"
author: "The AMD and vLLM Semantic Router Team"
image: /assets/logos/vllm-logo-text-light.png
---

## Introduction

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring **vLLM Semantic Router (VSR)** to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.

AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: **intelligent routing and governance for Mixture-of-Models (MoM) systems**.

As AI moves from single models to multi-model architectures, the challenge is no longer "how big is your model" but **how intelligently and safely you orchestrate many models together**. VSR is designed to be the **intelligent control plane** for this new era—making routing decisions based on semantic understanding, enforcing safety policies, and maintaining trust as systems scale toward AGI-level capabilities.

![](/assets/figures/semantic-router/amd-0.png)

This collaboration focuses on three strategic pillars:

1. **Signal-Based Routing**: Intelligent request routing using keyword matching, domain classification, semantic similarity, and fact-checking for Multi-LoRA and multi-model deployments
2. **Cross-Instance Intelligence**: Shared state and optimization across vLLM instances through centralized response storage and semantic caching
3. **Guardrails & Governance**: Enterprise-grade security from PII detection and jailbreak prevention to hallucination detection and alignment enforcement

Together with AMD, we're building VSR to run efficiently on AMD GPUs while establishing a new standard for **trustworthy, governable AI infrastructure**.

## The Shift: From Single Models to Mixture-of-Models

In a Mixture-of-Models world, an enterprise AI stack typically includes:

- **Router SLMs** (small language models) that classify, route, and enforce policy
- **Multiple LLMs** and domain-specific models (e.g., code, finance, healthcare, legal)
- **Tools, RAG pipelines**, vector search, and business systems

Without a robust routing layer, this becomes an opaque and fragile mesh. The AMD × VSR collaboration aims to make routing a **first-class, GPU-accelerated infrastructure component**—not an ad-hoc script glued between services.

## VSR Core Capabilities

### 1. Signal-Based Routing for Multi-LoRA Deployments

VSR provides multiple routing strategies to match different use cases:

- **Keyword-based routing**: Simple pattern matching for fast, deterministic routing
- **Domain classification**: Intent-aware adapter selection using trained classifiers
- **Embedding-based semantic similarity**: Nuanced routing based on semantic understanding
- **Fact-checking and verification routing**: High-stakes queries routed to specialized verification pipelines

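To make the dispatch concrete, here is a minimal sketch of how the first and third signals can be layered: keyword rules take the fast path, with embedding similarity as the fallback. The route table, exemplar queries, threshold, and model names are hypothetical, not VSR's actual API.

```python
# Sketch of signal-based routing: keyword rules first, embedding
# similarity second. All names and thresholds are illustrative only.
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class Route:
    name: str        # route label, e.g. "code"
    model: str       # hypothetical target model or LoRA adapter
    keywords: tuple  # fast-path patterns for keyword routing
    exemplar: str    # reference query for similarity routing


ROUTES = [
    Route("code", "llama-3-8b-code-lora", ("traceback", "compile error"),
          "Fix this Python function that raises an exception"),
    Route("finance", "llama-3-8b-finance-lora", ("balance sheet", "ebitda"),
          "Summarize the quarterly earnings report"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Pre-compute exemplar embeddings once; queries are embedded per request.
EXEMPLARS = encoder.encode([r.exemplar for r in ROUTES],
                           normalize_embeddings=True)


def route(query: str, threshold: float = 0.5) -> str:
    q = query.lower()
    # Signal 1: keyword matching -- fast and deterministic.
    for r in ROUTES:
        if any(k in q for k in r.keywords):
            return r.model
    # Signal 2: embedding-based semantic similarity as a fallback.
    q_vec = encoder.encode(query, normalize_embeddings=True)
    sims = EXEMPLARS @ q_vec  # cosine similarity on normalized vectors
    best = int(np.argmax(sims))
    # Low-confidence queries fall through to a general-purpose model.
    return ROUTES[best].model if sims[best] >= threshold else "llama-3-8b"
```

In a full router, trained domain classifiers and fact-checking verifiers would slot into the same cascade as additional stages between the two shown here.
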
### 2. Cross-Instance Intelligence

VSR enables shared state and optimization across all vLLM instances:

- **Response API**: Centralized response storage enabling stateful multi-turn conversations
- **Semantic Cache**: Significant token reduction through cross-instance vector similarity matching

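To illustrate the semantic cache, the sketch below embeds each incoming query and matches it against previously answered ones; a hit short-circuits the backend call entirely. The in-process lists stand in for the shared vector store a real cross-instance deployment would use, and the threshold is illustrative.

```python
# Minimal sketch of a cross-instance semantic cache. In production the
# two lists would be a shared vector database visible to all instances.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache_vectors: list[np.ndarray] = []
cache_responses: list[str] = []


def lookup(query: str, threshold: float = 0.92) -> str | None:
    """Return a cached response if a semantically similar query was seen."""
    if not cache_vectors:
        return None
    q = encoder.encode(query, normalize_embeddings=True)
    sims = np.stack(cache_vectors) @ q  # cosine similarity
    best = int(np.argmax(sims))
    return cache_responses[best] if sims[best] >= threshold else None


def store(query: str, response: str) -> None:
    """Record a served response so any instance can reuse it later."""
    cache_vectors.append(encoder.encode(query, normalize_embeddings=True))
    cache_responses.append(response)
```
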
### 3. Enterprise-Grade Guardrails

From single-turn to multi-turn conversations, VSR provides:

- **PII Detection**: Prevent sensitive information leakage
- **Jailbreak Prevention**: Block malicious prompt injection attempts
- **Hallucination Detection**: Verify response reliability for critical domains
- **Super Alignment**: Ensure AI systems remain aligned with human values and intentions as they scale toward AGI capabilities

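The sketch below shows how the first two guardrails might be chained at request time; the regex patterns and the classifier hook are illustrative stand-ins for VSR's trained detectors.

```python
# Sketch of a pre-routing guardrail pass: regex-based PII screening
# plus a pluggable jailbreak classifier. Patterns and the classifier
# interface are illustrative, not VSR's actual API.
import re
from typing import Callable

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def check_request(text: str,
                  jailbreak_score: Callable[[str], float],
                  max_risk: float = 0.5) -> tuple[bool, list[str]]:
    """Return (allowed, reasons): block on PII hits or high jailbreak risk."""
    reasons = [f"pii:{name}" for name, pat in PII_PATTERNS.items()
               if pat.search(text)]
    if jailbreak_score(text) > max_risk:
        reasons.append("jailbreak")
    return (not reasons, reasons)


# Usage with a stub classifier; a real deployment would call a trained model.
allowed, why = check_request("My SSN is 123-45-6789", lambda t: 0.1)
print(allowed, why)  # False ['pii:ssn']
```
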
---

## Running VSR on AMD GPUs: Two Deployment Paths

Our near-term objective is execution-oriented: **deliver a production-grade VSR solution that runs efficiently on AMD GPUs**. We're building two complementary deployment paths:

![](/assets/figures/semantic-router/amd-1.png)

### Path 1: vLLM-Based Inference on AMD GPUs

Using the vLLM engine on AMD GPUs, we run:

**Router SLMs** for:

- Task and intent classification
- Risk scoring and safety gating
- Tool and workflow selection

**LLMs and specialized models** for:

- General assistance
- Domain-specific tasks (finance, legal, code, healthcare)

VSR sits above as the decision fabric, consuming semantic similarity, business metadata, latency constraints, and compliance requirements to perform **dynamic routing** across models and endpoints.

AMD GPUs provide the throughput and memory footprint needed to run **router SLMs + multiple LLMs** in the same cluster, supporting high-QPS workloads with stable latency—not just one-off demos.

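From the client's point of view the routing is transparent: requests go to one endpoint and VSR picks the backend. A minimal sketch, assuming the router exposes an OpenAI-compatible API; the URL and the `auto` model alias are illustrative placeholders.

```python
# Client-side sketch: the router fronts the vLLM fleet behind a single
# OpenAI-compatible endpoint. URL and model alias are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://vsr.example.internal/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",  # let the router pick the backend model / LoRA adapter
    messages=[{"role": "user",
               "content": "Review this contract clause for liability risk."}],
)
print(resp.choices[0].message.content)
```
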
### Path 2: Lightweight ONNX-Based Routing

Not all routing needs a full inference stack. For ultra-high-frequency, latency-sensitive stages at the “front door” of the system, we're enabling:

- Exporting router SLMs to **ONNX**
- Running them on AMD GPUs through ONNX Runtime
- Forwarding complex generative work to vLLM or other back-end LLMs

This lightweight path is designed for:

- Front-of-funnel traffic classification and triage
- Large-scale policy evaluation and offline experiments
- Enterprises that want to **standardize on AMD GPUs while keeping model providers flexible**

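A sketch of this export-and-serve flow, using a small off-the-shelf classifier as a stand-in for a router SLM. The `ROCMExecutionProvider` assumes an ONNX Runtime build with ROCm support; providers are listed in priority order.

```python
# Sketch: export a small router classifier to ONNX, then serve it with
# ONNX Runtime. The model is a stand-in for an actual router SLM.
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder SLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

# Export with dynamic batch/sequence axes so one graph serves all traffic.
sample = tok("warm-up", return_tensors="pt")
torch.onnx.export(
    model, (sample["input_ids"], sample["attention_mask"]), "router.onnx",
    input_names=["input_ids", "attention_mask"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
)

# Requires an ONNX Runtime build with ROCm support (an assumption here).
session = ort.InferenceSession(
    "router.onnx",
    providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
)
enc = tok("Transfer $500 to my savings account", return_tensors="np")
logits = session.run(["logits"], {"input_ids": enc["input_ids"],
                                  "attention_mask": enc["attention_mask"]})[0]
print(logits.argmax(axis=-1))  # predicted route / class id
```
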
## Moving to the Next Stage of Semantic Router

When we first built vLLM Semantic Router, the goal was clear and practical: **intelligent model selection**—routing requests to the right model based on task type, cost constraints, and performance requirements.

![](/assets/figures/semantic-router/amd-2.png)

**vLLM Engine** delivers the foundation—running large models stably and efficiently. **vLLM Semantic Router** provides the scheduler—dispatching requests to the right capabilities.

But as AI systems move toward AGI-level capabilities, this framing feels incomplete. It's like discussing engine efficiency without addressing brakes, traffic laws, or safety systems.

**The real challenge isn't making models more powerful—it's maintaining control as they become more powerful.**

### From Traffic Director to Intelligence Control Plane

Working with AMD, we've come to see Semantic Router's evolution differently. Its potential lies not just in "routing," but in **governance**—transforming from a traffic director into an **Intelligence Control Plane** for the AGI era.

This shift changes how we think about the collaboration. We're not just optimizing for throughput and latency on AMD hardware. We're building a **constitutional layer** for AI systems—one defined by responsibilities, not just features.

### Three Control Lifelines That Must Be Secured

As we architect VSR on AMD's infrastructure, we're designing around three critical control points that determine whether AI systems remain trustworthy at scale:

![](/assets/figures/semantic-router/amd-3.png)

**1. World Output (Actions)**

The most dangerous capability of powerful models isn't reasoning—it's **execution**. Every action that changes the world (tool calls, database writes, API invocations, configuration changes) must pass through an external checkpoint before execution.

With AMD GPUs, we can run these checkpoints **inline at production scale**—evaluating risk, enforcing policies, and logging decisions without becoming a bottleneck.

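One shape such a checkpoint could take, sketched minimally: every proposed tool call is risk-scored, matched against a policy table, and audit-logged before anything executes. The policy entries, risk scores, and log format are hypothetical.

```python
# Sketch of an action checkpoint gating tool calls before execution.
import json
import time

# Hypothetical policy table: which tools may run, at what risk level.
POLICY = {
    "db.write": {"max_risk": 0.3, "requires_approval": True},
    "search.web": {"max_risk": 0.9, "requires_approval": False},
}


def checkpoint(tool: str, args: dict, risk: float) -> bool:
    """Gate a proposed tool call: allow, block, or defer to a human."""
    rule = POLICY.get(tool)
    allowed = (rule is not None
               and risk <= rule["max_risk"]
               and not rule["requires_approval"])
    # Every decision is audit-logged, allowed or not.
    print(json.dumps({"ts": time.time(), "tool": tool, "args": args,
                      "risk": risk, "allowed": allowed}))
    return allowed


if checkpoint("search.web", {"q": "ROCm docs"}, risk=0.2):
    pass  # execute the call only after the gate approves it
```
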
**2. World Input (Inputs)**

External inputs are untrusted by default. Web pages, retrieval results, uploaded files, and plugin returns can all carry prompt injection, data poisoning, or privilege escalation attempts.

VSR on AMD infrastructure provides **border inspection** before data reaches the model—running classifiers, sanitizers, and verification checks as a first line of defense, not an afterthought.

**3. Long-Term State (Memory/State)**

The hardest failures to fix aren't wrong answers—they're **wrong answers that get written into long-term memory, system state, or automated workflows**.

Our collaboration focuses on making state management a first-class concern: who can write, what can be written, how to undo, and how to isolate contamination. AMD's GPU infrastructure enables us to run continuous verification and rollback mechanisms that keep state trustworthy over time.

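Those four concerns map naturally onto a small interface. A minimal sketch, with the verifier hook and journal format as illustrative placeholders:

```python
# Sketch of governed state writes: every memory update passes a
# verifier and is journaled so contamination can be rolled back.
from typing import Callable


class GovernedMemory:
    def __init__(self, verify: Callable[[str, str], bool]):
        self.store: dict[str, str] = {}
        self.journal: list[tuple[str, str | None]] = []  # (key, old value)
        self.verify = verify  # policy hook: who/what may be written

    def write(self, key: str, value: str) -> bool:
        if not self.verify(key, value):
            return False  # rejected writes never touch long-term state
        self.journal.append((key, self.store.get(key)))
        self.store[key] = value
        return True

    def rollback(self, steps: int = 1) -> None:
        """Undo recent writes, isolating contaminated state."""
        for _ in range(min(steps, len(self.journal))):
            key, old = self.journal.pop()
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
```
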
### The Ultimate Question

When these three lifelines are secured, Semantic Router stops being just a model selector. It becomes the answer to a fundamental question:

**How do we transform alignment from a training-time aspiration into a runtime institution?**

This is what the AMD × vLLM Semantic Router collaboration is really about: building not just faster routing, but **trustworthy, governable AI infrastructure** that can scale safely toward AGI-level capabilities.

## Long-Term Vision and Ongoing Work

Our collaboration with AMD extends beyond near-term deployment to building the foundation for next-generation AI infrastructure. We're working on several long-term initiatives:

### Training a Next-Generation Router Model on AMD GPUs

As a longer-term goal, we aim to explore training a **next-generation, encoder-only router model** on AMD GPUs, optimized for semantic routing, retrieval-augmented generation (RAG), and safety classification.

While recent encoder models (e.g., ModernBERT) show strong performance, they remain limited in context length, multilingual coverage, and alignment with emerging long-context attention techniques. This effort focuses on advancing encoder capabilities using AMD hardware, particularly for **long-context, high-throughput representation learning**.

The outcome will be an **open encoder model** designed to integrate with vLLM Semantic Router and modern AI pipelines, strengthening the retrieval and routing layers of AI systems while expanding hardware-diverse training and deployment options for the community and industry.

### Community Public Beta on AMD Infrastructure

As part of this collaboration, each major release of vLLM Semantic Router will be accompanied by a **public beta environment** hosted on AMD-sponsored infrastructure, available free of charge to the community.

These public betas will allow users to:

- Validate new routing, caching, and safety features
- Gain hands-on experience with Semantic Router running on AMD GPUs
- Provide early feedback that helps improve performance, usability, and system design

By lowering the barrier to experimentation and validation, this initiative aims to strengthen the vLLM ecosystem, accelerate real-world adoption, and ensure that new Semantic Router capabilities are shaped by community input before broader production deployment.

### AMD GPU-Powered CI/CD and End-to-End Testbed

In the long run, we aim to use AMD GPUs to underpin how **VSR as an open-source project is built, validated, and shipped**, ensuring VSR works consistently well with AMD GPUs as the project grows.

We are designing a GPU-backed **CI/CD and end-to-end testbed** where:

- Router SLMs, LLMs, domain models, retrieval, and tools run together on AMD GPU clusters
- Multi-domain, multi-risk-level datasets are replayed as traffic
- Each VSR change runs through an automated evaluation pipeline, including:
  - Routing and policy regression tests
  - A/B comparisons of new vs. previous strategies
  - Stress tests on latency, cost, and scalability
  - Focused suites for hallucination mitigation and compliance behavior

The target state is clear:

> **Every VSR release comes with a reproducible, GPU-driven evaluation report, not just a changelog.**

AMD GPUs, in this model, are not only for serving models; they are the **verification engine for the routing infrastructure itself**.

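As a flavor of the regression-test stage, the sketch below replays labeled traffic and asserts that routing decisions have not drifted between releases; the dataset file and the `route` entry point are hypothetical.

```python
# Sketch of a routing regression test: replay labeled traffic and
# assert routing decisions match the expected baseline. The dataset
# and the imported router are illustrative stand-ins.
import json

import pytest

from my_router import route  # hypothetical routing entry point

with open("replay_traffic.jsonl") as f:
    CASES = [json.loads(line) for line in f]


@pytest.mark.parametrize("case", CASES)
def test_routing_decision(case):
    decision = route(case["query"])
    assert decision == case["expected_model"], (
        f"routing drifted for {case['query']!r}: "
        f"got {decision}, expected {case['expected_model']}"
    )
```
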
### An AMD-Backed Mixture-of-Models Playground

In parallel, we are planning an **online Mixture-of-Models playground** powered by AMD GPUs, open to the community and partners.

This playground will allow users to:

- Experiment with different routing strategies and model topologies under real workloads
- Observe, in a visual way, how VSR decides which model to call, when to retrieve, and when to apply additional checks or fallbacks
- Compare **quality, latency, and cost trade-offs** across configurations

For model vendors, tool builders, and platform providers, this becomes a **neutral, AMD GPU-backed test environment** to:

- Integrate their components into a MoM stack
- Benchmark under realistic routing and governance constraints
- Showcase capabilities within a transparent, observable system

## Why This Collaboration Matters

Through the AMD × vLLM Semantic Router collaboration, we are aiming beyond “does this model run on this GPU?”

The joint ambitions are:

* To define a **reference architecture for intelligent, GPU-accelerated routing** on AMD platforms, including:
  * vLLM-based inference paths,
  * ONNX-based lightweight router paths,
  * multi-model coordination and safety enforcement.
* To treat routing as **trusted infrastructure**, supported by:
  * GPU-powered CI/CD and end-to-end evaluation,
  * hallucination-aware and risk-aware policies,
  * online learning and adaptive strategies.
* To provide the ecosystem with a **long-lived, AMD GPU–backed MoM playground** where ideas, models, and routing policies can be tested and evolved in the open.

In short, this is about **co-building trustworthy, evolvable multi-model AI infrastructure**—with AMD GPUs as a core execution and validation layer, and vLLM Semantic Router as the intelligent control plane that makes the entire system understandable, governable, and ready for real workloads.

The technical roadmap—hallucination detection, online learning, multi-model orchestration—serves this larger mission. AMD's hardware provides the execution layer. VSR provides the control plane. Together, we're building the foundation for AI systems that remain aligned not through hope, but through **architecture**.

## Acknowledgements

We would like to thank the many talented people who have contributed to this collaboration:

- **AMD**: Andy Luo, Haichen Zhang, and the AMD AI Product Marketing and Engineering teams
- **vLLM SR**: Xunzhuo Liu, Huamin Chen, Chen Wang, Yue Zhu, and the vLLM Semantic Router OSS team

We're excited to keep refining and expanding our optimizations to unlock even greater capabilities in the weeks and months ahead!

## Join Us

**Looking for Collaborations!** Calling all passionate community developers and researchers: join us in training the next-generation router model on AMD GPUs and building the future of trustworthy AI infrastructure.

Interested? Reach out to us:

- Haichen Zhang: [email protected]
- Xunzhuo Liu: [email protected]

**Resources**:

- [AMD ROCm™ Software](https://www.amd.com/en/products/software/rocm.html)
- [vLLM Semantic Router GitHub Repo](https://github.com/vllm-project/semantic-router)
- [vLLM Semantic Router Documentation](https://vllm-semantic-router.com)