Dynamic Model Routing with Semantic Evaluation and LoRA Adapters

Welcome to the LLM Dynamic Model Routing quickstart! Use this to quickly deploy a semantic router with LoRA adapters and a fallback base model using LiteLLM and vLLM.

Detailed description

This quickstart demonstrates how to build a cost-efficient and scalable LLM deployment by combining semantic routing, LoRA adapters, and vLLM on Red Hat OpenShift AI. Incoming user requests are processed through LiteLLM, semantically evaluated, and routed to the most appropriate LoRA adapter or the default base model to ensure more precise and context-aware responses without hosting multiple full models.

Architecture diagrams

The architecture integrates several components to ensure efficient request handling and accurate responses. Open WebUI provides an intuitive interface for users to interact with the system. LiteLLM acts as a proxy and uses the semantic router to determine the most suitable destination for each request, whether that is the base model or a specialized LoRA adapter. Based on this decision, LiteLLM forwards the request to vLLM for inference.

Architecture diagram: vllm_lora_litellm.jpg


Requirements

Minimum hardware requirements

This demo is designed to run a model using GPU acceleration. The following hardware resources are required:

  • CPU: 1 vCPU
  • Memory: 4 GiB
  • GPU: 1 NVIDIA GPU (e.g., A10, A100, L40S, or similar)

Required software

  • Red Hat OpenShift
  • Red Hat OpenShift AI

Required permissions

  • Standard user. No elevated cluster permissions required.

Install

Let’s dive into the technical aspects of building this setup with a practical example. We use the Phi-2 LLM as the base model and two LoRA adapters: Phi2-Doctor28e and phi-2-dcot.
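Under the hood, vLLM serves the Phi-2 base model with both adapters registered as LoRA modules, so a single GPU-backed server can answer for all three model names. As a rough sketch of the equivalent vLLM invocation (the chart sets this up for you; the adapter paths below are placeholders and the exact flags may differ from what the chart uses):

vllm serve microsoft/phi-2 \
  --enable-lora \
  --lora-modules doctor=/models/Phi2-Doctor28e dcot=/models/phi-2-dcot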

This repository packages all the required components as a Helm chart, so you can deploy everything with a single command.

Clone

git clone https://github.com/rh-ai-quickstart/dynamic-model-router
cd dynamic-model-router/chart

Create a project

PROJECT="dynamic-model-router-demo"
oc new-project ${PROJECT}

Install with Helm

helm install dynamic-model-router . --namespace ${PROJECT}
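The chart creates the vLLM, LiteLLM, and Open WebUI workloads described above. You can watch the pods until everything reports Ready before validating the deployment:

oc get pods --namespace ${PROJECT} -w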

Validating the deployment

After deployment, verify that the setup works as expected by logging into Open WebUI and sending various queries. By checking the LiteLLM logs, you can ensure that requests are correctly routed to either the base model or the relevant adapter.
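One convenient way to follow those logs is from the command line. The deployment name below is an assumption; substitute whatever oc get deployments --namespace ${PROJECT} reports for the LiteLLM proxy:

oc logs -f deployment/litellm --namespace ${PROJECT}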

For instance, in a clinical healthcare scenario, if you ask, “Is there medication for my symptoms?”, the query is routed to the doctor adapter (Phi2-Doctor28e). You can then verify the routing by checking the LiteLLM logs, as shown below.

Please keep in mind that while the model may provide useful information, all output should be reviewed for suitability, and it is essential to consult a professional for personalized advice.

To validate the connection between Open WebUI and the LiteLLM proxy, open the model selector in the top left; you should see the phi2, dcot, and doctor models listed. To confirm that the proxy is working, you should see logs similar to the following in the LiteLLM pod as you ask questions.

LiteLLM: Proxy initialized with Config, Set models:
dcot
doctor
phi2
INFO: 10.131.166.110:55626 - "GET /models HTTP/1.1" 200 OK
INFO: 10.131.166.110:52568 - "GET /models HTTP/1.1" 200 OK
[RouteChoice(name='dcot', function_call=None, similarity_score=0.6218189497571418), RouteChoice(name='doctor', function_call=None, similarity_score=0.6802293320602901)]
doctor
INFO: 10.131.166.110:52576 - "POST /chat/completions HTTP/1.1" 200 OK
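You can also exercise the proxy directly from the command line instead of going through Open WebUI. The sketch below is illustrative only: it assumes the LiteLLM service is named litellm and listens on port 4000, that requests are sent against the base model name phi2 (the pre-invoke router then selects the adapter), and that no API key is enforced; adjust the service name, port, and headers to match your deployment.

oc port-forward svc/litellm 4000:4000 --namespace ${PROJECT} &
curl -s http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi2", "messages": [{"role": "user", "content": "Is there medication for my symptoms?"}]}'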

Configuring the Semantic Router

The semantic router is invoked by a LiteLLM pre-invoke function and runs before the call to the actual LLM endpoint is made. This function uses the semantic-router framework to decide which model the request should be sent to.

The code is located in litellm-config/custom_router.py if you'd like to make changes.
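If you change the routing logic or the routes it defines, redeploy so LiteLLM picks up the new file. Assuming the file is packaged with the chart (as in this repository), a Helm upgrade from the chart directory is enough; otherwise, restart the LiteLLM pod after updating the configuration:

helm upgrade dynamic-model-router . --namespace ${PROJECT}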

Uninstall

helm uninstall dynamic-model-router --namespace ${PROJECT} 
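If the project was created only for this demo, you can remove it entirely as well:

oc delete project ${PROJECT}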
