Description: Currently, LlamaBarn appears to limit execution to a single model at a time via the llama.cpp API. This creates a bottleneck when using AI coding assistants (like Roo Code) that require two distinct models: one for chat/generation and a separate model for generating embeddings to index the codebase.
Proposed Solution: Allow LlamaBarn to load and serve two models simultaneously, ideally on separate ports or endpoints (see the sketch after this list):
- Model A (Generation): Handles standard prompts/chat.
- Model B (Embeddings): Handles vector indexing for RAG workflows.
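For illustration, a minimal sketch of what the client side could look like if the two models were exposed on separate ports. The ports and model identifiers here are hypothetical, not LlamaBarn's current behavior; the endpoint paths follow llama.cpp's OpenAI-compatible server API.

```python
import requests

# Hypothetical ports: Model A (generation) and Model B (embeddings)
# served side by side instead of swapping one model out for the other.
GENERATION_URL = "http://localhost:8080"
EMBEDDINGS_URL = "http://localhost:8081"

# Standard chat request, handled by Model A.
chat = requests.post(
    f"{GENERATION_URL}/v1/chat/completions",
    json={
        "model": "qwen3-coder",  # hypothetical model id
        "messages": [{"role": "user", "content": "Explain this function."}],
    },
).json()
print(chat["choices"][0]["message"]["content"])

# Embedding request for the RAG index, handled by Model B concurrently,
# without unloading Model A.
emb = requests.post(
    f"{EMBEDDINGS_URL}/v1/embeddings",
    json={"model": "nomic-embed-text", "input": "def hello(): ..."},
).json()
print(len(emb["data"][0]["embedding"]))  # embedding dimensionality
```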
Use Case: When using Roo Code, the user needs to chat with a robust LLM (e.g., Qwen3 Coder) while the tool simultaneously indexes the workspace using a lightweight embedding model (e.g., Nomic-Embed). Today this requires stopping and restarting models, or running a separate server instance, which disrupts the workflow.
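To make the workflow concrete, here is a hedged sketch of what the coding assistant effectively does: one thread holds a chat turn against the generation endpoint while another embeds workspace files against the embedding endpoint. With only one model loaded at a time, one of these request streams has to wait or force a model swap. The URLs, ports, and model names below are assumptions for illustration.

```python
import concurrent.futures
import pathlib
import requests

GENERATION_URL = "http://localhost:8080/v1/chat/completions"  # hypothetical
EMBEDDINGS_URL = "http://localhost:8081/v1/embeddings"        # hypothetical

def embed(chunk: str) -> list[float]:
    # Embed one workspace chunk with the lightweight model (Model B).
    r = requests.post(EMBEDDINGS_URL,
                      json={"model": "nomic-embed-text", "input": chunk})
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

def ask(question: str) -> str:
    # Send a chat turn to the generation model (Model A).
    r = requests.post(GENERATION_URL,
                      json={"model": "qwen3-coder",
                            "messages": [{"role": "user", "content": question}]})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Index the workspace while a chat turn is in flight. With a single
# loaded model, one of these request streams would stall or force a swap.
chunks = [p.read_text(errors="ignore") for p in pathlib.Path(".").rglob("*.py")]
with concurrent.futures.ThreadPoolExecutor() as pool:
    answer = pool.submit(ask, "Summarize this repository.")
    vectors = list(pool.map(embed, chunks))
print(answer.result())
print(f"indexed {len(vectors)} chunks")
```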