Support running embedding models alongside generation models #27
jmcallister83 started this conversation in Ideas
Description: Currently, LlamaBarn appears to limit execution to a single model at a time via the llama.cpp API. This creates a bottleneck when using AI coding assistants (like Roo Code) that require two distinct models: one for chat/generation and a separate model for generating embeddings to index the codebase.

Proposed Solution: Allow LlamaBarn to load and serve two models simultaneously, ideally on separate ports or endpoints.

Model A (Generation): handles standard prompts/chat.
Model B (Embeddings): handles vector indexing for RAG workflows.

Use Case: When using Roo Code, the user needs to chat with a robust LLM (e.g., Qwen3 Coder) while the tool simultaneously indexes the workspace with a lightweight embedding model (e.g., Nomic-Embed). Currently, this requires stopping and starting models or running a separate instance, which disrupts the workflow.

Replies: 1 comment

Thanks for the feature request! I understand the need for running multiple models simultaneously, especially for coding assistants. Currently, LlamaBarn only supports one model at a time and doesn't include embedding models. I'd love to add concurrent model support eventually, but I'm hesitant about prioritizing this now. The main concerns are:

This is definitely something we'll consider as LlamaBarn grows. We'll keep an eye on how agents evolve and gather more user feedback.
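For reference, the "running a separate instance" workaround mentioned in the request amounts to launching two llama.cpp servers by hand, outside of LlamaBarn. Below is a minimal sketch of that setup; the model paths, ports, and the exact spelling of the embeddings flag are assumptions about a local llama.cpp build, not anything LlamaBarn exposes today.

```python
"""Sketch: run a generation model and an embedding model side by side by
launching two independent llama-server processes on different ports.
Model paths are placeholders; adjust everything for your local setup."""
import subprocess
import time

import requests  # pip install requests

# Hypothetical local GGUF files -- substitute your own models.
GEN_MODEL = "models/qwen3-coder.gguf"
EMB_MODEL = "models/nomic-embed-text.gguf"


def start_server(model_path: str, port: int, embedding: bool) -> subprocess.Popen:
    """Start one llama-server instance. The embedding flag enables the
    embeddings endpoint (spelled --embedding or --embeddings depending
    on the llama.cpp build)."""
    cmd = ["llama-server", "-m", model_path, "--host", "127.0.0.1", "--port", str(port)]
    if embedding:
        cmd.append("--embedding")
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    gen = start_server(GEN_MODEL, 8080, embedding=False)  # Model A: chat/generation
    emb = start_server(EMB_MODEL, 8081, embedding=True)   # Model B: embeddings for indexing
    time.sleep(10)  # crude wait for both servers to finish loading

    # Chat/generation requests go to the first server's OpenAI-compatible endpoint ...
    chat = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"messages": [{"role": "user", "content": "Hello"}]},
    ).json()

    # ... while codebase indexing sends text to the second server's embeddings endpoint.
    vectors = requests.post(
        "http://127.0.0.1:8081/v1/embeddings",
        json={"input": ["def add(a, b): return a + b"]},
    ).json()

    print(chat)
    print(vectors)

    gen.terminate()
    emb.terminate()
```

An OpenAI-compatible client such as Roo Code would then be pointed at port 8080 for chat and port 8081 for workspace indexing, mirroring the Model A / Model B split proposed above.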