
Support running embedding models alongside generation models #26

@jmcallister83

Description: Currently, LlamaBarn appears to limit execution to a single model at a time via the llama.cpp API. This creates a bottleneck when using AI coding assistants (like Roo Code) that require two distinct models: one for chat/generation and a separate model for generating embeddings to index the codebase.

Proposed Solution: Allow LlamaBarn to load and serve two models simultaneously, ideally on separate ports or endpoints, as sketched after the list below:

Model A (Generation): Handles standard prompts/chat.

Model B (Embeddings): Handles vector indexing for RAG workflows.
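If both models were exposed at once, a client such as Roo Code could hit each one independently. Here is a minimal client-side sketch, assuming each model gets its own llama.cpp OpenAI-compatible endpoint; the ports (8080/8081) and payloads are illustrative placeholders, not existing LlamaBarn behavior:

```python
# Hypothetical setup: LlamaBarn serving two models on separate ports.
import requests

GENERATION_URL = "http://localhost:8080/v1/chat/completions"  # Model A (placeholder port)
EMBEDDING_URL = "http://localhost:8081/v1/embeddings"         # Model B (placeholder port)

# Chat request handled by the generation model.
chat = requests.post(GENERATION_URL, json={
    "messages": [{"role": "user", "content": "Explain this function to me."}],
})
print(chat.json()["choices"][0]["message"]["content"])

# Embedding request handled by the embedding model at the same time,
# with no need to stop and swap out the generation model.
emb = requests.post(EMBEDDING_URL, json={"input": "def add(a, b): return a + b"})
print(len(emb.json()["data"][0]["embedding"]))  # vector dimensionality
```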

Use Case: When using Roo Code, the user needs to chat with a robust LLM (e.g., Qwen3 Coder) while the tool simultaneously indexes the workspace using a lightweight embedding model (e.g., Nomic-Embed). Currently, this requires stopping/starting models or running a separate instance, disrupting the workflow.
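For comparison, the separate-instance workaround mentioned above looks roughly like this: leave the generation model running in LlamaBarn and manually start a standalone llama.cpp server for embeddings. A sketch, assuming the llama-server binary is on PATH; the model filename and port are placeholders:

```python
# Workaround sketch: hand-launch a second llama.cpp server for embeddings
# alongside LlamaBarn, on a port that does not clash with it.
import subprocess

embed_server = subprocess.Popen([
    "llama-server",
    "-m", "nomic-embed-text.gguf",  # placeholder path to the embedding model
    "--port", "8081",               # separate port from the LlamaBarn-managed model
    "--embeddings",                 # enable the embeddings endpoint
])

# ... index the workspace via http://localhost:8081/v1/embeddings ...

embed_server.terminate()  # tear down the extra server when indexing is done
```

Having LlamaBarn manage this second process itself would remove the manual start/stop step entirely.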
