Description: Currently, LlamaBarn appears to limit execution to a single model at a time via the llama.cpp API. This creates a bottleneck when using AI coding assistants (like Roo Code) that require two distinct models: one for chat/generation and a separate model for generating embeddings to index the codebase.
Proposed Solution: Allow LlamaBarn to load and serve two models simultaneously, ideally on separate ports or endpoints (see the sketch after this list):
- Model A (Generation): Handles standard prompts/chat.
- Model B (Embeddings): Handles vector indexing for RAG workflows.
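For illustration, a minimal sketch of what the client side could look like if the two models were exposed on separate ports. The ports and model identifiers here are hypothetical, not LlamaBarn's current behavior; the endpoint paths follow llama.cpp's OpenAI-compatible server API.

```python
import requests

# Hypothetical ports: Model A (generation) and Model B (embeddings)
# served side by side instead of swapping one model out for the other.
GENERATION_URL = "http://localhost:8080"
EMBEDDINGS_URL = "http://localhost:8081"

# Standard chat request, handled by Model A.
chat = requests.post(
    f"{GENERATION_URL}/v1/chat/completions",
    json={
        "model": "qwen3-coder",  # hypothetical model id
        "messages": [{"role": "user", "content": "Explain this function."}],
    },
).json()
print(chat["choices"][0]["message"]["content"])

# Embedding request for the RAG index, handled by Model B concurrently,
# without unloading Model A.
emb = requests.post(
    f"{EMBEDDINGS_URL}/v1/embeddings",
    json={"model": "nomic-embed-text", "input": "def hello(): ..."},
).json()
print(len(emb["data"][0]["embedding"]))  # embedding dimensionality
```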
Use Case: When using Roo Code, the user needs to chat with a robust LLM (e.g., Qwen3 Coder) while the tool simultaneously indexes the workspace using a lightweight embedding model (e.g., Nomic-Embed). Today this requires stopping and restarting models, or running a separate server instance, which disrupts the workflow.
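To make the workflow concrete, here is a hedged sketch of what the coding assistant effectively does: one thread holds a chat turn against the generation endpoint while another embeds workspace files against the embedding endpoint. With only one model loaded at a time, one of these request streams has to wait or force a model swap. The URLs, ports, and model names below are assumptions for illustration.

```python
import concurrent.futures
import pathlib
import requests

GENERATION_URL = "http://localhost:8080/v1/chat/completions"  # hypothetical
EMBEDDINGS_URL = "http://localhost:8081/v1/embeddings"        # hypothetical

def embed(chunk: str) -> list[float]:
    # Embed one workspace chunk with the lightweight model (Model B).
    r = requests.post(EMBEDDINGS_URL,
                      json={"model": "nomic-embed-text", "input": chunk})
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

def ask(question: str) -> str:
    # Send a chat turn to the generation model (Model A).
    r = requests.post(GENERATION_URL,
                      json={"model": "qwen3-coder",
                            "messages": [{"role": "user", "content": question}]})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Index the workspace while a chat turn is in flight. With a single
# loaded model, one of these request streams would stall or force a swap.
chunks = [p.read_text(errors="ignore") for p in pathlib.Path(".").rglob("*.py")]
with concurrent.futures.ThreadPoolExecutor() as pool:
    answer = pool.submit(ask, "Summarize this repository.")
    vectors = list(pool.map(embed, chunks))
print(answer.result())
print(f"indexed {len(vectors)} chunks")
```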