
feat(Add OpenAI-compatible backend support) #619

Open

loopyd wants to merge 1 commit into tobi:main from loopyd:feat/openai-compatible-llamaswap

Conversation


@loopyd loopyd commented May 2, 2026

Problem

QMD currently assumes a local llama.cpp-style setup for generation, embeddings, and reranking, and that assumption is baked in. This rules out any user customization of llama.cpp (such as, for example, using TheTom's turboquant fork...) and doesn't integrate well with existing homelab servers, since QMD tries to run everything locally on the machine.

That makes it hard to use QMD with a local OpenAI-compatible server setup such as llama-swap, which is what I tested this PR with. If the server already exposes the same models through /v1/chat/completions, /v1/embeddings, and /v1/rerank, why can't QMD use them? Now it can!

Solution

This PR adds an OpenAI-compatible backend alongside the existing llama.cpp one, letting QMD talk to a compatible local server instead of requiring direct local model access. That means you can now run QMD on your laptop while your homelab does the tensor crunching!

It also makes the CLI and store paths respect configured model aliases, so users can route QMD generation, embedding, and reranking through named server-side models (e.g. qmd-generate, qmd-embed, and qmd-rerank, which is how I set up my server for testing this PR).

What's Changed?

  • Added config support for:
    • llm.provider
    • llm.baseUrl
    • llm.apiKey
  • Updated QMD to select the OpenAI-compatible backend when llm.provider is set.
  • Routed query expansion, embedding, and reranking through the configured remote model aliases.
  • Updated vector search and rerank paths to use the configured embed and rerank model names instead of hardcoded defaults.
  • Updated qmd embed so it uses the configured embedding alias instead of forcing the built-in default model name.
  • Updated help and status output to show the active configured models and backend more clearly.
  • Added rerank handling that recovers from oversized rerank requests by splitting batches and truncating oversized single documents when needed (see the sketch after this list).
  • Added and updated tests for the new backend behavior.
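
For reviewers, the rerank recovery logic has roughly the following shape. This is an illustrative sketch only, assuming a hypothetical rerankBatch(query, docs) call and an OversizedRequestError signal; it does not mirror the PR's actual identifiers:

// Sketch of the split-and-truncate recovery strategy; all names hypothetical.
type RerankResult = { index: number; score: number };

// Stand-in for however the backend signals a too-large request.
class OversizedRequestError extends Error {}

async function rerankWithRecovery(
  query: string,
  docs: string[],
  rerankBatch: (query: string, docs: string[]) => Promise<RerankResult[]>,
  maxDocChars = 8000,
): Promise<RerankResult[]> {
  try {
    return await rerankBatch(query, docs);
  } catch (err) {
    if (!(err instanceof OversizedRequestError)) throw err;
    if (docs.length === 1) {
      // A single document is still too large: truncate it and retry.
      return rerankBatch(query, [docs[0].slice(0, maxDocChars)]);
    }
    // Otherwise split the batch in half and recurse, shifting indices in
    // the second half back to their original positions.
    const mid = Math.ceil(docs.length / 2);
    const [left, right] = await Promise.all([
      rerankWithRecovery(query, docs.slice(0, mid), rerankBatch, maxDocChars),
      rerankWithRecovery(query, docs.slice(mid), rerankBatch, maxDocChars),
    ]);
    return [...left, ...right.map((r) => ({ ...r, index: r.index + mid }))];
  }
}

The important invariant is that result indices always map back to positions in the original document list, which is what the "rerank maps remote indices back to source files" test exercises.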

Testing

Automated

  1. Run the focused LLM test for the new rerank recovery path:
npx vitest run test/llm.test.ts -t "recovers from oversized rerank requests by splitting and truncating"
  2. Run the existing adjacent OpenAI-compatible rerank mapping test:
npx vitest run test/llm.test.ts -t "rerank maps remote indices back to source files"

Manual

You can test this with llama-swap or any server that exposes OpenAI-compatible chat, embedding, and rerank endpoints.

Option A: Use llama-swap

  1. Start or prepare a local OpenAI-compatible server.
  2. Expose three model IDs, for example (a sample llama-swap config follows this list):
    • qmd-generate
    • qmd-embed
    • qmd-rerank
  3. Make sure the server exposes these routes:
    • POST /v1/chat/completions
    • POST /v1/embeddings
    • POST /v1/rerank
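
For reference, a llama-swap config for those three aliases might look roughly like the following. This is a sketch assuming llama-swap's models/cmd layout; the model paths and llama-server flags are placeholders you would replace with your own build's equivalents:

models:
  qmd-generate:
    cmd: llama-server --port ${PORT} -m /models/your-chat-model.gguf
  qmd-embed:
    cmd: llama-server --port ${PORT} -m /models/your-embedding-model.gguf --embeddings
  qmd-rerank:
    cmd: llama-server --port ${PORT} -m /models/your-reranker.gguf --reranking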

Option B: Roll your own compatible server

  1. Use any local server that follows the same OpenAI-compatible route layout.
  2. Configure one model for generation, one for embeddings, and one for reranking.
  3. Point QMD at that server with the config below. (A minimal stub server sketch follows for quick smoke tests.)
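
If you want to smoke-test the wiring before pointing QMD at real models, a throwaway stub like the one below can stand in. It is a sketch, not any real server: the response shapes follow common OpenAI/Cohere-style conventions, and you would run it with tsx or ts-node:

// stub-server.ts -- fake OpenAI-compatible routes for smoke testing only.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    res.setHeader("Content-Type", "application/json");
    if (req.url === "/v1/chat/completions") {
      res.end(JSON.stringify({
        choices: [{ message: { role: "assistant", content: "stub reply" } }],
      }));
    } else if (req.url === "/v1/embeddings") {
      res.end(JSON.stringify({
        data: [{ index: 0, embedding: new Array(8).fill(0) }],
      }));
    } else if (req.url === "/v1/rerank") {
      const { documents = [] } = JSON.parse(body || "{}");
      res.end(JSON.stringify({
        results: documents.map((_: string, i: number) => ({
          index: i,
          relevance_score: 1 - i * 0.1,
        })),
      }));
    } else {
      res.statusCode = 404;
      res.end(JSON.stringify({ error: "unknown route" }));
    }
  });
});

server.listen(8080, () => console.log("stub listening on :8080"));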

QMD config example

Create or update your QMD config:

models:
  generate: qmd-generate
  embed: qmd-embed
  rerank: qmd-rerank

llm:
  provider: openai-compatible
  baseUrl: http://127.0.0.1:8080/v1
  apiKey: your-local-api-key
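
Note that baseUrl includes the /v1 prefix to match the routes listed above, and apiKey is whatever your local server expects (many local setups accept any non-empty token, presumably sent as a standard Authorization: Bearer header).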

Verify the flow

  1. Check help and status:
qmd --index my-index --help
qmd --index my-index status
  2. Build embeddings:
qmd --index my-index embed
  3. Run a query:
qmd --index my-index query "How do I unpack EMI archives?" -n 3 --json
  4. Confirm the server receives requests for each of the following (the curl checks after this list can help):
    • chat completions
    • embeddings
    • rerank
  5. If your reranker has tighter request limits, verify the query still succeeds and that rerank requests continue after the first oversized split when needed.
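
You can also hit the endpoints directly to confirm the server side is wired up. Two quick curl checks (the /v1/rerank body follows the common Cohere/llama.cpp-style convention; adjust if your server differs):

curl http://127.0.0.1:8080/v1/embeddings \
  -H "Authorization: Bearer your-local-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qmd-embed", "input": "hello world"}'

curl http://127.0.0.1:8080/v1/rerank \
  -H "Authorization: Bearer your-local-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qmd-rerank", "query": "unpack EMI archives", "documents": ["doc one", "doc two"]}'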

@loopyd changed the title from "Add OpenAI-compatible llama-swap backend support" to "Add OpenAI-compatible backend support" on May 2, 2026
@loopyd changed the title from "Add OpenAI-compatible backend support" to "feat(Add OpenAI-compatible backend support)" on May 2, 2026