## Problem

QMD currently assumes a local `llama.cpp`-style setup for generation, embeddings, and reranking, and that setup is baked in. This prevents any user customization of the llama.cpp layer (for example, using TheTom's turboquant fork) and doesn't integrate well with existing homelab servers, since QMD tries to run everything locally on one machine. That makes it hard to use QMD with a local OpenAI-compatible server setup such as `llama-swap`, which is what I tested this PR with: even when the server already exposes the same models through `/v1/chat/completions`, `/v1/embeddings`, and `/v1/rerank`, QMD couldn't use them. Now it can!

## Solution
This PR adds an OpenAI-compatible backend alongside the existing llama.cpp one, letting QMD talk to a local compatible server instead of requiring direct local model access. You can now run QMD on your laptop while your homelab does the tensor crunching!
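For context, here is a minimal sketch of the kind of requests such a backend issues. This is illustrative TypeScript, not the PR's actual code; the helper names, types, and example values are mine:

```ts
// Illustrative only: the shape of calls an OpenAI-compatible backend makes.
type LlmConfig = { baseUrl: string; apiKey?: string };

async function post<T>(cfg: LlmConfig, path: string, body: unknown): Promise<T> {
  const res = await fetch(`${cfg.baseUrl}${path}`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Local servers often ignore auth, but send it when configured.
      ...(cfg.apiKey ? { authorization: `Bearer ${cfg.apiKey}` } : {}),
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

async function demo() {
  const cfg: LlmConfig = { baseUrl: "http://homelab:8080/v1", apiKey: "sk-local" };
  // Generation, embedding, and reranking all go to the same server:
  await post(cfg, "/chat/completions", {
    model: "qmd-generate",
    messages: [{ role: "user", content: "Summarize this note." }],
  });
  await post(cfg, "/embeddings", { model: "qmd-embed", input: ["a chunk of text"] });
  await post(cfg, "/rerank", {
    model: "qmd-rerank",
    query: "unpack EMI archives",
    documents: ["doc a", "doc b"],
  });
}

demo().catch(console.error);
```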
It also makes the CLI and store paths respect configured model aliases, so users can route QMD generation, embedding, and reranking through named server-side models (e.g. `qmd-generate`, `qmd-embed`, and `qmd-rerank`), which is how I set up my server to test this PR.

## What's Changed?
- New `llm.provider`, `llm.baseUrl`, and `llm.apiKey` config options for pointing QMD at an OpenAI-compatible server (see the config example below).
- `qmd embed` now uses the configured embedding alias instead of forcing the built-in default model name.

## Testing
### Automated
```
npx vitest run test/llm.test.ts -t "recovers from oversized rerank requests by splitting and truncating"
npx vitest run test/llm.test.ts -t "rerank maps remote indices back to source files"
```

### Manual
You can test this with `llama-swap` or any server that exposes OpenAI-compatible chat, embedding, and rerank endpoints.

### Option A: Use llama-swap
Configure the server with the model aliases `qmd-generate`, `qmd-embed`, and `qmd-rerank`, and make sure it serves:

- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `POST /v1/rerank`

A sketch of a matching `llama-swap` config follows this list.
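For example, a `llama-swap` config along these lines should expose all three aliases. The model paths and `llama-server` flags here are assumptions for illustration, so adapt them to your models:

```yaml
# config.yaml for llama-swap — illustrative; paths, flags, and models are assumed.
models:
  "qmd-generate":
    cmd: llama-server --port ${PORT} -m /models/generate-model.gguf
  "qmd-embed":
    cmd: llama-server --port ${PORT} -m /models/embed-model.gguf --embedding
  "qmd-rerank":
    cmd: llama-server --port ${PORT} -m /models/rerank-model.gguf --reranking
```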
### Option B: Roll your own compatible server

Any server that implements the same three endpoints and accepts the same model names will work; a toy stub is sketched below.
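As a minimal, hypothetical example (not part of this PR), a Bun script like the following answers all three routes with fixed responses, which is enough to watch what QMD sends:

```ts
// stub-server.ts — toy OpenAI-compatible stub for manual testing (illustrative only).
// Run with: bun run stub-server.ts
Bun.serve({
  port: 8080,
  async fetch(req) {
    const { pathname } = new URL(req.url);
    console.log(req.method, pathname); // watch which routes QMD hits

    if (pathname === "/v1/chat/completions") {
      return Response.json({
        choices: [
          { index: 0, message: { role: "assistant", content: "stub answer" }, finish_reason: "stop" },
        ],
      });
    }
    if (pathname === "/v1/embeddings") {
      const { input } = await req.json();
      const inputs = Array.isArray(input) ? input : [input];
      return Response.json({
        data: inputs.map((_, i) => ({ index: i, embedding: new Array(8).fill(0) })),
      });
    }
    if (pathname === "/v1/rerank") {
      const { documents } = await req.json();
      return Response.json({
        results: documents.map((_, i) => ({ index: i, relevance_score: 0.5 })),
      });
    }
    return new Response("not found", { status: 404 });
  },
});
```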
### QMD config example

Create or update your QMD config:
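The `llm.*` keys below are the ones this PR adds; the file format shown (YAML) and the per-task alias key names are assumptions, so check the shipped schema if they differ:

```yaml
llm:
  provider: openai                  # select the new OpenAI-compatible backend (value assumed)
  baseUrl: http://homelab:8080/v1   # base URL of your server (example value)
  apiKey: sk-local                  # whatever your server expects; often ignored locally
  # Route each task through a named server-side model (key names hypothetical):
  generateModel: qmd-generate
  embedModel: qmd-embed
  rerankModel: qmd-rerank
```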
### Verify the flow
Run a query:

```
qmd --index my-index query "How do I unpack EMI archives?" -n 3 --json
```

Confirm the server receives requests for:

- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `POST /v1/rerank`
If your reranker has tighter request limits, verify that the query still succeeds and that rerank requests keep flowing after the first oversized request is split.
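The recovery behavior the automated test exercises presumably looks something like this sketch; the function, its signature, and the limit below are illustrative, not QMD's actual code:

```ts
// Hedged sketch: when the server rejects a rerank request as too large, split
// the document batch and retry each half; truncate a single oversized document.
async function rerankWithRecovery(
  rerank: (query: string, docs: string[]) => Promise<number[]>, // returns scores
  query: string,
  docs: string[],
  maxDocChars = 8_000,
): Promise<number[]> {
  try {
    return await rerank(query, docs);
  } catch {
    if (docs.length === 1) {
      // A single document that is still too big gets truncated and retried once.
      return rerank(query, [docs[0].slice(0, maxDocChars)]);
    }
    // Otherwise split the batch and recurse; scores keep their original order,
    // so remote indices still map back to the right source files.
    const mid = Math.ceil(docs.length / 2);
    const [left, right] = await Promise.all([
      rerankWithRecovery(rerank, query, docs.slice(0, mid), maxDocChars),
      rerankWithRecovery(rerank, query, docs.slice(mid), maxDocChars),
    ]);
    return [...left, ...right];
  }
}
```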