
Add remote embedding, reranking, and query expansion support #629

Open

georgelichen wants to merge 5 commits into tobi:main from georgelichen:merge-pr-517-remote-llm

Conversation

@georgelichen

This PR ports the remote embedding / reranking work from PR #517 onto the current upstream main, and includes the follow-up fixes needed to make it usable in the current tree.

What is included

  • add OpenAI-compatible remote embedding support (configuration sketched after this list)
  • add OpenAI-compatible remote reranking support
  • add remote query expansion support via chat completions
  • fix drift against current main (build/model label integration fixes)
  • avoid initializing local node-llama-cpp during qmd embed when remote embedding is configured
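
For context, a rough sketch of how the remote embedding endpoint gets picked up from the environment. QMD_EMBED_API_URL and QMD_EMBED_API_MODEL are the variables named in the first commit below; the settings shape and the QMD_EMBED_API_KEY variable are illustrative assumptions, not the exact code in this branch.

```ts
// Sketch only: resolve remote embedding settings from env vars.
// QMD_EMBED_API_URL / QMD_EMBED_API_MODEL come from this PR's commit messages;
// the field names and QMD_EMBED_API_KEY are assumptions for illustration.
interface RemoteEmbedSettings {
  apiUrl: string;   // OpenAI-compatible base URL (vLLM, Ollama, LM Studio, OpenAI)
  model: string;    // embedding model id served by that endpoint
  apiKey?: string;  // optional bearer token for the Authorization header
}

function remoteEmbedSettingsFromEnv(): RemoteEmbedSettings | undefined {
  const apiUrl = process.env.QMD_EMBED_API_URL;
  const model = process.env.QMD_EMBED_API_MODEL;
  if (!apiUrl || !model) return undefined; // unset → qmd keeps using the local embedding path
  return { apiUrl, model, apiKey: process.env.QMD_EMBED_API_KEY };
}
```

When this resolves to nothing, qmd embed keeps its existing local path; when both variables are set, embeddings go to the remote server and node-llama-cpp is no longer initialized for indexing (see the last commit below).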

Why this PR exists

PR #517 was based on an older branch state. This branch rebases the feature set onto the current upstream main and resolves the integration drift, so the remote LLM path can be reviewed against today's tree.

Verification

  • npx tsc -p tsconfig.build.json
  • npx vitest run --reporter=verbose test/remote-llm.test.ts test/remote-llm-integration.test.ts
  • npx vitest run test/store.test.ts -t "generateEmbeddings" --reporter=verbose
  • npx vitest run test/store.test.ts -t "Token chunking guardrails" --reporter=verbose

Notes

  • Query expansion uses expand_api_* when configured; otherwise a normal qmd query "..." still falls back to local expansion (decision sketched after these notes).
  • Structured queries (intent:/lex:/vec:/hyde:) skip auto expansion entirely.
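
A compact sketch of that gate; the prefixes are from the note above, and the function name is made up for illustration:

```ts
// Sketch of the auto-expansion gate described in the notes above.
// Structured queries (intent:/lex:/vec:/hyde:) skip auto expansion entirely;
// everything else is expanded remotely when expand_api_* is configured,
// otherwise by the local model.
const STRUCTURED_PREFIXES = ["intent:", "lex:", "vec:", "hyde:"] as const;

function shouldAutoExpand(query: string): boolean {
  const q = query.trimStart();
  return !STRUCTURED_PREFIXES.some((prefix) => q.startsWith(prefix));
}
```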

Jim Smith and others added 5 commits April 12, 2026 18:26
Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
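
For reviewers, a trimmed sketch of the routing this commit introduces. The interfaces below are stand-ins rather than the real LLM interface, and the method signatures beyond the names mentioned above (embedBatch, rerank, expandQuery) are assumptions.

```ts
// Trimmed sketch of the hybrid routing described in this commit:
// embed/rerank go to the remote OpenAI-compatible server, generate and query
// expansion stay on the local model. Stand-in interfaces; real signatures differ.
interface RemoteBackendSketch {
  embedBatch(texts: string[]): Promise<number[][]>;
  rerank(query: string, docs: string[]): Promise<number[]>;
}
interface LocalBackendSketch {
  generate(prompt: string): Promise<string>;
  expandQuery(query: string, opts?: { intent?: string }): Promise<string[]>;
}

class HybridLLMSketch {
  constructor(private remote: RemoteBackendSketch, private local: LocalBackendSketch) {}

  embedBatch(texts: string[]) { return this.remote.embedBatch(texts); }             // remote
  rerank(query: string, docs: string[]) { return this.remote.rerank(query, docs); } // remote
  generate(prompt: string) { return this.local.generate(prompt); }                  // local
  expandQuery(query: string, opts?: { intent?: string }) {
    return this.local.expandQuery(query, opts); // local here; a later commit routes this remotely
  }
}
```

Per the commit message, the real RemoteLLM additionally wraps the remote calls in a circuit breaker, validates embedding dimensions, and splits oversized batches.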
- Add intent? to LLM interface and ILLMSession expandQuery signature
  (store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
  getStore() so content_vectors.model reflects the actual LLM in use
  (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
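
Roughly, the two fixes look like this (trimmed to the relevant members; the return type of expandQuery and the stub values are assumptions):

```ts
// Trimmed sketch of the interface fix: expandQuery now declares the optional
// intent that store.ts was already passing.
interface ILLMSessionSketch {
  expandQuery(query: string, opts?: { intent?: string }): Promise<string[]>;
}

// Sketch of the model-label fix: store the embed model actually in use instead
// of always writing the default local model URI. Names follow the commit
// message; the stub only illustrates the fallback.
const DEFAULT_EMBED_MODEL_URI = "hf:default-local-embed-model"; // placeholder value
const getDefaultLLM = (): { embedModelName?: string } => ({}); // stand-in for the real singleton
const embedModelLabel = getDefaultLLM().embedModelName ?? DEFAULT_EMBED_MODEL_URI;
```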
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is
  configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that
  don't share a word with the original query, falls back gracefully on
  bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand,
  otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL /
  QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header
  fallback, lex/vec/hyde parsing, includeLexical=false filtering,
  fallback on bad output, query-term filtering, circuit breaker,
  HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned,
  includeLexical=false, intent incorporation, HybridLLM routing verified
  via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL
  env vars, skipped when absent)
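
A simplified sketch of the response parsing described above. It shows only the lex/vec/hyde line handling and the share-a-word filter; the real parseExpandResponse has stricter fallback rules, and whether the filter applies to all three term types is simplified here.

```ts
// Simplified sketch of parsing expansion output into lex/vec/hyde terms and
// dropping terms that share no word with the original query. The real
// parseExpandResponse has additional fallback behavior on bad model output.
type ExpansionSketch = { lex: string[]; vec: string[]; hyde: string[] };

function parseExpandResponseSketch(raw: string, query: string): ExpansionSketch {
  const queryWords = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  const sharesQueryWord = (term: string) =>
    term.toLowerCase().split(/\s+/).some((w) => queryWords.has(w));

  const out: ExpansionSketch = { lex: [], vec: [], hyde: [] };
  for (const line of raw.split("\n")) {
    const match = line.match(/^(lex|vec|hyde):\s*(.+)$/i);
    if (!match) continue; // ignore malformed lines; caller falls back if nothing parses
    const kind = match[1].toLowerCase() as keyof ExpansionSketch;
    const term = match[2].trim();
    if (sharesQueryWord(term)) out[kind].push(term); // filter applied uniformly here for simplicity
  }
  return out;
}
```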
Merge PR tobi#517 and keep it compatible with the current main branch.

Constraint: Upstream main diverged after PR tobi#517, so a fast-forward merge was not possible
Rejected: Cherry-pick the PR commits directly | would still require the same compatibility fixes and lose merge context
Confidence: medium
Scope-risk: moderate
Directive: Keep RemoteLLM and HybridLLM aligned with the LLM tokenize/detokenize interface and verify Windows CLI wrappers separately from Unix shell scripts
Tested: npx tsc -p tsconfig.build.json; npx vitest run --reporter=verbose test/remote-llm.test.ts test/remote-llm-integration.test.ts
Not-tested: full vitest suite; npm run build wrapper script on Windows; live GitHub Actions
When the active embedding backend is remote, generateEmbeddings now uses
character-space chunking instead of token-based preprocessing. This keeps
qmd embed from initializing node-llama-cpp solely to tokenize input before
calling a remote embedding API.

The change is scoped to indexing. Query-time expansion and reranking keep
their existing routing rules, and a regression test now fails if remote
embedding falls back to local tokenization during indexing.

Constraint: Remote embedding backends do not expose a tokenizer interface in QMD today
Rejected: Change HybridLLM tokenize() globally | would alter query-time behavior and broaden risk unnecessarily
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If remote token-aware chunking is added later, keep qmd embed free of mandatory local llama initialization
Tested: npx tsc -p tsconfig.build.json
Tested: npx vitest run test/store.test.ts -t "generateEmbeddings" --reporter=verbose
Tested: npx vitest run test/store.test.ts -t "Token chunking guardrails" --reporter=verbose
Not-tested: Full end-to-end qmd embed against a live remote embedding service after this code change
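
A minimal sketch of the character-space chunking described above, assuming a fixed window with overlap; the actual chunk size, overlap, and boundary handling in store.ts are not reproduced here.

```ts
// Sketch of character-based chunking used when the embedding backend is remote:
// split on character offsets so no local tokenizer (node-llama-cpp) is required
// during qmd embed. maxChars / overlapChars are placeholder values.
function chunkByCharactersSketch(text: string, maxChars = 2000, overlapChars = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars; // overlap preserves context across chunk boundaries
  }
  return chunks;
}
```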