feat: recipe-aware model compatibility for HF search and cache discovery #1390

Draft
ianbmacdonald wants to merge 11 commits into lemonade-sdk:main from ianbmacdonald:feature/recipe-aware-model-compatibility

Conversation


@ianbmacdonald ianbmacdonald commented Mar 17, 2026

Summary

This builds on the recently added Hugging Face search support for GGUFs and extends search across all backends.

Some of the newer backends are currently pinned to a small set of authors. That still provides value: it can surface models that are not yet in the curated release set, such as newer Qwen3.5 models for FLM at the time of writing. It also gives us room to expand coverage as backend support evolves, for example as whisper.cpp gains support for newer GGUF formats.

Backend Strategy
  • llamacpp: filter=gguf
  • sd-cpp: filter=safetensors,text-to-image
  • kokoro: filter=onnx,text-to-speech
  • whispercpp: author=ggerganov (pinned)
  • flm: author=FastFlowLM (pinned)
  • ryzenai-llm: author=amd + filter=onnx
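As a hypothetical sketch, the strategy above maps to per-backend HF search parameters roughly like this. The `BACKEND_SEARCH` table and `search_params` helper are illustrative names, not the shipped code:

```python
# Illustrative per-backend HF search parameters mirroring the strategy table.
BACKEND_SEARCH = {
    "llamacpp":    {"filter": "gguf"},
    "sd-cpp":      {"filter": "safetensors,text-to-image"},
    "kokoro":      {"filter": "onnx,text-to-speech"},
    "whispercpp":  {"author": "ggerganov"},    # pinned provider
    "flm":         {"author": "FastFlowLM"},   # pinned provider
    "ryzenai-llm": {"author": "amd", "filter": "onnx"},
}

def search_params(backend: str, query: str) -> dict:
    """Merge the user query with the backend's pinned author/format filters."""
    params = {"search": query, "sort": "downloads", "direction": "-1"}
    params.update(BACKEND_SEARCH.get(backend, {}))
    return params
```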

A helper script, tests/hf_model_tags.py, queries the HF API and reports pipeline_tag plus relevant library, task, and format tags for suggested models or any HF model ID. It is useful for building and refining backend filter logic, especially for evolving image and audio cases.

For example, llama.cpp is currently treated as “all GGUF except unsupported tasks.” If overlap with other GGUF-consuming backends becomes too broad, we can invert that logic and move to a more additive model. Analyzing the current suggested models reveals tag options for a more targeted approach; note that 15 models have no pipeline tags upstream.

~/src/lemonade/test$ python3 hf_model_tags.py --summary --llamacpp
...
============================================================
TAG SUMMARY BY RECIPE
============================================================

[llamacpp] (57 models)
   pipeline tags: (none)×15, image-text-to-text, sentence-similarity, text-generation, text-ranking
         formats: gguf
           tasks: conversational, feature-extraction, image-text-to-text, sentence-similarity, text-generation, text-ranking
       libraries: (none)×15, llama.cpp, pytorch, sentence-transformers, transformers, transformers.js, vllm
           other: af, am, ar, az, ba, be, bg, bn, bs, ca, ce, chat, co, code, codeqwen, cross-encoder, cs, custom_code, cy, da, de, deepseek, deploy:azure, dv, edge, el, en, endpoints_compatible, eo, es, et, eu, fa, facebook, fi, fr, fy, ga, gd, gemma, gemma3, gguf-my-repo, gl, gn, google, gpt_oss, granite-4.0, gu, gv, ha, he, hi, hr, ht, hu, hy, id, ig, image-generation, imatrix, it, ja, jv, km, kn, ko, ku, ky, la, language, lfm2, lfm2.5, liquid, llama, llama-3, llama-4, llama-cpp, llama4, lo, lt, lv, math, meta, mg, mi, microsoft, mistral-common, mk, ml, mn, moe, mr, ms, multilingual, mxfp4, my, ne, nl, nlp, nn, no, nvidia, ny, openai, pa, phi, phi3, phi4, pl, prompt-compression, prompt-engineering, prompt-expansion, ps, pt, q4_k_m, quantized, qwen, qwen-coder, qwen3, qwen3_5_moe, qwen3_moe, qwen3_next, reranker, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, text-embeddings-inference, tg, th, tl, tn, tr, ug, uk, unsloth, ur, uz, vi, xh, yi, yo, zh, zu

If the GGUF filter combined with task filters draws in too many unsupported models while still missing some that lack pipeline tags, the logic might evolve into something like:

(hf_api_formats = GGUF) AND (hf_api_pipeline_tags_set OR hf_api_task_tags_set OR hf_api_other_tags_set)
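A minimal Python sketch of that predicate, assuming the tag groupings reported by hf_model_tags.py; the allow-list sets below are placeholders, not the real filter lists:

```python
# Placeholder allow-lists; the real sets would be derived from
# hf_model_tags.py output like the summary above.
SUPPORTED_PIPELINE_TAGS = {"text-generation", "image-text-to-text",
                           "sentence-similarity", "text-ranking"}
SUPPORTED_TASK_TAGS = {"conversational", "feature-extraction"}
SUPPORTED_OTHER_TAGS = {"llama-cpp", "gguf-my-repo", "imatrix"}

def passes_gguf_filter(model: dict) -> bool:
    """(formats = GGUF) AND (pipeline OR task OR other tags recognized)."""
    if "gguf" not in model.get("formats", []):
        return False
    return bool(set(model.get("pipeline_tags", [])) & SUPPORTED_PIPELINE_TAGS
                or set(model.get("task_tags", [])) & SUPPORTED_TASK_TAGS
                or set(model.get("other_tags", [])) & SUPPORTED_OTHER_TAGS)
```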

Model Repo Links

HF search often requires opening the repo page to inspect quant sizes, runtime flags, chat template details, and example usage. This PR adds the repo link to each searched model so users can quickly click through and validate details in a new tab.

HF search proxy with auth

  • Adds a GET /hf/search proxy that forwards server-side HF_TOKEN, doubling the default rate limit from 500 to 1000 requests per 5 minutes
  • Adds cursor-based pagination with ‹ N › controls
  • Adds adaptive cooldown behavior based on the RateLimit response header
  • Improves rate-limit messaging by showing the exact retry time and suggesting HF_TOKEN when unauthenticated
  • Whitelists the author parameter in the proxy for pinned-provider queries
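The adaptive cooldown could be sketched like this, assuming a `remaining=N, reset=S` item format in the RateLimit header (the exact header shape HF returns may differ):

```python
import re

def cooldown_seconds(ratelimit_header: str, floor: float = 0.0) -> float:
    """Sketch of adaptive cooldown from a RateLimit-style response header.
    Waits out the window when the quota is exhausted, and otherwise spreads
    remaining requests across the time left before reset."""
    fields = dict(re.findall(r"(\w+)=(\d+)", ratelimit_header or ""))
    remaining = int(fields.get("remaining", 1))
    reset = int(fields.get("reset", 0))
    if remaining <= 0:
        return float(reset)                      # quota exhausted
    return max(floor, reset / (remaining + 1))   # pace remaining requests
```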

HF Cache

The HF cache is often shared across tools and services. Exposing it in the model manager lets Lemonade discover models that were added outside Lemonade and register them without re-downloading. This is especially useful on systems with shared model stores, multi-OS setups, or large mounted repositories where users want to “take models off the shelf” into Lemonade and optionally leave the files in place when removing them later.

This PR adds HF cache discovery and a UI option to keep cached files when removing a model from Lemonade.

  • Adds GET /cache/models to scan the local HF cache for downloaded models that are not yet registered
  • Lets users register cached models without re-downloading, which is especially useful for testing recipe setups
  • Adds a FROM HF CACHE section in the UI with quant dropdowns, recipe badges, size display, and one-click registration
  • Rolls up multiple quants from the same author, which is helpful when comparing quants and promoting one into Lemonade
  • Handles symlinked HF cache layouts, sharded models, root-level and folder-based layouts, and mmproj detection
  • Skips downloads for models already present in cache
  • Adds a server-side keep_files option so removal can unregister the model without deleting cached files
  • Makes removed models reappear in the HF cache section for easy re-registration
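The cache scan can be sketched against the standard hub layout (`models--<org>--<name>/snapshots/<rev>/` with symlinked blobs); `scan_hf_cache` below is illustrative, not the C++ implementation in model_manager.cpp:

```python
from pathlib import Path

def scan_hf_cache(cache_dir: str) -> list[dict]:
    """Sketch of HF cache discovery: walk repo directories, resolve
    symlinks, and sum real file sizes for each cached model."""
    found = []
    for repo_dir in sorted(Path(cache_dir).glob("models--*--*")):
        _, org, name = repo_dir.name.split("--", 2)
        snapshots = repo_dir / "snapshots"
        if not snapshots.is_dir():
            continue
        files = [p for p in snapshots.rglob("*") if p.is_file() or p.is_symlink()]
        size = sum(p.resolve().stat().st_size
                   for p in files if p.resolve().is_file())
        found.append({"model_id": f"{org}/{name}", "size_bytes": size,
                      "files": [p.name for p in files]})
    return found
```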

Filter Dialog

The filter dialog now supports including or excluding any backend from search, so users can narrow results to the backends they care about.

It also makes the active filter state more visible by reflecting both default behavior and current backend availability. Backends that are not active are automatically excluded from search. Users can also hide the new HF Cache and HF Online sections to restore the previous suggested-models-only layout.

  • Backend chips are color-coded by state: green = installed, yellow = available, red = unsupported

Model Quants

Improves quant detection and ordering across flat and nested folder layouts.

  • Adds MXFP4 recognition across all quant regexes
  • Expands regex support for compact forms like q4k and extended names like UD-Q3_K_XL
  • Adds UD quant support
  • Fixes ordering to match HF’s ascending bit-depth progression
  • Deduplicates root-level sharded GGUF files in the quant dropdown
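A rough sketch of the detection and ordering logic; the regex and bit-depth table are placeholders, not the shipped regexes:

```python
import re

# Ascending bit-depth ranking matching the HF ordering described above.
QUANT_ORDER = ["IQ1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q8", "BF16", "F16", "F32"]
QUANT_RE = re.compile(
    r"(UD-)?(IQ\d|Q\d(?:_?K(?:_[SML]|_XL)?|_0|_1)?|MXFP4|BF16|F16|F32)", re.I)

def quant_sort_key(filename: str):
    """Sort key: bit-depth rank first, with UD variants right after
    their non-UD equivalent. Unrecognized quants sort last."""
    m = QUANT_RE.search(filename)
    if not m:
        return (len(QUANT_ORDER), False, filename)
    ud, quant = bool(m.group(1)), m.group(2).upper()
    base = next((i for i, p in enumerate(QUANT_ORDER) if quant.startswith(p)),
                len(QUANT_ORDER))
    return (base, ud, quant)
```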

Core: recipe-aware classification

  • Adds recipeCompatibility.ts with classifyModel(), which prioritizes Hugging Face pipeline_tag metadata over file-format-only detection
  • Adds RECIPE_FORMATS to gate classification by file-format compatibility, for example preventing sd-cpp from matching GGUF models
  • Uses a three-pass classification flow: pipeline_tag → repo tags → name pattern matching
  • Introduces four confidence levels: supported, likely, experimental, and incompatible
  • Marks experimental models with a yellow ? badge and requires confirmation before install
  • Aligns badging to backend compatibility rather than raw file format
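The three-pass flow might look like this in outline; the mapping tables below are abbreviated placeholders, and the real logic lives in recipeCompatibility.ts:

```python
import re

# Abbreviated placeholder maps for illustration only.
PIPELINE_RECIPE = {"text-generation": "llamacpp",
                   "image-text-to-text": "llamacpp",
                   "text-to-image": "sd-cpp",
                   "text-to-speech": "kokoro",
                   "automatic-speech-recognition": "whispercpp"}
TAG_RECIPE = {"image-generation": "sd-cpp", "conversational": "llamacpp"}
NAME_RECIPE = [(re.compile(r"whisper", re.I), "whispercpp"),
               (re.compile(r"embed|rerank", re.I), "llamacpp")]

def classify_model(model: dict) -> tuple:
    """Returns (recipe, confidence) using the three-pass flow:
    pipeline_tag -> repo tags -> name pattern matching."""
    tag = model.get("pipeline_tag")
    if tag in PIPELINE_RECIPE:
        return PIPELINE_RECIPE[tag], "supported"     # pass 1: pipeline_tag
    for t in model.get("tags", []):                  # pass 2: repo tags
        if t in TAG_RECIPE:
            return TAG_RECIPE[t], "likely"
    for pat, recipe in NAME_RECIPE:                  # pass 3: name patterns
        if pat.search(model.get("id", "")):
            return recipe, "experimental"
    return None, "incompatible"
```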

Vision model (mmproj) handling

  • Returns mmproj_files from the cache endpoint for vision model detection
  • If multiple mmproj files are found, opens the Add Model dialog for user selection
  • If exactly one mmproj file is found, auto-selects it and installs directly
  • Prefers BF16 > F16 > F32 when picking a default mmproj
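The preference order reduces to a simple substring scan; `pick_default_mmproj` is a sketch, not the shipped TypeScript:

```python
def pick_default_mmproj(files):
    """Sketch of the BF16 > F16 > F32 default mmproj preference.
    Checking 'bf16' before 'f16' matters, since 'f16' is a substring
    of any BF16 filename."""
    for pref in ("bf16", "f16", "f32"):
        for f in files:
            if pref in f.lower():
                return f
    return files[0] if files else None
```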

Other UX improvements

  • Single-quant models now show the quant in the dropdown for consistency with multi-quant models
  • Adds a recipe mismatch warning in the Add Model panel when the checkpoint suggests a different modality
  • Fixes the [object Object] error toast for model load failures
  • Shows “No more results” on later pages instead of “No compatible models found”

Backend fixes

  • Computes model sizes from actual files on disk for user-registered models, including sharded models
  • Updates HttpClient::get to capture response headers for rate-limit parsing
  • Skips download after user registration when files already exist in the HF cache

Files changed

File Description
New: src/app/src/renderer/utils/recipeCompatibility.ts Task-to-recipe mapping and model classification
src/app/src/renderer/ModelManager.tsx HF cache section, pagination, proxy integration, vision handling
src/app/src/renderer/AddModelPanel.tsx mmproj default selection, recipe mismatch warning
src/app/src/renderer/ConfirmDialog.tsx Optional checkbox support for keep-files
src/app/src/renderer/components/ConnectedBackendRow.tsx Updated for new confirm dialog return type
src/app/src/renderer/utils/backendInstaller.ts labels field, keep_files param, error message fix
src/app/styles.css Experimental badge, pagination, cooldown animation, warning styles, checkbox
src/cpp/include/lemon/model_manager.h discover_hf_cache_models(), delete_model(keep_files)
src/cpp/server/model_manager.cpp Cache discovery, GGUF path resolver for sd-cpp, size computation, skip-download logic
src/cpp/server/server.cpp /cache/models and /hf/search endpoints, keep_files support
src/cpp/server/utils/http_client.cpp Response header capture for rate-limit parsing

Test plan

  • Search for stable-diffusion — SD models should show the sd.cpp badge, not llama.cpp
  • Search for whisper — only .bin-based whisper models should be shown; GGUF-only models should be hidden
  • Search for Qwen3-VL — vision models should show the llama.cpp badge and detect mmproj
  • HF cache section shows unregistered models with correct recipe badges
  • Adding from cache with mmproj files opens the edit dialog when multiple files are present, or installs directly when only one exists
  • Deleting a user model with Keep files checked makes the model reappear in the cache section
  • Pagination ‹ › controls work, and rate-limit cooldown adapts to remaining quota
  • HF_TOKEN in the environment doubles the rate limit; verify authenticated: true in the proxy response
  • Existing curated models, including Whisper-Large-v3-Turbo, still load correctly
  • Search "qwen" with only llamacpp chip → only GGUF results
  • Enable flm chip, search "qwen" → FastFlowLM Qwen NPU models appear alongside GGUF
  • Enable kokoro chip, search "kokoro" → ONNX TTS models appear
  • Backend chips show correct colors based on system hardware
  • Unsupported backends default to disabled, can be toggled on (red chip)

Tested on Ubuntu 26.04 via the web interface using lemonade-server.


🤖 Generated with Claude Code using the 1M context window on Opus with no compaction

…ery (lemonade-sdk#1381)

Resolves lemonade-sdk#1381 — Model search was showing incompatible non-LLM models
because all GGUF files were blindly routed to llama.cpp. This PR
introduces task-first model classification that separates format, task,
and backend/recipe into distinct concepts.

## Recipe-aware classification

- New `recipeCompatibility.ts` module with `classifyModel()` that
  prioritizes HuggingFace `pipeline_tag` over file format detection
- Supports `text-generation`, `image-text-to-text`, `text-to-image`,
  `automatic-speech-recognition`, `text-to-speech` pipeline tags
- Four confidence levels: supported, likely, experimental, incompatible
- Experimental models show yellow badge with `?` and require
  confirmation before install

## HF cache discovery

- New `GET /cache/models` endpoint scans the HF cache directory for
  downloaded models not yet registered in the model registry
- Frontend "FROM HF CACHE" section with quant dropdowns, recipe badges,
  and one-click registration
- Handles symlinked HF cache layouts (canonical path resolution)
- Skips re-download for models already present in cache
- Properly groups sharded models (folder-based and root-level)

## HF search improvements

- New `GET /hf/search` proxy endpoint passes `HF_TOKEN` from server
  environment for doubled rate limits (500 → 1000 req/5min)
- Cursor-based pagination with `‹ N ›` controls
- Adaptive rate limiting cooldown based on `RateLimit` response header
- Rate limit message shows exact retry time and suggests HF_TOKEN
- Model names are clickable links to HuggingFace pages

## Whisper.cpp handling

- Whisper models use `.bin` files for quant selection (not `.gguf`)
- GGUF-only whisper models are hidden until whisper.cpp GGUF support
- Removed whisper from GGUF path resolver (was breaking `.bin` models)

## SD.cpp / image model handling

- SD models with GGUF files correctly route to `sd-cpp` (not `llamacpp`)
- GGUF path resolver extended to `sd-cpp` for variant matching

## Vision model (mmproj) handling

- Cache endpoint returns `mmproj_files` for vision model detection
- Multiple mmproj files → opens Add Model dialog for user selection
- Single mmproj → auto-selects and installs directly
- Default mmproj preference: BF16 > F16 > F32
- Recipe mismatch warning in Add Model panel

## Model deletion with keep-files option

- Delete dialog for `user.*` models shows "Keep downloaded files in HF
  cache" checkbox (unchecked by default)
- Server accepts `keep_files` parameter — removes from registry only
- Model reappears in HF cache section for easy re-registration

## Additional fixes

- Fixed `[object Object]` error toast for load failures (server returns
  nested error objects)
- Model sizes computed from actual files on disk for user-registered
  models (including sharded models)
- MXFP4 quantization format recognized in all quant regexes
- Quant regex widened for compact forms (q4k) and extended names
  (UD-Q3_K_XL)
- Root-level sharded GGUF files deduplicated in quant dropdown
- Non-quant folders (e.g. `whisper.cpp/`) expand to individual files
- Confirm dialog supports optional checkbox with controlled state
- HttpClient::get now captures response headers (for rate limit parsing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald marked this pull request as ready for review March 17, 2026 07:47
ianbmacdonald and others added 7 commits March 17, 2026 11:15
Rename ConfirmCheckbox/checkbox/checkboxChecked to
KeepFilesOption/keepFilesOption/keepFiles throughout ConfirmDialog
and its callers to better reflect the HF cache preservation feature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HF cache discovery state below HF search state instead of
interleaved in the middle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server-side: discover_hf_cache_models() fetches pipeline_tag from HF API
for each cached model, enabling accurate recipe classification.

Frontend: cache models grouped by provider slug (e.g., "unsloth (4)")
using collapsible sections. Extracted CacheModelInfo interface and
renderCacheModelItem/renderCacheProviderGroup functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…earch fixes

- Recipe filter chips (llama.cpp, sd.cpp, whisper.cpp, Kokoro, FLM,
  RyzenAI) apply across all three sections: suggested, cache, and search
- Section visibility toggles to show/hide suggested, HF cache, and
  HF search independently
- Filter icon turns green when non-default filters are active
- Multi-word search matches each word independently ("Qwen Image"
  matches "Qwen-Image-GGUF")
- Fix URL encoding in HF search proxy for queries with spaces
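The independent-word matching can be sketched as (illustrative helper name):

```python
def matches_query(name: str, query: str) -> bool:
    """Each whitespace-separated query word must match independently (AND),
    so "Qwen Image" matches "Qwen-Image-GGUF"."""
    return all(w.lower() in name.lower() for w in query.split())
```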

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add TASK_RECIPE_MAP entries for embedding models (sentence-similarity,
feature-extraction pipeline tags) and reranking models (text-ranking)
so they route to llamacpp instead of being marked incompatible.

Expand sd-cpp hfTags with image-generation and image-editing to catch
FLUX models via repository tags.

Add name pattern fallbacks: /embed/i, /nomic/i, /rerank/i.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Sort quantizations by bit-depth (ascending) matching HuggingFace
  ordering: IQ1 → Q2 → Q3 → Q4 → Q5 → Q6 → Q8 → BF16/F16/F32
- UD (Unsloth Dynamic) quants labeled with (UD) suffix for clarity
- UD variants sort right after their non-UD equivalent
- Default quant selection prefers Q4_K_M when available
- Tooltip shows count when >10 quants ("scroll for more")
- URL encode search proxy params (fixes spaces in queries)
- Green filter icon when non-default filters active
- Multi-word search matches each word independently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GGUF path resolver was catching sd-cpp models with safetensors
checkpoints (e.g. Z-Image-Turbo), finding no .gguf files, and returning
the directory path instead of falling through to the generic resolver.

Now skips the GGUF resolver when the variant explicitly contains a
non-GGUF extension (.safetensors, .onnx, .bin).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald marked this pull request as draft March 18, 2026 17:28
Resolve merge conflicts with main, keeping recipe-aware filtering.

HF Search:
- Issue parallel search queries per enabled recipe backend
- llamacpp: filter=gguf, sd-cpp: filter=safetensors,text-to-image,
  kokoro: filter=onnx,text-to-speech
- Pinned providers: whispercpp→ggerganov, flm→FastFlowLM,
  ryzenai-llm→amd (surfaces new models without registry updates)
- Merge, deduplicate, sort by downloads across all backends
- Per-recipe cursor tracking for correct multi-backend pagination
- Pre-filter: skip detectBackend for models with unsupported
  pipeline_tag (saves 2 HF API calls per incompatible model)
- Add author param to HF search proxy whitelist
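The merge/dedupe/sort step might look like this in outline (`merge_results` is an illustrative name, not the shipped code):

```python
def merge_results(per_backend: dict) -> list:
    """Sketch of merging parallel per-backend search results:
    deduplicate by model id (first backend seen wins) and sort the
    merged list by downloads, descending."""
    seen, merged = set(), []
    for backend, models in per_backend.items():
        for m in models:
            if m["id"] not in seen:
                seen.add(m["id"])
                merged.append({**m, "backend": backend})
    return sorted(merged, key=lambda m: m.get("downloads", 0), reverse=True)
```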

Format gating (recipeCompatibility.ts):
- RECIPE_FORMATS map defines supported file formats per backend
- hasRequiredFormat() gates classification by format compatibility
  (e.g. sd-cpp requires safetensors, not gguf)
- Uses HF format tags with file extension fallback
- SUPPORTED_PIPELINE_TAGS exported for pre-filter
- Add translation and image-to-text to LLM_PIPELINE_TAGS

Filter chips:
- Color-coded by backend state: green (installed), yellow
  (available/installable), red (unsupported)
- Inactive chips show subtle state tint (unsupported distinguishable
  from deselected)
- Default to only viable backends on system detection
- Filter indicator compares against viable count

Test tooling:
- New test/hf_model_tags.py: HF model tag analysis with --detect,
  --summary, per-recipe flags, rate limiting, HF_TOKEN support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald marked this pull request as ready for review March 19, 2026 06:00
These Python-only quantization formats require bitsandbytes/autoawq/auto-gptq
runtimes and cannot be loaded by any C++ backend (llamacpp, sd-cpp, etc.).
Check model ID for bnb/awq/gptq markers before classification to avoid
false positives (e.g. bnb-4bit safetensors models classified as sd-cpp).
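The guard reduces to a marker check on the model ID; the regex below is a sketch, not the shipped pattern:

```python
import re

# Markers for Python-only quantization runtimes (bitsandbytes/AWQ/GPTQ).
PYTHON_ONLY_QUANT = re.compile(r"\b(bnb|awq|gptq)\b|autoawq|auto-gptq", re.I)

def is_python_only_quant(model_id: str) -> bool:
    """Pre-classification guard: these checkpoints need Python runtimes
    and cannot load in any C++ backend, so skip classification entirely."""
    return bool(PYTHON_ONLY_QUANT.search(model_id))
```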

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald ianbmacdonald marked this pull request as draft March 19, 2026 14:41
@ianbmacdonald

Search heuristics on Hugging Face models are a Pandora's box until the newer backends support more model nuances: sd.cpp [almost] works with Qwen-Image, just not on ROCm (goes to black halfway through the layers); whisper.cpp almost works with GGUF but not yet; and FLM is built around a single-user/Windows use case and doesn't support the HF ecosystem properly (FastFlowLM/FastFlowLM#406), so dynamic search-and-add for new models doesn't fit the use case for a shared model shelf unless you are dragging around the proprietary folder format. Moving this back to draft. I may cherry-pick a PR for the HF cache piece, which is the best feature IMHO for anyone sharing model shelves between users, OSs, etc., but I'm gone for the next couple of weeks. @jeremyfowers

Move "Downloaded only" toggle above section toggles for better discoverability,
rename "Group suggested models" to "Suggested models".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
