feat: recipe-aware model compatibility for HF search and cache discovery #1390
ianbmacdonald wants to merge 11 commits into lemonade-sdk:main from
Conversation
…ery (lemonade-sdk#1381)

Resolves lemonade-sdk#1381 — Model search was showing incompatible non-LLM models because all GGUF files were blindly routed to llama.cpp. This PR introduces task-first model classification that separates format, task, and backend/recipe into distinct concepts.

## Recipe-aware classification
- New `recipeCompatibility.ts` module with `classifyModel()` that prioritizes HuggingFace `pipeline_tag` over file format detection
- Supports `text-generation`, `image-text-to-text`, `text-to-image`, `automatic-speech-recognition`, `text-to-speech` pipeline tags
- Four confidence levels: supported, likely, experimental, incompatible
- Experimental models show a yellow badge with `?` and require confirmation before install

## HF cache discovery
- New `GET /cache/models` endpoint scans the HF cache directory for downloaded models not yet registered in the model registry
- Frontend "FROM HF CACHE" section with quant dropdowns, recipe badges, and one-click registration
- Handles symlinked HF cache layouts (canonical path resolution)
- Skips re-download for models already present in cache
- Properly groups sharded models (folder-based and root-level)

## HF search improvements
- New `GET /hf/search` proxy endpoint passes `HF_TOKEN` from the server environment for doubled rate limits (500 → 1000 req/5min)
- Cursor-based pagination with `‹ N ›` controls
- Adaptive rate-limiting cooldown based on the `RateLimit` response header
- Rate limit message shows exact retry time and suggests `HF_TOKEN`
- Model names are clickable links to HuggingFace pages

## Whisper.cpp handling
- Whisper models use `.bin` files for quant selection (not `.gguf`)
- GGUF-only whisper models are hidden until whisper.cpp gains GGUF support
- Removed whisper from the GGUF path resolver (was breaking `.bin` models)

## SD.cpp / image model handling
- SD models with GGUF files correctly route to `sd-cpp` (not `llamacpp`)
- GGUF path resolver extended to `sd-cpp` for variant matching

## Vision model (mmproj) handling
- Cache endpoint returns `mmproj_files` for vision model detection
- Multiple mmproj files → opens the Add Model dialog for user selection
- Single mmproj → auto-selects and installs directly
- Default mmproj preference: BF16 > F16 > F32
- Recipe mismatch warning in the Add Model panel

## Model deletion with keep-files option
- Delete dialog for `user.*` models shows a "Keep downloaded files in HF cache" checkbox (unchecked by default)
- Server accepts a `keep_files` parameter — removes from registry only
- Model reappears in the HF cache section for easy re-registration

## Additional fixes
- Fixed `[object Object]` error toast for load failures (server returns nested error objects)
- Model sizes computed from actual files on disk for user-registered models (including sharded models)
- MXFP4 quantization format recognized in all quant regexes
- Quant regex widened for compact forms (q4k) and extended names (UD-Q3_K_XL)
- Root-level sharded GGUF files deduplicated in the quant dropdown
- Non-quant folders (e.g. `whisper.cpp/`) expand to individual files
- Confirm dialog supports an optional checkbox with controlled state
- `HttpClient::get` now captures response headers (for rate limit parsing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
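The default mmproj preference above (BF16 > F16 > F32) can be sketched as a small picker. This is a hypothetical sketch, not the PR's actual code; the function name is illustrative:

```typescript
// Hypothetical sketch of the default mmproj preference: BF16 > F16 > F32,
// falling back to the first file when no preferred precision tag matches.
const MMPROJ_PREFERENCE = ["bf16", "f16", "f32"];

function pickDefaultMmproj(files: string[]): string | undefined {
  for (const tag of MMPROJ_PREFERENCE) {
    // Checking bf16 first means a BF16 file is never mistaken for plain F16.
    const match = files.find((f) => f.toLowerCase().includes(tag));
    if (match) return match;
  }
  return files[0];
}
```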
Rename ConfirmCheckbox/checkbox/checkboxChecked to KeepFilesOption/keepFilesOption/keepFiles throughout ConfirmDialog and its callers to better reflect the HF cache preservation feature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HF cache discovery state below HF search state instead of interleaved in the middle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server-side: discover_hf_cache_models() fetches pipeline_tag from HF API for each cached model, enabling accurate recipe classification. Frontend: cache models grouped by provider slug (e.g., "unsloth (4)") using collapsible sections. Extracted CacheModelInfo interface and renderCacheModelItem/renderCacheProviderGroup functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
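The cache discovery described above relies on the HF hub cache layout, where each repo lives under `hub/models--{org}--{name}/`. A minimal sketch of mapping those directory names back to model IDs (the PR's real implementation is server-side C++ in `model_manager.cpp`; these names are illustrative):

```typescript
// Hypothetical sketch: the HF hub cache encodes "org/name" as "models--org--name",
// so discovery can walk those folder names and recover the model IDs.
function cacheDirToModelId(dirName: string): string | undefined {
  if (!dirName.startsWith("models--")) return undefined; // skip datasets--, spaces--
  return dirName.slice("models--".length).split("--").join("/");
}

function discoverCachedModelIds(dirNames: string[]): string[] {
  return dirNames
    .map(cacheDirToModelId)
    .filter((id): id is string => id !== undefined);
}
```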
…earch fixes
- Recipe filter chips (llama.cpp, sd.cpp, whisper.cpp, Kokoro, FLM,
RyzenAI) apply across all three sections: suggested, cache, and search
- Section visibility toggles to show/hide suggested, HF cache, and
HF search independently
- Filter icon turns green when non-default filters are active
- Multi-word search matches each word independently ("Qwen Image"
matches "Qwen-Image-GGUF")
- Fix URL encoding in HF search proxy for queries with spaces
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
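The independent-word matching described in this commit can be sketched as follows (a hypothetical sketch, not the PR's code):

```typescript
// Hypothetical sketch of multi-word search: every whitespace-separated word in
// the query must appear somewhere in the model name, case-insensitively, so
// "Qwen Image" matches "Qwen-Image-GGUF".
function matchesQuery(modelName: string, query: string): boolean {
  const haystack = modelName.toLowerCase();
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .every((w) => haystack.includes(w));
}
```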
Add TASK_RECIPE_MAP entries for embedding models (sentence-similarity, feature-extraction pipeline tags) and reranking models (text-ranking) so they route to llamacpp instead of being marked incompatible. Expand sd-cpp hfTags with image-generation and image-editing to catch FLUX models via repository tags. Add name pattern fallbacks: /embed/i, /nomic/i, /rerank/i. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
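The mapping this commit extends can be sketched roughly as below. The pipeline tags and name-pattern fallbacks come from the commit messages in this PR; the exact shape of the real `TASK_RECIPE_MAP` in `recipeCompatibility.ts` may differ:

```typescript
// Hypothetical sketch of pipeline_tag → recipe routing with the name-pattern
// fallbacks (/embed/i, /nomic/i, /rerank/i) used when no pipeline tag exists.
const TASK_RECIPE_MAP: Record<string, string> = {
  "text-generation": "llamacpp",
  "image-text-to-text": "llamacpp",
  "sentence-similarity": "llamacpp",
  "feature-extraction": "llamacpp",
  "text-ranking": "llamacpp",
  "text-to-image": "sd-cpp",
  "automatic-speech-recognition": "whispercpp",
  "text-to-speech": "kokoro",
};

const NAME_FALLBACKS: Array<[RegExp, string]> = [
  [/embed/i, "llamacpp"],
  [/nomic/i, "llamacpp"],
  [/rerank/i, "llamacpp"],
];

function recipeFor(pipelineTag: string | undefined, modelId: string): string | undefined {
  if (pipelineTag && TASK_RECIPE_MAP[pipelineTag]) return TASK_RECIPE_MAP[pipelineTag];
  const hit = NAME_FALLBACKS.find(([re]) => re.test(modelId));
  return hit ? hit[1] : undefined; // undefined → candidate for "incompatible"
}
```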
- Sort quantizations by bit-depth (ascending) matching HuggingFace
ordering: IQ1 → Q2 → Q3 → Q4 → Q5 → Q6 → Q8 → BF16/F16/F32
- UD (Unsloth Dynamic) quants labeled with (UD) suffix for clarity
- UD variants sort right after their non-UD equivalent
- Default quant selection prefers Q4_K_M when available
- Tooltip shows count when >10 quants ("scroll for more")
- URL encode search proxy params (fixes spaces in queries)
- Green filter icon when non-default filters active
- Multi-word search matches each word independently
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
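The bit-depth ordering above can be sketched with a comparator like this (a hypothetical sketch under the assumptions in the commit message, not the PR's actual code):

```typescript
// Hypothetical sketch of quant ordering: extract the bit count from the quant
// name, sort ascending (IQ1 → Q2 → ... → Q8 → BF16/F16/F32), and place UD
// (Unsloth Dynamic) variants right after their non-UD equivalent.
function quantRank(q: string): number {
  const m = q.match(/(?:I?Q)(\d+)/i); // IQ1, Q2 ... Q8
  if (m) return parseInt(m[1], 10);
  if (/BF16|F16|F32/i.test(q)) return 16; // full/half precision last
  return 99; // unknown quant formats at the end
}

function sortQuants(quants: string[]): string[] {
  return [...quants].sort((a, b) => {
    const diff = quantRank(a) - quantRank(b);
    if (diff !== 0) return diff;
    const udA = /^UD-/i.test(a) ? 1 : 0;
    const udB = /^UD-/i.test(b) ? 1 : 0;
    if (udA !== udB) return udA - udB; // UD sorts after non-UD at same depth
    return a.localeCompare(b);
  });
}
```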
The GGUF path resolver was catching sd-cpp models with safetensors checkpoints (e.g. Z-Image-Turbo), finding no .gguf files, and returning the directory path instead of falling through to the generic resolver. Now skips the GGUF resolver when the variant explicitly contains a non-GGUF extension (.safetensors, .onnx, .bin). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
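The guard described in this commit can be sketched as a simple extension check (names hypothetical; the real resolver is server-side):

```typescript
// Hypothetical sketch: if the requested variant explicitly names a non-GGUF
// file, skip the GGUF path resolver so the variant falls through to the
// generic resolver instead of returning a directory path.
const NON_GGUF_EXTENSIONS = [".safetensors", ".onnx", ".bin"];

function shouldSkipGgufResolver(variant: string): boolean {
  const v = variant.toLowerCase();
  return NON_GGUF_EXTENSIONS.some((ext) => v.includes(ext));
}
```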
Resolve merge conflicts with main, keeping recipe-aware filtering. HF Search: - Issue parallel search queries per enabled recipe backend - llamacpp: filter=gguf, sd-cpp: filter=safetensors,text-to-image, kokoro: filter=onnx,text-to-speech - Pinned providers: whispercpp→ggerganov, flm→FastFlowLM, ryzenai-llm→amd (surfaces new models without registry updates) - Merge, deduplicate, sort by downloads across all backends - Per-recipe cursor tracking for correct multi-backend pagination - Pre-filter: skip detectBackend for models with unsupported pipeline_tag (saves 2 HF API calls per incompatible model) - Add author param to HF search proxy whitelist Format gating (recipeCompatibility.ts): - RECIPE_FORMATS map defines supported file formats per backend - hasRequiredFormat() gates classification by format compatibility (e.g. sd-cpp requires safetensors, not gguf) - Uses HF format tags with file extension fallback - SUPPORTED_PIPELINE_TAGS exported for pre-filter - Add translation and image-to-text to LLM_PIPELINE_TAGS Filter chips: - Color-coded by backend state: green (installed), yellow (available/installable), red (unsupported) - Inactive chips show subtle state tint (unsupported distinguishable from deselected) - Default to only viable backends on system detection - Filter indicator compares against viable count Test tooling: - New test/hf_model_tags.py: HF model tag analysis with --detect, --summary, per-recipe flags, rate limiting, HF_TOKEN support Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These Python-only quantization formats require bitsandbytes/autoawq/auto-gptq runtimes and cannot be loaded by any C++ backend (llamacpp, sd-cpp, etc.). Check model ID for bnb/awq/gptq markers before classification to avoid false positives (e.g. bnb-4bit safetensors models classified as sd-cpp). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
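The marker check described above can be sketched as a pre-classification guard (a hypothetical sketch; the marker list is drawn from this commit message):

```typescript
// Hypothetical sketch: bitsandbytes/AWQ/GPTQ checkpoints require Python
// runtimes, so a model ID carrying one of these markers is rejected before
// any recipe classification runs (avoids e.g. bnb-4bit safetensors models
// being misclassified as sd-cpp).
const PYTHON_ONLY_QUANT_MARKERS = /(bnb|bitsandbytes|awq|gptq)/i;

function isPythonOnlyQuant(modelId: string): boolean {
  return PYTHON_ONLY_QUANT_MARKERS.test(modelId);
}
```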
Search heuristics on Hugging Face models are a Pandora's box until the newer backends start supporting more model nuances: sd.cpp [almost] works with Qwen-Image, just not on ROCm (goes to black halfway through the layers); whisper.cpp almost works with GGUF but not yet; and FLM is built around a single-user/Windows use case and just doesn't support the HF ecosystem properly (FastFlowLM/FastFlowLM#406), so dynamic search-and-add for new models doesn't fit the use case for a shared model shelf unless you are dragging around the proprietary folder format. Moving this back to draft. I may cherry-pick a PR for the HF cache piece, which is the best feature IMHO for anyone sharing model shelves between users, OSs, etc., but I'm gone for the next couple of weeks. @jeremyfowers
Move "Downloaded only" toggle above section toggles for better discoverability, rename "Group suggested models" to "Suggested models". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This builds on the recently added Hugging Face search support for GGUFs and extends search across all backends.
Some of the newer backends are currently pinned to a small set of authors. That still provides value: it can surface models that are not yet in the curated release set, such as newer Qwen3.5 models for FLM at the time of writing. It also gives us room to expand coverage as backend support evolves, for example as whisper.cpp gains support for newer GGUF formats.
- llamacpp: `filter=gguf`
- sd-cpp: `filter=safetensors,text-to-image`
- kokoro: `filter=onnx,text-to-speech`
- whispercpp: `author=ggerganov` (pinned)
- flm: `author=FastFlowLM` (pinned)
- ryzenai-llm: `author=amd` + `filter=onnx`

A helper script, `tests/hf_model_tags.py`, queries the HF API and reports `pipeline_tag` plus relevant library, task, and format tags for suggested models or any HF model ID. It is useful for building and refining backend filter logic, especially for evolving image and audio cases.

For example, llama.cpp is currently treated as "all GGUF except unsupported tasks." If overlap with other GGUF-consuming backends becomes too broad, we can invert that logic and move to a more additive model. Analyzing the current suggested models reveals tag options for a more targeted approach, noting that 15 models have no pipeline tags upstream.
```
~/src/lemonade/test$ python3 hf_model_tags.py --summary --llamacpp
...
============================================================
TAG SUMMARY BY RECIPE
============================================================
[llamacpp] (57 models)
  pipeline tags: (none)×15, image-text-to-text, sentence-similarity, text-generation, text-ranking
  formats: gguf
  tasks: conversational, feature-extraction, image-text-to-text, sentence-similarity, text-generation, text-ranking
  libraries: (none)×15, llama.cpp, pytorch, sentence-transformers, transformers, transformers.js, vllm
  other: af, am, ar, az, ba, be, bg, bn, bs, ca, ce, chat, co, code, codeqwen, cross-encoder, cs, custom_code, cy, da, de, deepseek, deploy:azure, dv, edge, el, en, endpoints_compatible, eo, es, et, eu, fa, facebook, fi, fr, fy, ga, gd, gemma, gemma3, gguf-my-repo, gl, gn, google, gpt_oss, granite-4.0, gu, gv, ha, he, hi, hr, ht, hu, hy, id, ig, image-generation, imatrix, it, ja, jv, km, kn, ko, ku, ky, la, language, lfm2, lfm2.5, liquid, llama, llama-3, llama-4, llama-cpp, llama4, lo, lt, lv, math, meta, mg, mi, microsoft, mistral-common, mk, ml, mn, moe, mr, ms, multilingual, mxfp4, my, ne, nl, nlp, nn, no, nvidia, ny, openai, pa, phi, phi3, phi4, pl, prompt-compression, prompt-engineering, prompt-expansion, ps, pt, q4_k_m, quantized, qwen, qwen-coder, qwen3, qwen3_5_moe, qwen3_moe, qwen3_next, reranker, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, text-embeddings-inference, tg, th, tl, tn, tr, ug, uk, unsloth, ur, uz, vi, xh, yi, yo, zh, zu
```

Which might lead to something like below if the GGUF filter with task filters draws in too many unsupported models but misses some without pipeline tags.
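One possible shape for that additive filter, sketched here as an illustration (the function name and tag choices are drawn from the summary output above, not from the PR's code):

```typescript
// Hypothetical sketch of an additive llama.cpp filter: accept only known-good
// pipeline tags, and for the ~15 repos with no pipeline tag fall back to
// task/library tags observed in the hf_model_tags.py summary.
const LLAMACPP_PIPELINE_TAGS = new Set([
  "text-generation",
  "image-text-to-text",
  "sentence-similarity",
  "text-ranking",
]);

const LLAMACPP_FALLBACK_TAGS = new Set(["llama.cpp", "conversational", "text-generation"]);

function acceptForLlamacpp(pipelineTag: string | undefined, tags: string[]): boolean {
  if (pipelineTag) return LLAMACPP_PIPELINE_TAGS.has(pipelineTag);
  return tags.some((t) => LLAMACPP_FALLBACK_TAGS.has(t));
}
```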
Model Repo Links
HF search often requires opening the repo page to inspect quant sizes, runtime flags, chat template details, and example usage. This PR adds the repo link to each searched model so users can quickly click through and validate details in a new tab.
HF search proxy with auth
- `GET /hf/search` proxy that forwards the server-side `HF_TOKEN`, doubling the default rate limit from 500 to 1000 requests per 5 minutes
- Cursor-based pagination with `‹ N ›` controls
- Adaptive cooldown based on the `RateLimit` response header
- Rate-limit message suggests setting `HF_TOKEN` when unauthenticated
- `author` parameter in the proxy for pinned-provider queries

HF Cache
The HF cache is often shared across tools and services. Exposing it in the model manager lets Lemonade discover models that were added outside Lemonade and register them without re-downloading. This is especially useful on systems with shared model stores, multi-OS setups, or large mounted repositories where users want to “take models off the shelf” into Lemonade and optionally leave the files in place when removing them later.
This PR adds HF cache discovery and a UI option to keep cached files when removing a model from Lemonade.
- `GET /cache/models` to scan the local HF cache for downloaded models that are not yet registered
- `keep_files` option so removal can unregister the model without deleting cached files

Filter Dialog
The filter dialog now supports including or excluding any backend from search, so users can narrow results to the backends they care about.
It also makes the active filter state more visible by reflecting both default behavior and current backend availability. Backends that are not active are automatically excluded from search. Users can also hide the new HF Cache and HF Online sections to restore the previous suggested-models-only layout.
Model Quants
Improves quant detection and ordering across flat and nested folder layouts.
- Widened quant regex for compact forms like `q4k` and extended names like `UD-Q3_K_XL`

Core: recipe-aware classification
- New `recipeCompatibility.ts` with `classifyModel()`, which prioritizes Hugging Face `pipeline_tag` metadata over file-format-only detection
- `RECIPE_FORMATS` to gate classification by file-format compatibility, for example preventing `sd-cpp` from matching GGUF models
- Classification fallback chain: `pipeline_tag` → repo tags → name pattern matching
- Experimental models show a yellow `?` badge and require confirmation before install

Vision model (`mmproj`) handling

- Returns `mmproj_files` from the cache endpoint for vision model detection
- When multiple `mmproj` files are found, opens the Add Model dialog for user selection
- When a single `mmproj` file is found, auto-selects it and installs directly
- Prefers `BF16 > F16 > F32` when picking a default `mmproj`

Other UX improvements
- Fixed the `[object Object]` error toast for model load failures

Backend fixes
- Extended `HttpClient::get` to capture response headers for rate-limit parsing

Files changed
- `src/app/src/renderer/utils/recipeCompatibility.ts`
- `src/app/src/renderer/ModelManager.tsx`
- `src/app/src/renderer/AddModelPanel.tsx` — `mmproj` default selection, recipe mismatch warning
- `src/app/src/renderer/ConfirmDialog.tsx`
- `src/app/src/renderer/components/ConnectedBackendRow.tsx`
- `src/app/src/renderer/utils/backendInstaller.ts` — `labels` field, `keep_files` param, error message fix
- `src/app/styles.css`
- `src/cpp/include/lemon/model_manager.h` — `discover_hf_cache_models()`, `delete_model(keep_files)`
- `src/cpp/server/model_manager.cpp`
- `src/cpp/server/server.cpp` — `/cache/models` and `/hf/search` endpoints, `keep_files` support
- `src/cpp/server/utils/http_client.cpp`

Test plan
- Search `stable-diffusion` — SD models should show the `sd.cpp` badge, not `llama.cpp`
- Search `whisper` — only `.bin`-based whisper models should be shown; GGUF-only models should be hidden
- Search `Qwen3-VL` — vision models should show the `llama.cpp` badge and detect `mmproj`
- Installing a model with `mmproj` files opens the edit dialog when multiple files are present, or installs directly when only one exists
- Pagination `‹ ›` controls work, and rate-limit cooldown adapts to remaining quota
- Setting `HF_TOKEN` in the environment doubles the rate limit; verify `authenticated: true` in the proxy response

Tested on Ubuntu 26.04 via the web interface using `lemonade-server`.

🤖 Generated with Claude Code using the 1M context window on Opus with no compaction