Add a user guide for Claude Code integration #1334
base: main
Changes from all commits
aa7057f
640bbbd
1533fb9
44d3f6b
c24f0a8
5d2738d
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| # Claude Code | ||
|
|
||
| Claude Code is a high-performance agentic coding CLI from Anthropic that can reason about your codebase, execute commands, and edit files. While it is natively designed for Anthropic's hosted models, **Lemonade Server** allows you to use Claude Code with **local models** by emulating the Anthropic API. | ||
|
|
||
| This setup provides a private, offline-capable coding assistant that lives in your terminal and has full access to your local development environment. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| ### 1. Install Claude Code | ||
| Claude Code can be installed via npm or using the official installation script: | ||
|
|
||
| **Using curl:** | ||
| ```bash | ||
| curl -fsSL https://claude.ai/install.sh | bash | ||
| ``` | ||
|
|
||
| **Using npm:** | ||
| ```bash | ||
| npm install -g @anthropic-ai/claude-code | ||
| ``` | ||
|
|
||
| ### 2. Lemonade Server | ||
| Ensure you have Lemonade Server installed. If you haven't set it up yet, refer to the [Getting Started guide](https://lemonade-server.ai/install_options.html). | ||
|
|
||
| ### 3. Download a Coding Model | ||
| Claude Code requires a model with strong instruction-following and tool-use capabilities. We recommend **Qwen3.5-35B-A3B-GGUF** or **GLM-4.7-Flash-GGUF**. These models require systems with 64 GB of RAM or more. | ||
|
|
||
| ```bash | ||
| # Recommended for most coding tasks | ||
| lemonade-server pull Qwen3.5-35B-A3B-GGUF | ||
|
|
||
| # High-performance alternative | ||
| lemonade-server pull GLM-4.7-Flash-GGUF | ||
| ``` | ||
|
|
||
| ## Launching Claude Code | ||
|
|
||
| Lemonade provides a specialized `launch` command that streamlines the connection between the Claude Code CLI and your local server. It handles all necessary environment variables, including API redirection and performance optimizations. | ||
|
|
||
| ### Step 1: Start Lemonade Server | ||
| In one terminal window, ensure the server is running: | ||
| ```bash | ||
| lemonade-server serve --ctx-size 32768 | ||
| ``` | ||
|
|
||
| We recommend a context window of at least 32768 tokens to accommodate Claude Code's system prompt (20k+ tokens). You may need to adjust this value depending on your hardware and project size. | ||
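As a rough illustration of why 32768 is a floor rather than a comfortable setting, the arithmetic below subtracts the approximate system prompt size from the context window. Both figures are the approximations quoted in this guide, not measured values.

```bash
# Rough context budget. Both numbers are approximations taken from this guide.
ctx_size=32768        # value passed to --ctx-size
system_prompt=20000   # approximate size of Claude Code's system prompt, in tokens

remaining=$((ctx_size - system_prompt))
echo "Tokens left for code and conversation: ${remaining}"
```

Whatever remains has to hold your files, tool output, and the conversation itself, which is why larger projects may need a larger `--ctx-size`.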
|
|
||
| ### Step 2: Launch the Agent | ||
| Navigate to your project directory in another terminal and run: | ||
| ```bash | ||
| lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF | ||
| ``` | ||
|
|
||
|
Member
This would be a good spot for a screenshot. We can upload screenshots to https://github.com/lemonade-sdk/assets to avoid bloating this repo (see how that is used in the Open WebUI guide in this same folder). |
||
| **What happens under the hood?** | ||
| When you execute the `launch` command, Lemonade Server initiates a **concurrent load** of the specified model on the backend. This means the server starts loading the model into memory in a background thread, allowing the Claude Code CLI interface to start instantly without blocking. | ||
|
Comment on lines +40 to +55
Member
Brainstorming here...
Option 1: Would this workflow be simpler if we used That way:
Option 2: Should
Then the whole thing is one command.
Collaborator
Author
I'm leaning more towards option 2 here; using the API gives us more flexibility.
Member
Sounds good! |
||
|
|
||
| Additionally, the command automatically configures several critical environment variables: | ||
| - `ANTHROPIC_BASE_URL`: Redirects Claude Code to your local Lemonade instance. | ||
| - `ANTHROPIC_API_KEY`: Sets a local placeholder key. | ||
| - `ANTHROPIC_AUTH_TOKEN`: Sets a local placeholder token. | ||
| - `CLAUDE_CODE_ATTRIBUTION_HEADER`: Set to `0` to prevent KV cache invalidation, ensuring maximum inference speed. | ||
| - `CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`: Set to `1` to minimize telemetry and unnecessary external calls. | ||
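For readers who want to see what that configuration amounts to, here is a minimal sketch that sets the same five variables by hand. The base URL and the placeholder key and token values are assumptions chosen for illustration; this guide does not state the exact values that `launch` uses, and it does not confirm that starting `claude` manually with these variables behaves identically.

```bash
# Sketch of the environment that `lemonade-server launch claude` is described as preparing.
# The URL and placeholder values below are assumptions, not the values Lemonade actually sets.
setup_claude_env() {
  local base_url="${1:-http://localhost:8000}"   # assumed local server address
  export ANTHROPIC_BASE_URL="$base_url"
  export ANTHROPIC_API_KEY="lemonade"             # hypothetical placeholder key
  export ANTHROPIC_AUTH_TOKEN="lemonade"          # hypothetical placeholder token
  export CLAUDE_CODE_ATTRIBUTION_HEADER=0
  export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
}

setup_claude_env
# Show the resulting variables so they can be inspected.
env | grep -E '^(ANTHROPIC_|CLAUDE_CODE_)' | sort
```

The `launch` command remains the recommended path, since it also triggers the model load described above.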
|
|
||
| ## Performance Expectations | ||
|
|
||
| ### Concurrent Loading and Initial Delay | ||
| Because the model is loaded concurrently on the backend, if you send a query immediately after launching the CLI, you may experience an initial delay while the model finishes loading into memory. | ||
|
|
||
| ### System Prompt Processing | ||
| The first query you send to Claude Code will take longer than subsequent ones, typically **30-40 seconds on a Strix Halo system**, though this varies based on your hardware and model selected. | ||
|
|
||
| This delay occurs because Claude Code sends a massive system prompt (20,000+ tokens) that defines its agentic behavior and tool-use capabilities. Once Lemonade processes this prompt, it is cached in the KV (Key-Value) cache, and subsequent queries will respond significantly faster as long as the session remains active. | ||
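To put the figures above in perspective, the following back-of-the-envelope calculation converts them into an implied prompt-processing rate. It assumes a prompt of exactly 20,000 tokens and that the whole 30-40 seconds is spent on prefill, which is a simplification.

```bash
# Implied prefill throughput from the numbers quoted in this guide (integer division).
prompt_tokens=20000
for seconds in 30 40; do
  echo "${seconds}s -> about $((prompt_tokens / seconds)) tokens/s"
done
```

This works out to roughly 500 to 670 tokens per second, which can serve as a loose reference point when comparing other hardware or models.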
|
|
||
| ## Best Practices for Strix Halo (128GB) | ||
|
|
||
| If you are using a **Strix Halo** system with **128GB of RAM**, you have a top-tier environment for local AI! You can run larger models with significantly expanded context windows. | ||
|
|
||
| ### Recommended Model Choices - MoE models with strong agentic capabilities | ||
| * **Qwen3.5-35B-A3B-GGUF**: A Mixture of Experts (MoE) model that excels at agentic tasks. It has 35B parameters, 3B of which are active at a time. | ||
| * **Qwen3-Coder-Next**: An MoE model designed specifically for coding agents and local development, with 80B total parameters, 3B of which are active at a time. | ||
| * **GLM-4.7-Flash-GGUF**: A 30B-A3B MoE model and an excellent alternative for rapid iterations and complex instruction following. | ||
|
|
||
| ### Custom Tuning with `llamacpp-args` | ||
| By default, Lemonade uses the coding defaults (`-b 4096 -ub 1024 -fa on`). Using the `--llamacpp-args` flag, you can customize the parameters passed to llama-server when the model is being loaded. | ||
| ```bash | ||
| lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --llamacpp-args "-b 1024 -ub 1024 -fa on" | ||
| ``` | ||
|
|
||
| - `-b` and `-ub`: These control the logical batch size and the physical (micro) batch size, respectively. While llama.cpp's default physical batch size is `512`, increasing these values (e.g., to `1024` or `2048`) helps saturate memory bandwidth on high-end hardware like Strix Halo, at the cost of higher RAM/VRAM consumption. | ||
| - `-fa on`: Enables Flash Attention for optimized performance. | ||
|
|
||
| ### Using Model Recipes | ||
| While you can manually pass arguments with `--llamacpp-args`, a more scalable approach is to use the model's saved configuration by passing `--use-recipe`. | ||
|
|
||
| ```bash | ||
| lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --use-recipe | ||
|
Member
Here's my quandary with the workflow: Qwen3.5-35B-A3B-ThinkingCoder.json uses some different settings than what you recommend above. I need a way to combine your recommendations with the general Qwen3.5 Thinking suggestions from that recipe.
Option 1: The simplest way would be to populate the Lemonade Wiki with some specific recipe.json files for Claude Code. That way it is easy to use those as a starting point with --use-recipe. In that case, this guide should reference those recipe files and not the base models like
Option 2
But Option 1 seems way easier. wdyt? |
||
| ``` | ||
|
|
||
| **How Recipes Work** | ||
| When `--use-recipe` is invoked, Lemonade skips the default `launch` arguments and instead reads from your `recipe_options.json` file. This file stores per-model runtime settings (like context window size, batch size, and hardware acceleration backends). | ||
|
|
||
| - **Location:** You can find or edit this file directly in your Lemonade cache directory: | ||
| - **Linux/macOS:** `~/.cache/lemonade/recipe_options.json` | ||
| - **Windows:** `%USERPROFILE%\.cache\lemonade\recipe_options.json` | ||
| - **Keys:** Entries in this JSON file use the full prefixed model name (e.g., `"user.Qwen3.5-35B-A3B-GGUF"`). | ||
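Before relying on `--use-recipe`, it can help to confirm that the file actually contains an entry for your model. The helper below is a hypothetical convenience function, not part of Lemonade. It uses the Linux/macOS path given above and performs only a plain-text search for the quoted key; it does not parse the JSON or assume anything about the fields stored under that key.

```bash
# Returns success if recipe_options.json mentions the given model key.
# Uses a plain-text search only; it does not parse or validate the JSON.
has_recipe_entry() {
  local model_key="$1"
  local recipe_file="${2:-$HOME/.cache/lemonade/recipe_options.json}"
  [ -f "$recipe_file" ] && grep -qF "\"${model_key}\"" "$recipe_file"
}

if has_recipe_entry "user.Qwen3.5-35B-A3B-GGUF"; then
  echo "Recipe entry found"
else
  echo "No recipe entry found"
fi
```

The optional second argument lets you point the check at a different file, for example a copy you are editing before moving it into the cache directory.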
|
|
||
| **Importing via Web UI** | ||
| Instead of manually editing the JSON file, you can also easily add recipes using the Lemonade Web Interface: | ||
| 1. Open the Lemonade Web Interface (usually `http://localhost:8000`). | ||
| 2. Navigate to the model management section. | ||
| 3. Click on **"Import a model"**. | ||
| 4. Upload the recipe configuration. | ||
|
Comment on lines +107 to +112
Member
@bitgamma is there a way to import a recipe.json file using the CLI? That would blend a lot more smoothly into this guide. |
||
|
|
||
| **Settings Priority** | ||
| When loading a model for a launched agent, Lemonade Server resolves settings in this order (highest priority first): | ||
| 1. Explicit values passed in the load request (e.g., using `--llamacpp-args` via CLI). | ||
| 2. Per-model values defined in `recipe_options.json` (used when `--use-recipe` is active). | ||
| 3. Global environment variables (e.g., `LEMONADE_MAX_LOADED_MODELS`). | ||
| 4. Hardcoded system defaults. | ||
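The priority order above amounts to "first non-empty value wins". The sketch below illustrates that rule with a hypothetical helper and made-up values; it is not Lemonade's implementation, and the real resolution happens inside the server.

```bash
# Return the first non-empty candidate, mirroring the priority order listed above.
# Arguments: explicit value, recipe value, environment value, built-in default.
resolve_setting() {
  local candidate
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Hypothetical context-size example: nothing passed explicitly, the recipe supplies 65536.
resolve_setting "" "65536" "" "4096"
```

In this example the recipe value is printed because no explicit value was given; passing a non-empty first argument would override it.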
|
Comment on lines +114 to +119
Member
I hope that by streamlining the workflow above we can avoid the need for documentation like this. Let's simplify! |
||
|
|
||
| For community-tested configurations optimized for specific hardware setups, check out the [Lemonade Recipes Wiki](https://github.com/lemonade-sdk/lemonade/wiki/Recipes) (Note: these are currently a work in progress). | ||
|
|
||
| ## What's realistic to achieve? | ||
|
|
||
| Local agents are incredibly useful, but they have different strengths than the giant models running in the cloud. | ||
|
|
||
| **Where they work well:** | ||
| Local models work well for focused, well-defined tasks, such as refactoring a specific module or generating boilerplate. | ||
|
|
||
| **Where they fall short:** | ||
| Local models aren't quite ready to handle massive, project-wide refactors that touch dozens of interdependent files. They can also sometimes lose their way if they hit an unexpected error halfway through a complex task. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Login Prompt appears | ||
| If Claude Code asks you to log in to Anthropic, it means the environment variables weren't picked up correctly. Ensure you are using `lemonade-server launch claude` rather than calling `claude` directly. | ||
|
|
||
| ### Performance is Slow | ||
| - Verify that Flash Attention is enabled (`-fa on`) in your `--llamacpp-args`. | ||
| - Check that no other heavy applications are consuming your GPU resources. | ||
| - If you are using a very large context (`--ctx-size`), the "prefill" time (time to first token) will increase. | ||
|
|
||
| ### "Permission Denied" when editing files | ||
| Claude Code may ask for permission to run commands or edit files. You can run it with the `--dangerously-skip-permissions` flag if you trust the model in a sandbox, but manual approval is recommended for safety. | ||
|
|
||
| --- | ||
| *For more information on Claude Code's capabilities, visit the [Anthropic Documentation](https://docs.anthropic.com/en/docs/agents-and-tools/claude-code).* | ||
Does this model work well on its own, or is it best to use the ThinkingCoder.json recipe? That will impact whether we need to introduce the concept of custom model recipes in this guide, and how they should interact with the `launch` command.
So the ThinkingCoder recipe does have better performance for our agentic coding use case (the options are recommended by unsloth: https://unsloth.ai/docs/models/qwen3.5#recommended-settings). I do reference the wiki as a source of model recipes, so if we update that location, I think we should be ok.
I also elaborate on how `launch` should interact with recipes in my comment below; let me know what you think.