# Claude Code

Claude Code is a high-performance agentic coding CLI from Anthropic that can reason about your codebase, execute commands, and edit files. While it is natively designed for Anthropic's hosted models, **Lemonade Server** allows you to use Claude Code with **local models** by emulating the Anthropic API.

This setup provides a private, offline-capable coding assistant that lives in your terminal and has full access to your local development environment.

## Prerequisites

### 1. Install Claude Code
Claude Code can be installed via npm or using the official installation script:

**Using curl:**
```bash
curl -fsSL https://claude.ai/install.sh | bash
```

**Using npm:**
```bash
npm install -g @anthropic-ai/claude-code
```

### 2. Lemonade Server
Ensure you have Lemonade Server installed. If you haven't set it up yet, refer to the [Getting Started guide](https://lemonade-server.ai/install_options.html).

### 3. Download a Coding Model
Claude Code requires a model with strong instruction-following and tool-use capabilities. We recommend **Qwen3.5-35B-A3B-GGUF** or **GLM-4.7-Flash-GGUF**. These models require systems with 64 GB of RAM or more.

```bash
# Recommended for most coding tasks
lemonade-server pull Qwen3.5-35B-A3B-GGUF

# High-performance alternative
lemonade-server pull GLM-4.7-Flash-GGUF
```

## Launching Claude Code

Lemonade provides a specialized `launch` command that streamlines the connection between the Claude Code CLI and your local server. It handles all necessary environment variables, including API redirection and performance optimizations.

### Step 1: Start Lemonade Server
In one terminal window, ensure the server is running:
```bash
lemonade-server serve --ctx-size 32768
```

We recommend starting the server with a context window of at least 32768 tokens to accommodate Claude Code's system prompt (20k+ tokens). You may need to adjust this value depending on your hardware and project size.

### Step 2: Launch the Agent
Navigate to your project directory in another terminal and run:
```bash
lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF
```

**What happens under the hood?**
When you execute the `launch` command, Lemonade Server initiates a **concurrent load** of the specified model on the backend. This means the server starts loading the model into memory in a background thread, allowing the Claude Code CLI interface to start instantly without blocking.

Additionally, the command automatically configures several critical environment variables:
- `ANTHROPIC_BASE_URL`: Redirects Claude Code to your local Lemonade instance.
- `ANTHROPIC_API_KEY`: Sets a local placeholder key.
- `ANTHROPIC_AUTH_TOKEN`: Sets a local placeholder token.
- `CLAUDE_CODE_ATTRIBUTION_HEADER`: Set to `0` to prevent KV cache invalidation, ensuring maximum inference speed.
- `CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`: Set to `1` to minimize telemetry and unnecessary external calls.
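For reference, the manual equivalent of what `launch` configures would look roughly like the sketch below. This is illustrative only: the placeholder values and the exact base URL are assumptions (the port follows the default `http://localhost:8000` noted later in this guide), and `lemonade-server launch claude` remains the recommended path.

```shell
# Illustrative sketch of the environment `launch` sets up for you.
# Values here are assumptions, not documented requirements.
export ANTHROPIC_BASE_URL="http://localhost:8000"   # assumed default Lemonade port
export ANTHROPIC_API_KEY="lemonade"                 # local placeholder key
export ANTHROPIC_AUTH_TOKEN="lemonade"              # local placeholder token
export CLAUDE_CODE_ATTRIBUTION_HEADER=0             # avoid KV cache invalidation
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1   # minimize external calls
# claude   # then start Claude Code in this same shell
```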

## Performance Expectations

### Concurrent Loading and Initial Delay
Because the model is loaded concurrently on the backend, if you send a query immediately after launching the CLI, you may experience an initial delay while the model finishes loading into memory.

### System Prompt Processing
The first query you send to Claude Code will take longer than subsequent ones, typically **30-40 seconds on a Strix Halo system**, though this varies with your hardware and the selected model.

This delay occurs because Claude Code sends a massive system prompt (20,000+ tokens) that defines its agentic behavior and tool-use capabilities. Once Lemonade processes this prompt, it is cached in the KV (Key-Value) cache, and subsequent queries will respond significantly faster as long as the session remains active.
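As a rough sanity check, the quoted 30-40 second first-query delay is what simple prefill arithmetic predicts for a ~20,000-token prompt. The throughput figures below are illustrative assumptions, not Lemonade benchmarks:

```shell
# Illustrative arithmetic only: prefill time = prompt tokens / prefill throughput
prompt_tokens=20000   # approximate size of Claude Code's system prompt
slow_tps=500          # assumed low-end prefill throughput (tokens/s)
fast_tps=700          # assumed high-end prefill throughput (tokens/s)
echo "at ${slow_tps} tok/s: $((prompt_tokens / slow_tps)) s"   # 40 s
echo "at ${fast_tps} tok/s: $((prompt_tokens / fast_tps)) s"   # 28 s
```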

## Best Practices for Strix Halo (128GB)

If you are using a **Strix Halo** system with **128GB of RAM**, you have a top-tier environment for local AI! You can run larger models with significantly expanded context windows.

### Recommended Model Choices - MoE models with strong agentic capabilities
* **Qwen3.5-35B-A3B-GGUF**: A Mixture of Experts (MoE) model that excels at agentic tasks. It has 35B parameters, 3B of which are active at a time.
* **Qwen3-Coder-Next**: An MoE model designed specifically for coding agents and local development, with 80B total parameters, 3B of which are active at a time.
* **GLM-4.7-Flash-GGUF**: A 30B-A3B MoE model and an excellent alternative for rapid iteration and complex instruction following.

### Custom Tuning with `llamacpp-args`
By default, Lemonade uses the coding defaults (`-b 4096 -ub 1024 -fa on`). Using the `--llamacpp-args` flag, you can customize the parameters passed to llama-server when the model is being loaded.
```bash
lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --llamacpp-args "-b 1024 -ub 1024 -fa on"
```

- `-b` and `-ub`: These control the logical and physical batch sizes. While the llama.cpp default is `512`, increasing them (e.g., to `1024` or `2048`) helps saturate memory bandwidth on high-end hardware like Strix Halo, at the cost of higher RAM/VRAM consumption.
- `-fa on`: Enables Flash Attention for improved performance.

### Using Model Recipes
While you can manually pass arguments with `--llamacpp-args`, a more scalable approach is to use the model's saved configuration by passing `--use-recipe`.

```bash
lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --use-recipe
```

**How Recipes Work**
When `--use-recipe` is invoked, Lemonade skips the default `launch` arguments and instead reads from your `recipe_options.json` file. This file stores per-model runtime settings (like context window size, batch size, and hardware acceleration backends).

- **Location:** You can find or edit this file directly in your Lemonade cache directory:
- **Linux/macOS:** `~/.cache/lemonade/recipe_options.json`
- **Windows:** `%USERPROFILE%\.cache\lemonade\recipe_options.json`
- **Keys:** Entries in this JSON file use the full prefixed model name (e.g., `"user.Qwen3.5-35B-A3B-GGUF"`).
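As an illustration, an entry might look like the following. The field names are assumptions based on the settings described above (context window size, batch size), not a documented schema; check your own `recipe_options.json` for the exact keys in use:

```json
{
  "user.Qwen3.5-35B-A3B-GGUF": {
    "ctx-size": 32768,
    "llamacpp-args": "-b 4096 -ub 1024 -fa on"
  }
}
```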

**Importing via Web UI**
Instead of manually editing the JSON file, you can also easily add recipes using the Lemonade Web Interface:
1. Open the Lemonade Web Interface (usually `http://localhost:8000`).
2. Navigate to the model management section.
3. Click on **"Import a model"**.
4. Upload the recipe configuration.

**Settings Priority**
When loading a model for a launched agent, Lemonade Server resolves settings in this order (highest priority first):
1. Explicit values passed in the load request (e.g., using `--llamacpp-args` via CLI).
2. Per-model values defined in `recipe_options.json` (used when `--use-recipe` is active).
3. Global environment variables (e.g., `LEMONADE_MAX_LOADED_MODELS`).
4. Hardcoded system defaults.

For community-tested configurations optimized for specific hardware setups, check out the [Lemonade Recipes Wiki](https://github.com/lemonade-sdk/lemonade/wiki/Recipes) (Note: these are currently a work in progress).

## What's realistic to achieve?

Local agents are incredibly useful, but they have different strengths than the giant models running in the cloud.

**Where they work well:**
Local models work well for focused, well-defined tasks. If you need to refactor a specific module or generate some boilerplate, they perform great.

**Where they fall short:**
Local models aren't quite ready to handle massive, project-wide refactors that touch dozens of interdependent files. They can also sometimes lose their way if they hit an unexpected error halfway through a complex task.

## Troubleshooting

### Login Prompt appears
If Claude Code asks you to log in to Anthropic, it means the environment variables weren't picked up correctly. Ensure you are using `lemonade-server launch claude` rather than calling `claude` directly.

### Performance is Slow
- Verify that `Flash Attention` is enabled in your `llamacpp-args`.
- Check that no other heavy applications are consuming your GPU resources.
- If you are using a very large context (`-c`), the "prefill" time (time to first token) will increase.

### "Permission Denied" when editing files
Claude Code may ask for permission to run commands or edit files. You can run it with the `--dangerously-skip-permissions` flag if you trust the model in a sandbox, but manual approval is recommended for safety.

---
*For more information on Claude Code's capabilities, visit the [Anthropic Documentation](https://docs.anthropic.com/en/docs/agents-and-tools/claude-code).*