diff --git a/docs/server/apps/claude-code.md b/docs/server/apps/claude-code.md
new file mode 100644
index 000000000..ce30245d7
--- /dev/null
+++ b/docs/server/apps/claude-code.md
@@ -0,0 +1,147 @@
+# Claude Code
+
+Claude Code is a high-performance agentic coding CLI from Anthropic that can reason about your codebase, execute commands, and edit files. While it is natively designed for Anthropic's hosted models, **Lemonade Server** allows you to use Claude Code with **local models** by emulating the Anthropic API.
+
+This setup provides a private, offline-capable coding assistant that lives in your terminal and has full access to your local development environment.
+
+## Prerequisites
+
+### 1. Install Claude Code
+Claude Code can be installed via npm or using the official installation script:
+
+**Using curl:**
+```bash
+curl -fsSL https://claude.ai/install.sh | bash
+```
+
+**Using npm:**
+```bash
+npm install -g @anthropic-ai/claude-code
+```
+
+### 2. Lemonade Server
+Ensure you have Lemonade Server installed. If you haven't set it up yet, refer to the [Getting Started guide](https://lemonade-server.ai/install_options.html).
+
+### 3. Download a Coding Model
+Claude Code requires a model with strong instruction-following and tool-use capabilities. We recommend **Qwen3.5-35B-A3B-GGUF** or **GLM-4.7-Flash-GGUF**. These models require systems with 64 GB of RAM or more.
+
+```bash
+# Recommended for most coding tasks
+lemonade-server pull Qwen3.5-35B-A3B-GGUF
+
+# High-performance alternative
+lemonade-server pull GLM-4.7-Flash-GGUF
+```
+
+## Launching Claude Code
+
+Lemonade provides a specialized `launch` command that streamlines the connection between the Claude Code CLI and your local server. It handles all necessary environment variables, including API redirection and performance optimizations.
+
+### Step 1: Start Lemonade Server
+In one terminal window, ensure the server is running:
+```bash
+lemonade-server serve --ctx-size 32768
+```
+
+We recommend starting the server with a context window of at least 32768 tokens to accommodate Claude Code's system prompt (20k+ tokens). Note that you might need to change this value depending on your hardware and project size.
+
+### Step 2: Launch the Agent
+Navigate to your project directory in another terminal and run:
+```bash
+lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF
+```
+
+**What happens under the hood?**
+When you execute the `launch` command, Lemonade Server initiates a **concurrent load** of the specified model on the backend. This means the server starts loading the model into memory in a background thread, allowing the Claude Code CLI interface to start instantly without blocking.
+
+Additionally, the command automatically configures several critical environment variables:
+- `ANTHROPIC_BASE_URL`: Redirects Claude Code to your local Lemonade instance.
+- `ANTHROPIC_API_KEY`: Sets a local placeholder key.
+- `ANTHROPIC_AUTH_TOKEN`: Sets a local placeholder token.
+- `CLAUDE_CODE_ATTRIBUTION_HEADER`: Set to `0` to prevent KV cache invalidation, ensuring maximum inference speed.
+- `CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`: Set to `1` to minimize telemetry and unnecessary external calls.
+
+## Performance Expectations
+
+### Concurrent Loading and Initial Delay
+Because the model is loaded concurrently on the backend, if you send a query immediately after launching the CLI, you may experience an initial delay while the model finishes loading into memory.
+
+### System Prompt Processing
+The first query you send to Claude Code will take longer than subsequent ones, typically **30-40 seconds on a Strix Halo system**, though this varies based on your hardware and model selected.
+
+This delay occurs because Claude Code sends a massive system prompt (20,000+ tokens) that defines its agentic behavior and tool-use capabilities. Once Lemonade processes this prompt, it is cached in the KV (Key-Value) cache, and subsequent queries will respond significantly faster as long as the session remains active.
+
+## Best Practices for Strix Halo (128GB)
+
+If you are using a **Strix Halo** system with **128GB of RAM**, you have a top-tier environment for local AI! You can run larger models with significantly expanded context windows.
+
+### Recommended Model Choices - MoE models with strong agentic capabilities
+* **Qwen3.5-35B-A3B-GGUF**: A Mixture of Experts (MoE) model that excels at agentic tasks. It has 35B parameters, 3B of which are active at a time.
+* **Qwen3-Coder-Next**: An MoE model designed specifically for coding agents and local development, with 80B total parameters, 3B of which are active at a time.
+* **GLM-4.7-Flash-GGUF**: A 30B-A3B MoE model and an excellent alternative for rapid iterations and complex instruction following.
+
+### Custom Tuning with `llamacpp-args`
+By default, Lemonade uses the coding defaults (`-b 4096 -ub 1024 -fa on`). Using the `--llamacpp-args` flag, you can customize the parameters passed to llama-server when the model is being loaded.
+```bash
+lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --llamacpp-args "-b 1024 -ub 1024 -fa on"
+```
+
+- `-b` and `-ub`: These control the logical batch size and the physical batch size, respectively. While the llama.cpp default physical batch size is `512`, increasing these values (e.g., to `1024` or `2048`) helps saturate memory bandwidth on high-end hardware like Strix Halo, at the cost of higher RAM/VRAM consumption.
+- `-fa on`: Enables Flash Attention for optimized performance.
+
+### Using Model Recipes
+While you can manually pass arguments with `--llamacpp-args`, a more scalable approach is to use the model's saved configuration by passing `--use-recipe`.
+
+```bash
+lemonade-server launch claude -m Qwen3.5-35B-A3B-GGUF --use-recipe
+```
+
+**How Recipes Work**
+When `--use-recipe` is invoked, Lemonade skips the default `launch` arguments and instead reads from your `recipe_options.json` file. This file stores per-model runtime settings (like context window size, batch size, and hardware acceleration backends).
+
+- **Location:** You can find or edit this file directly in your Lemonade cache directory:
+  - **Linux/macOS:** `~/.cache/lemonade/recipe_options.json`
+  - **Windows:** `%USERPROFILE%\.cache\lemonade\recipe_options.json`
+- **Keys:** Entries in this JSON file use the full prefixed model name (e.g., `"user.Qwen3.5-35B-A3B-GGUF"`).
+
+**Importing via Web UI**
+Instead of manually editing the JSON file, you can also easily add recipes using the Lemonade Web Interface:
+1. Open the Lemonade Web Interface (usually `http://localhost:8000`).
+2. Navigate to the model management section.
+3. Click on **"Import a model"**.
+4. Upload the recipe configuration.
+
+**Settings Priority**
+When loading a model for a launched agent, Lemonade Server resolves settings in this order (highest priority first):
+1. Explicit values passed in the load request (e.g., using `--llamacpp-args` via CLI).
+2. Per-model values defined in `recipe_options.json` (used when `--use-recipe` is active).
+3. Global environment variables (e.g., `LEMONADE_MAX_LOADED_MODELS`).
+4. Hardcoded system defaults.
+
+For community-tested configurations optimized for specific hardware setups, check out the [Lemonade Recipes Wiki](https://github.com/lemonade-sdk/lemonade/wiki/Recipes) (Note: these are currently a work in progress).
+
+## What's realistic to achieve?
+
+Local agents are incredibly useful, but they have different strengths than the giant models running in the cloud.
+
+**Where they work well:**
+Local models work well for focused, well-defined tasks. If you need to refactor a specific module or generate some boilerplate, they work great.
+
+**Where they fall short:**
+Local models aren't quite ready to handle massive, project-wide refactors that touch dozens of interdependent files. They can also sometimes lose their way if they hit an unexpected error halfway through a complex task.
+
+## Troubleshooting
+
+### Login Prompt appears
+If Claude Code asks you to log in to Anthropic, it means the environment variables weren't picked up correctly. Ensure you are using `lemonade-server launch claude` rather than calling `claude` directly.
+
+### Performance is Slow
+- Verify that `Flash Attention` is enabled in your `llamacpp-args`.
+- Check that no other heavy applications are consuming your GPU resources.
+- If you are using a very large context (`-c`), the "prefill" time (time to first token) will increase.
+
+### "Permission Denied" when editing files
+Claude Code may ask for permission to run commands or edit files. You can run it with the `--dangerously-skip-permissions` flag if you trust the model in a sandbox, but manual approval is recommended for safety.
+
+---
+*For more information on Claude Code's capabilities, visit the [Anthropic Documentation](https://docs.anthropic.com/en/docs/agents-and-tools/claude-code).*
diff --git a/src/cpp/tray/tray_app.cpp b/src/cpp/tray/tray_app.cpp
index 799646aa8..bb17c9ed0 100644
--- a/src/cpp/tray/tray_app.cpp
+++ b/src/cpp/tray/tray_app.cpp
@@ -96,7 +96,7 @@ static std::string trim_whitespace(const std::string& value) {
 }
 
 static std::string build_launch_llamacpp_args(const lemon::TrayConfig& tray_config) {
-    static const std::string default_args = "-b 16384 -ub 16384 -fa on";
+    static const std::string default_args = "-b 4096 -ub 1024 -fa on";
     const std::string trimmed_user_args = trim_whitespace(tray_config.launch_llamacpp_args);
 
     if (tray_config.launch_use_recipe) {