This repository was archived by the owner on Nov 24, 2025. It is now read-only.

Commit f0c64df

feat: support lookup
Signed-off-by: thxCode <thxcode0824@gmail.com>
1 parent: 33d7012

File tree

5 files changed: +250 −97 lines changed


README.md

Lines changed: 49 additions & 16 deletions
@@ -34,16 +34,21 @@ LLaMA Box supports the following platforms.
 > The GGUF model files used in the following examples are downloaded via LM Studio.
 
 - Chat completion via [Nous-Hermes-2-Mistral-7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO)
-  model.
+  model. Use GGUF files
+  from [NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF/tree/main?show_file_info=Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf).
 
   ```shell
   $ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
-  $ llama-box -c 8192 -np 4 --host 0.0.0.0 -m ~/.cache/lm-studio/models/NousResearch/Nous-Hermes-2-Mistral-7B-DPO/Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf
+  $ llama-box -c 8192 -np 4 --host 0.0.0.0 -m ~/.cache/lm-studio/models/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF/Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf
 
-  $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen2", "messages": [{"role":"user", "content":"Introduce Beijing in 50 words."}]}'
+  $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "hermes2", "messages": [{"role":"user", "content":"Introduce Beijing in 50 words."}]}'
+
+  $ # or use the chat.sh tool
+  $ ./llama-box/tools/chat.sh "Introduce Beijing in 50 words."
   ```
 
-- Legacy completion via [GLM-4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model.
+- Legacy completion via [GLM-4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model. Use GGUF files
+  from [second-state/glm-4-9b-chat-GGUF](https://huggingface.co/second-state/glm-4-9b-chat-GGUF/tree/main?show_file_info=glm-4-9b-chat-Q5_K_M.gguf).
 
   ```shell
   $ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
@@ -52,26 +57,52 @@ LLaMA Box supports the following platforms.
   $ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "glm4", "prompt": "<|system|>You are a helpful assistant.<|user|>Tell me a joke.<|assistant|>"}'
   ```
 
-- Vision explanation via [LLaVA-Phi-3-Mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) model.
+- Vision explanation via [LLaVA-Phi-3-Mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) model. Use GGUF files
+  from [xtuner/llava-phi-3-mini-gguf](https://huggingface.co/xtuner/llava-phi-3-mini-gguf/tree/main?show_file_info=llava-phi-3-mini-f16.gguf).
 
   ```shell
   $ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
   $ llama-box -c 8192 -np 4 --host 0.0.0.0 -m ~/.cache/lm-studio/models/xtuner/llava-phi-3-mini-gguf/llava-phi-3-mini-f16.gguf --mmproj ~/.cache/lm-studio/models/xtuner/llava-phi-3-mini-gguf/llava-phi-3-mini-mmproj-f16.gguf
 
   $ IMAGE_URL="$(echo "data:image/jpeg;base64,$(curl https://llava.hliu.cc/file\=/nobackup/haotian/tmp/gradio/ca10383cc943e99941ecffdc4d34c51afb2da472/extreme_ironing.jpg --output - | base64)")"; \
-    echo "{\"model\": \"llava-phi-3\", \"temperature\": 0.1, \"messages\": [{\"role\":\"user\", \"content\": [{\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}, {\"type\": \"text\", \"text\": \"What is unusual about this image?\"}]}]}" > /tmp/llava-phi-3.json
+    echo "{\"model\": \"llava-phi-3\", \"temperature\": 0.1, \"stop\": [\"<|end|>\"], \"messages\": [{\"role\":\"user\", \"content\": [{\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}, {\"type\": \"text\", \"text\": \"What is unusual about this image?\"}]}]}" > /tmp/data.json
 
-  $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d @/tmp/llava-phi-3.json
+  $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d @/tmp/data.json
+
+  $ # or use the chat.sh tool
+  $ ./llama-box/tools/chat.sh @/tmp/data.json
   ```
 
-- Speculative decoding via [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
-  and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) models.
+- Draft model speculative decoding via [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
+  and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) models. Use GGUF files
+  from [QuantFactory/Qwen2-7B-Instruct-GGUF](https://huggingface.co/QuantFactory/Qwen2-7B-Instruct-GGUF/tree/main?show_file_info=Qwen2-7B-Instruct.Q5_K_M.gguf)
+  and [QuantFactory/Qwen2-1.5B-Instruct-GGUF](https://huggingface.co/QuantFactory/Qwen2-1.5B-Instruct-GGUF/tree/main?show_file_info=Qwen2-1.5B-Instruct.Q5_K_M.gguf).
 
   ```shell
   $ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
   $ llama-box -c 8192 -np 4 --host 0.0.0.0 -m ~/.cache/lm-studio/models/QuantFactory/Qwen2-7B-Instruct-GGUF/Qwen2-7B-Instruct.Q5_K_M.gguf -md ~/.cache/lm-studio/models/QuantFactory/Qwen2-1.5B-Instruct-GGUF/Qwen2-1.5B-Instruct.Q5_K_M.gguf --draft 8
 
   $ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "qwen2", "stream": true, "prompt": "Write a short story about a cat and a dog, more than 100 words."}'
+
+  $ # or use the chat.sh tool
+  $ ./llama-box/tools/chat.sh "Write a short story about a cat and a dog, more than 100 words."
+  ```
+
+- Lookup speculative decoding
+  via the [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) model. Use GGUF files
+  from [QuantFactory/Mistral-Nemo-Instruct-2407-GGUF](https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF/tree/main?show_file_info=Mistral-Nemo-Instruct-2407.Q5_K_M.gguf).
+
+  ```shell
+  $ # Provide 2 sessions (allowing 2 parallel chat users), with a max of 8192 tokens per session.
+  $ llama-box -c 16384 -np 2 --host 0.0.0.0 -m ~/.cache/lm-studio/models/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.Q5_K_M.gguf --lookup-ngram-min 1 --draft 8
+
+  $ CONTENT="$(curl https://en.wikipedia.org/w/api.php\?action\=query\&format\=json\&titles\=Medusa\&prop\=extracts\&exintro\&explaintext | jq '.query.pages | to_entries | .[0].value.extract | gsub("\n"; "\\n") | gsub("\t"; "\\t")')"; \
+    echo "{\"model\": \"mistral-nemo\", \"stream\": true, \"messages\": [{\"role\":\"user\", \"content\": [{\"type\": \"text\", \"text\": \"Please read the following content and summarize the article in 5 sentences.\"}, {\"type\": \"text\", \"text\": "$CONTENT"}]}]}" > /tmp/data.json
+
+  $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d @/tmp/data.json
+
+  $ # or use the chat.sh tool
+  $ ./llama-box/tools/chat.sh @/tmp/data.json
   ```
 
 ## Usage
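Unlike draft-model speculative decoding, lookup decoding drafts candidate tokens by matching n-grams that already appear in the context (prompt lookup), so it needs no second model and pays off most when the output echoes the prompt, as in the summarization request above. A quick way to see the effect is to time the same request with the feature disabled (`--lookup-ngram-min 0`, the documented default) and enabled; the timing harness below is a sketch, not part of this commit:

```shell
$ # /tmp/data.json is the summarization request built in the example above.
$ # Start the server once with --lookup-ngram-min 0 and once with 1, then compare.
$ time curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @/tmp/data.json > /dev/null
```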
@@ -88,10 +119,6 @@ general:
   -s,    --seed N                RNG seed (default: -1, use random seed for < 0)
   -t,    --threads N             number of threads to use during generation (default: 8)
   -tb,   --threads-batch N       number of threads to use during batch and prompt processing (default: same as --threads)
-  -lcs,  --lookup-cache-static FILE
-                                 path to static lookup cache to use for lookup decoding (not updated by generation)
-  -lcd,  --lookup-cache-dynamic FILE
-                                 path to dynamic lookup cache to use for lookup decoding (updated by generation)
   -c,    --ctx-size N            size of the prompt context (default: 0, 0 = loaded from model)
   -n,    --predict N             number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
   -b,    --batch-size N          logical maximum batch size (default: 2048)
@@ -215,11 +242,16 @@ logging:
 
 speculative:
 
+         --draft N               number of tokens to draft for speculative decoding (default: 5)
   -md,   --model-draft FNAME     draft model for speculative decoding (default: unused)
   -td,   --threads-draft N       number of threads to use during generation (default: same as --threads)
   -tbd,  --threads-batch-draft N number of threads to use during batch and prompt processing (default: same as --threads-draft)
-         --draft N               number of tokens to draft for speculative decoding (default: 5)
   -ngld, --gpu-layers-draft N    number of layers to store in VRAM for the draft model
+         --lookup-ngram-min N    minimum n-gram size for lookup cache (default: 0, 0 = disabled)
+  -lcs,  --lookup-cache-static FILE
+                                 path to static lookup cache to use for lookup decoding (not updated by generation)
+  -lcd,  --lookup-cache-dynamic FILE
+                                 path to dynamic lookup cache to use for lookup decoding (updated by generation)
 
 ```
 
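The new lookup flags combine with `--draft`: `--lookup-ngram-min` switches the feature on, and the optional cache files keep n-gram statistics around between runs. A minimal sketch, assuming the flags compose as their descriptions above suggest; the model path and cache location are placeholders, not taken from this commit:

```shell
$ # Enable lookup decoding for n-grams of size >= 2, and persist what
$ # generation learns into a dynamic cache that is reloaded on the next start.
$ llama-box -c 16384 -np 2 --host 0.0.0.0 \
    -m ~/models/Mistral-Nemo-Instruct-2407.Q5_K_M.gguf \
    --lookup-ngram-min 2 --draft 8 \
    -lcd ~/.cache/llama-box/lookup.dynamic
```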
@@ -287,7 +319,8 @@ speculative:
 
 ## Tools
 
-It was so hard to find a Chat UI that was directly compatible with OpenAI, that mean, no installation required (I can live
+It was so hard to find a Chat UI that was directly compatible with OpenAI, that means no installation required (I can
+live
 with `docker run`), no tokens (or optional), no [Ollama](https://github.com/ollama/ollama) required, just a simple
 RESTful API.
 
@@ -301,7 +334,7 @@ All you need is a Bash shell and curl.
 - **chat.sh**: A simple script to interact with the `/v1/chat/completions` endpoint.
 
 Both `completion.sh` and `chat.sh` are used for talking with the LLaMA Box,
-but `completion.sh` embeds a fixed pattern to format the given prompt format,
+but `completion.sh` embeds a fixed pattern to format the given prompt,
 while `chat.sh` can leverage the chat template from the model's metadata or a user-defined one.
 
 ```shell
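To make the difference concrete: the `chat.sh` calls below are taken from the examples earlier in this README, while the `completion.sh` call is a hypothetical invocation inferred from its description (its actual arguments may differ):

```shell
$ # chat.sh accepts an inline prompt or a prepared JSON body (both shown earlier).
$ ./llama-box/tools/chat.sh "Tell me a joke."
$ ./llama-box/tools/chat.sh @/tmp/data.json

$ # completion.sh (hypothetical usage): wraps the prompt in its fixed pattern
$ # and posts it to /v1/completions.
$ ./llama-box/tools/completion.sh "Once upon a time"
```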
