This repository was archived by the owner on Nov 24, 2025. It is now read-only.
README.md: 49 additions & 16 deletions
@@ -34,16 +34,21 @@ LLaMA Box supports the following platforms.
> The GGUF model files used in the following examples are downloaded via LM Studio.

- Chat completion via [Nous-Hermes-2-Mistral-7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO)
- model.
+ model. Use GGUF files
+ from [NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF/tree/main?show_file_info=Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf).
```shell
$ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
$ ./llama-box/tools/chat.sh "Introduce Beijing in 50 words."
```
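
The session comment in the example above implies a simple token budget. As a minimal sketch, assuming the server divides one shared context window evenly among parallel sessions (an assumption carried over from llama.cpp-style servers, not something this README states), the total context size can be worked out as below. The variable names are illustrative only.

```shell
# Assumption: one shared context window is split evenly across parallel sessions.
sessions=4                 # parallel chat users
tokens_per_session=2048    # max tokens each session may hold

# Total context the server would need to reserve under that assumption.
total_context=$((sessions * tokens_per_session))
echo "total context tokens: $total_context"
```

With these values the script reports 8192 tokens. By the same arithmetic, a setup with 2 sessions of 8192 tokens each would need 16384.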

- - Legacy completion via [GLM-4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model.
+ - Legacy completion via [GLM-4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model. Use GGUF files
+ from [second-state/glm-4-9b-chat-GGUF](https://huggingface.co/second-state/glm-4-9b-chat-GGUF/tree/main?show_file_info=glm-4-9b-chat-Q5_K_M.gguf).
```shell
$ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
@@ -52,26 +57,52 @@ LLaMA Box supports the following platforms.
$ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "glm4", "prompt": "<|system|>You are a helpful assistant.<|user|>Tell me a joke.<|assistant|>"}'
```
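
The prompt in the request above is a flat string in which role markers wrap each message. As a sketch, the hypothetical helpers below (`glm4_prompt` and `glm4_payload` are not part of LLaMA Box) assemble that string from a system message and a user message, then wrap it in the JSON body. Only backslashes and double quotes are escaped, so messages containing newlines or control characters would need a real JSON encoder instead.

```shell
# Hypothetical helper: join a system and a user message with the role
# markers used in the request above.
glm4_prompt() {
  printf '<|system|>%s<|user|>%s<|assistant|>' "$1" "$2"
}

# Hypothetical helper: build the JSON request body for the prompt.
# Escapes backslashes first, then double quotes; nothing else is handled.
glm4_payload() {
  _escaped=$(glm4_prompt "$1" "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model": "glm4", "prompt": "%s"}' "$_escaped"
}

payload=$(glm4_payload "You are a helpful assistant." "Tell me a joke.")
echo "$payload"

# To send it, assuming the server from the example above listens on localhost:8080:
# curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d "$payload"
```

Keeping the template in one function means that a change to the role markers touches a single line rather than every request.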

- - Vision explanation via [LLaVA-Phi-3-Mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) model.
+ - Vision explanation via [LLaVA-Phi-3-Mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) model. Use GGUF files
+ from [xtuner/llava-phi-3-mini-gguf](https://huggingface.co/xtuner/llava-phi-3-mini-gguf/tree/main?show_file_info=llava-phi-3-mini-f16.gguf).
```shell
$ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
- - Speculative decoding via [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
- and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) models.
+ - Draft model speculative decoding via [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
+ and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) models. Use GGUF files
+ from [QuantFactory/Qwen2-7B-Instruct-GGUF](https://huggingface.co/QuantFactory/Qwen2-7B-Instruct-GGUF/tree/main?show_file_info=Qwen2-7B-Instruct.Q5_K_M.gguf)
+ and [QuantFactory/Qwen2-1.5B-Instruct-GGUF](https://huggingface.co/QuantFactory/Qwen2-1.5B-Instruct-GGUF/tree/main?show_file_info=Qwen2-1.5B-Instruct.Q5_K_M.gguf).
```shell
$ # Provide 4 sessions (allowing 4 parallel chat users), with a max of 2048 tokens per session.
$ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "qwen2", "stream": true, "prompt": "Write a short story about a cat and a dog, more than 100 words."}'
+
+ $ # or use the chat.sh tool
+ $ ./llama-box/tools/chat.sh "Write a short story about a cat and a dog, more than 100 words."
```
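
With `"stream": true`, the response arrives incrementally rather than as one JSON document. The following is a minimal sketch for reading it, assuming the server uses OpenAI-style server-sent events: `data: <json>` lines that end with a `data: [DONE]` sentinel. The `sse_chunks` helper and the canned sample are illustrative and not taken from LLaMA Box.

```shell
# Hypothetical helper: read an event stream on stdin, print each JSON chunk
# on its own line, and stop at the "[DONE]" sentinel.
# Assumes OpenAI-style framing: "data: <json>" lines separated by blank lines.
sse_chunks() {
  while IFS= read -r line; do
    line=${line%"$(printf '\r')"}   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Self-contained demonstration with canned input instead of a live server.
sample=$(printf 'data: {"choices":[{"text":"Once"}]}\n\ndata: {"choices":[{"text":" upon"}]}\n\ndata: [DONE]\n')
chunks=$(printf '%s\n' "$sample" | sse_chunks)
printf '%s\n' "$chunks"

# Against a running server, assuming it listens on localhost:8080 (-N disables curl's output buffering):
# curl -N http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "qwen2", "stream": true, "prompt": "Write a short story about a cat and a dog, more than 100 words."}' | sse_chunks
```

Each printed line is one JSON chunk, so a JSON-aware tool can be chained after `sse_chunks` to extract the generated text.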
+
+ - Lookup speculative decoding
+ via [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) model. Use GGUF files
+ from [QuantFactory/Mistral-Nemo-Instruct-2407-GGUF](https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF/tree/main?show_file_info=Mistral-Nemo-Instruct-2407.Q5_K_M.gguf).
+
```shell
+ $ # Provide 2 sessions (allowing 2 parallel chat users), with a max of 8192 tokens per session.