Commit 0228952

Merge pull request #81 from alimaredia/ilab-model-serve

Design for serving models with different backends

2 files changed: +146 -0 lines

.spellcheck-en-custom.txt (+4)

@@ -49,6 +49,7 @@ GiB
 Gmail
 gpu
 Guang
+hardcoded
 hipBLAS
 ilab
 impactful
@@ -124,6 +125,8 @@ Shivchander
 Signoff
 Sigstore
 Srivastava
+subcommand
+subcommands
 subdirectory
 Sudalairaj
 Taj
@@ -145,6 +148,7 @@ USM
 UX
 venv
 Vishnoi
+vllm
 watsonx
 Wikisource
 wikisql

docs/cli/ilab-model-serve-backend.md (new file, +142 lines)
# Design for `ilab model serve` command with backend support

## Background

With the [request from the community](https://github.com/instructlab/instructlab/issues/1106) for `ilab` to serve models with different backends such as [vllm](https://docs.vllm.ai/en/stable/), and with the [CLI redesign](ilab-model-backend.md), this design doc's purpose is to flesh out the behavior of the `ilab model serve` command.

Specifically, this doc addresses the design of the subcommands and flags of `ilab model serve` that apply to different serving backends.
## Design

### Backend

Since the subject of the `ilab model serve` command is a model, every command takes the `--model` flag or uses its default value from the config, regardless of the model's format.

`ilab model serve` gains a new `--backend` flag that selects the backend used to serve the model. As of this design, the two backends `ilab` would serve with are `llama-cpp` and `vllm`.

This would lead to the commands:

- `ilab model serve --backend llama-cpp`
- `ilab model serve --backend vllm`

Some `ilab model serve` flags apply to all backends; these can be viewed by running `ilab model serve --help`.
The following is an overview of the flags of `ilab model serve`:

```console
ilab model serve
|
|_______ (backend agnostic flags)
|
|_______ --backend ['llama-cpp', 'vllm']
|_______ --backend-args
```
The `backend` flag will also be available as an option in the config file (`config.yaml`). This will allow users to set a default backend for `ilab model serve` in the config. Also, commands like `ilab model chat` and `ilab data generate` that serve models in the background will use the default backend specified in the config. Here is an example of what the config file would look like:

```yaml
serve:
  gpu_layers: -1
  host_port: 127.0.0.1:8000
  max_ctx_size: 4096
  model_path: models/merlinite-7b-lab-Q4_K_M.gguf
  backend: llama-cpp
```
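With this config, `ilab model serve` and the background-serving commands would default to `llama-cpp`. Presumably (this is an assumption of this note, not something the design states) the command-line flag would still override the configured default:

```shell
# Uses the backend set in config.yaml (llama-cpp in the example above).
ilab model serve

# Overrides the configured default for this invocation only.
ilab model serve --backend vllm
```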
### Backend flags

The `--backend-args` flag is a string that is passed to the backend as its arguments; it is the mechanism for passing backend-specific arguments. Multiple values will be supported, though the exact formatting will be defined in the implementation proposal. The backend will be responsible for parsing the individual arguments.
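For illustration, one possible formatting (an assumption of this note, since the final formatting is still to be decided) is a single quoted string of space-separated backend flags:

```shell
# Illustrative only: the exact --backend-args formatting is not yet decided.
ilab model serve --model <PATH> --backend llama-cpp \
    --backend-args '--num-gpu-layers 4 --max-ctx-size 1024'
```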
It will also be available as an option in the config file (`config.yaml`). This will allow users to set default backend arguments for `ilab model serve` in the config. Here is an example of what the config file would look like:

```yaml
serve:
  backend: llama-cpp
  backend_args:
    num_gpu_layers: 4
    max_ctx_size: 1024
```
For clarity and ease of implementation, when using the `--backend-args` flag the user must pass the `--backend` flag as well. This is to ensure that the backend-specific arguments are passed to the correct backend. Any backend-specific arguments that are not valid for the selected backend will be reported as an error.
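For example (illustrative argument values):

```shell
# Rejected: --backend-args used without --backend.
ilab model serve --model <PATH> --backend-args '--num-gpu-layers 4'

# Accepted: the target backend is named explicitly.
ilab model serve --model <PATH> --backend llama-cpp --backend-args '--num-gpu-layers 4'
```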
## Command Examples

### Bare-bones but model specific command

```shell
ilab model serve --model <PATH>
```

- Serves the model at `<PATH>`.
- If `<PATH>` points to a model that can be run by `llama-cpp`, then `llama-cpp` is automatically used as the model serving backend. The auto-detection logic relies on a valid GGUF file format: if the model is a valid GGUF file, `llama-cpp` is used as the serving backend (see the sketch after this list).
- If `<PATH>` points to a model that can be run by `vllm`, then `vllm` is automatically used as the model serving backend.
- If the model at `<PATH>` can be run by either backend, then the default backend defined in the config will be used as the model serving backend. In the case where there is ambiguity and no setting is defined, a hardcoded preference will be used (none of the currently supported backends has this issue). A future profile specification will likely replace the hardcoded fallback.
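The following is a minimal sketch of that auto-detection, assuming the only signal is the GGUF magic bytes at the start of the file (GGUF files begin with the ASCII magic `GGUF`); the real implementation may look quite different:

```shell
# Sketch only: pick a backend based on whether the model looks like a GGUF file.
MODEL_PATH="$1"
if [ -f "$MODEL_PATH" ] && [ "$(head -c 4 "$MODEL_PATH")" = "GGUF" ]; then
    echo "llama-cpp"   # valid-looking GGUF file
else
    echo "vllm"        # e.g. a Hugging Face model directory
fi
```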
### Bare-bones command

```shell
ilab model serve
```

- This command has the same behavior as the one above, but `--model` falls back to the default model path in the config. This is the existing behavior of `ilab serve` today.
### Llama-cpp backend specific commands

```shell
ilab model serve --model <PATH> --backend llama-cpp --backend-args '--num-gpu-layers 4'
```

- This command serves a model with `llama-cpp`.
- If the provided model cannot be served by `llama-cpp`, this command would error out and suggest an alternate backend to use.
- The existing flags to `ilab serve` (besides `--model-path` and `--log-file`) are now specific to the `llama-cpp` backend.
### vllm backend specific commands

```shell
ilab model serve --model <PATH> --backend vllm --backend-args '--chat-template <PATH>'
```

- This command serves a model with `vllm`.
- If the provided path cannot be served by `vllm`, this command would error out and suggest an alternate backend to use.
- There are [dozens](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server) of flags for `vllm`. Whichever arguments the community deems the most important will be added as dedicated flags for the `vllm` backend of `ilab model serve`.
- Any remaining arguments can be specified in the value of the `--backend-args` flag.
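For example, several vllm server arguments could be combined in one `--backend-args` string; the `--dtype` flag below comes from vllm's server arguments and is only illustrative here:

```shell
ilab model serve --model <PATH> --backend vllm \
    --backend-args '--chat-template <PATH> --dtype float16'
```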
## Testing

An additional end-to-end test will be added for a new backend of `ilab model serve`. This new test should be triggered whenever the serving code for the new backend changes, and before a release.

This new test will do the following (a sketch of the flow follows the list):

1. Initialize ilab in a virtual env via `ilab config init`.
2. Download a model via `ilab model download`.
3. Serve the downloaded model with the new backend via `ilab model serve`.
4. Generate synthetic data using the served model via `ilab data generate`.
5. Chat with the served model via `ilab model chat`.
6. Any future commands that interact with a served model should be added to the test.
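A possible shape for that test, sketched in shell (the paths, flags, and `<NEW_BACKEND>` placeholder are illustrative, not part of this design):

```shell
# Sketch of the end-to-end flow; a real test would also wait for the server to be ready.
python -m venv venv && . venv/bin/activate
pip install instructlab

ilab config init                               # 1. initialize ilab in the venv
ilab model download                            # 2. download a model
ilab model serve --backend <NEW_BACKEND> &     # 3. serve with the new backend
SERVE_PID=$!
ilab data generate                             # 4. generate synthetic data
ilab model chat                                # 5. chat with the served model
kill "$SERVE_PID"
```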
Some commands, like `ilab model chat` and `ilab data generate`, serve models in the background as part of the command. If automatic serving with a new backend is implemented for such a command, testing of that command will also be included in the new end-to-end test.
## Handling existing backend-specific commands

The existing `ilab model serve` command has flags that are specific to the `llama-cpp` backend. The current list of flags is:

- `--num-gpu-layers`
- `--max-ctx-size`
- `--num-threads`

These flags will be moved to `--backend-args` and will be used as the default arguments for the `llama-cpp` backend. This will allow for a more consistent experience across backends. The old flags will be supported for up to two releases after the release of the new backend; after that, they will be removed. During those two releases, a warning will be printed when any of them is used.
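For illustration (assuming the space-separated `--backend-args` formatting sketched earlier, which is not final), the migration would look roughly like this:

```shell
# Deprecated form: prints a warning for two releases, then is removed.
ilab model serve --model <PATH> --num-gpu-layers 4 --max-ctx-size 1024

# Replacement form:
ilab model serve --model <PATH> --backend llama-cpp \
    --backend-args '--num-gpu-layers 4 --max-ctx-size 1024'
```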
