# Design for `ilab model serve` command with backend support

## Background

With the [request from the community](https://github.com/instructlab/instructlab/issues/1106) for `ilab` to serve different backends such as [vllm](https://docs.vllm.ai/en/stable/) and the [cli redesign](ilab-model-backend.md), this design doc's purpose is to flesh out the behavior of the `ilab model serve` command.

Specifically, this doc addresses the design of subcommands of `ilab model serve` that apply to
different serving backends.

## Design

### Backend

Since the subject of the `ilab model serve` command is a model, every invocation, regardless of the model's format, takes the `--model` flag or uses its default value from the config.

`ilab model serve` gains a new flag, `--backend`, that selects the backend used to serve the model. As of this design, the two backends `ilab` would serve with are `llama-cpp` and `vllm`.

This would lead to the commands:

- `ilab model serve --backend llama-cpp`
- `ilab model serve --backend vllm`

There are specific flags for `ilab model serve` that apply to all backends. These can be viewed by running `ilab model serve --help`.

The following is an overview of the flags of `ilab model serve`:

```console
ilab model serve
|
|_______ (backend agnostic flags)
|
|_______ --backend ['llama-cpp', 'vllm']
|_______ --backend-args
```

The `backend` flag will also be available as an option in the config file (`config.yaml`). This will allow users to
set a default backend for `ilab model serve` in the config. Also, commands like `ilab model chat`
and `ilab data generate` that serve models in the background will use the default backend specified
in the config. Here is an example of what the config file would look like:

```yaml
serve:
  gpu_layers: -1
  host_port: 127.0.0.1:8000
  max_ctx_size: 4096
  model_path: models/merlinite-7b-lab-Q4_K_M.gguf
  backend: llama-cpp
```

### Backend flags

The `--backend-args` flag is a string of backend-specific arguments that will be passed through to the selected
backend. Multiple values will be supported; however, the exact formatting will be defined in the implementation
proposal. The backend will be responsible for parsing individual arguments.

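
Since the exact formatting is deferred to the implementation proposal, the following is only a minimal sketch of one plausible approach, assuming a shell-style quoted string and a hypothetical `serve_backend` entry point:

```python
import shlex


def parse_backend_args(backend_args: str | None) -> list[str]:
    """Split a --backend-args string into individual arguments.

    Assumes shell-style quoting, e.g. "--num-gpu-layers 4 --chat-template 'my template.jinja'".
    The real format will be defined in the implementation proposal.
    """
    return shlex.split(backend_args) if backend_args else []


# Hypothetical usage: hand the raw argument list to the chosen backend,
# which is responsible for parsing and validating its own options.
extra_args = parse_backend_args("--num-gpu-layers 4 --max-ctx-size 1024")
# serve_backend("llama-cpp", model_path="models/merlinite-7b-lab-Q4_K_M.gguf", extra_args=extra_args)
```

Splitting with `shlex` keeps quoting rules familiar to shell users, but the final format may differ.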

The `--backend-args` flag will also be available as an option in the config file (`config.yaml`). This will allow users to set default backend arguments for `ilab model serve` in the config. Here is an example of what the config file would look like:

```yaml
serve:
  backend: llama-cpp
  backend_args:
    num_gpu_layers: 4
    max_ctx_size: 1024
```

For clarity and ease of implementation, when using the `--backend-args` flag, the user must pass the
`--backend` flag as well. This ensures that the backend-specific arguments are passed to the
correct backend. Any backend-specific argument that does not apply to the selected backend will be
reported as an error.

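
As an illustration of how such an error could be reported, the sketch below checks the parsed arguments against a hypothetical per-backend allow-list; the option names are placeholders, not a committed interface:

```python
# Hypothetical per-backend allow-lists; the real lists would come from each backend.
SUPPORTED_BACKEND_ARGS = {
    "llama-cpp": {"--num-gpu-layers", "--max-ctx-size", "--num-threads"},
    "vllm": {"--chat-template", "--tensor-parallel-size"},
}


def validate_backend_args(backend: str, args: list[str]) -> None:
    """Reject any option that the selected backend does not recognize."""
    allowed = SUPPORTED_BACKEND_ARGS[backend]
    unknown = [a for a in args if a.startswith("--") and a not in allowed]
    if unknown:
        raise ValueError(f"arguments {unknown} are not supported by the '{backend}' backend")


validate_backend_args("llama-cpp", ["--num-gpu-layers", "4"])          # passes
# validate_backend_args("llama-cpp", ["--chat-template", "x.jinja"])   # raises ValueError
```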

## Command Examples

### Bare-bones but model specific command

```shell
ilab model serve --model <PATH>
```

- Serves the model at `<PATH>`.
- If `<PATH>` points to a model that can be run by `llama-cpp`, then `llama-cpp` is automatically
  used as the model serving backend. The current auto-detection logic relies on the GGUF file
  format: if the model is a valid GGUF file, `llama-cpp` is used as the model serving backend (see the sketch after this list).
- If `<PATH>` points to a model that can be run by `vllm`, then `vllm` is automatically used as the model serving backend.
- If the model at `<PATH>` can be run by either backend, then the default backend defined in the
  config will be used as the model serving backend. If there is ambiguity and no default is set, a hardcoded preference will be used (none of the currently supported backends run into this case). A future profile specification will likely replace the hardcoded fallback.

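
A minimal sketch of the GGUF-based auto-detection is shown below. It assumes detection by magic bytes (GGUF files begin with the ASCII bytes `GGUF`) and, as a simplification, treats everything else as servable by `vllm`; the final logic may differ:

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four magic bytes


def detect_backend(model_path: str, default_backend: str = "vllm") -> str:
    """Pick a serving backend for the model at model_path.

    A file starting with the GGUF magic bytes implies llama-cpp; anything else
    is assumed (as a simplification) to be servable by the default backend.
    """
    path = Path(model_path)
    if path.is_file():
        with path.open("rb") as f:
            if f.read(4) == GGUF_MAGIC:
                return "llama-cpp"
    return default_backend


# e.g. detect_backend("models/merlinite-7b-lab-Q4_K_M.gguf") -> "llama-cpp"
```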

### Bare-bones command

```shell
ilab model serve
```

- This command has the same behavior as the one above, but `--model` falls back to the default model path in the config. This is the existing behavior of `ilab serve` today.

### Llama-cpp backend specific commands

```shell
ilab model serve --model <PATH> --backend llama-cpp --backend-args '--num-gpu-layers 4'
```

- This command serves a model with `llama-cpp`.
- If the provided model cannot be served by `llama-cpp`, this command errors out and suggests an alternate backend to use.
- The existing flags to `ilab serve` (besides `--model-path` & `--log-file`) become specific to the `llama-cpp` backend.

### vllm backend specific commands

```shell
ilab model serve --model <PATH> --backend vllm --backend-args '--chat-template <PATH>'
```

- This command serves a model with `vllm`.
- If the provided path cannot be served by `vllm`, this command errors out and suggests an alternate backend to use.
- There are [dozens](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server) of flags for vllm. Whichever arguments the community deems the most important will be added as first-class flags to `ilab model serve`.
- Any remaining arguments can be specified in the value of the `--backend-args` flag.

## Testing

An additional end-to-end test will be added for each new backend of `ilab model serve`. This new test should be triggered whenever code changes are made to the new backend serving code, or before a release.

This new test will do the following (a sketch of such a test follows the list):

1. Initialize ilab in a virtual env via `ilab config init`.
2. Download a model via `ilab model download`.
3. Serve the downloaded model with the new backend via `ilab model serve`.
4. Generate synthetic data using the served model via `ilab data generate`.
5. Chat with the served model via `ilab model chat`.
6. Any future commands that interact with a served model should be added to the test.

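
A rough sketch of such a test, written as a Python script that shells out to `ilab`, is shown below; the wait time, the backgrounding approach, and the absence of non-interactive flags are placeholders, and the real CI job may be structured differently:

```python
import subprocess
import time


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Steps 1-2: set up ilab and fetch the default model.
run(["ilab", "config", "init"])
run(["ilab", "model", "download"])

# Step 3: serve the downloaded model with the new backend in the background.
server = subprocess.Popen(["ilab", "model", "serve", "--backend", "vllm"])
time.sleep(60)  # placeholder: wait for the server to come up

try:
    # Steps 4-5: exercise commands that talk to the served model.
    run(["ilab", "data", "generate"])
    run(["ilab", "model", "chat"])
finally:
    server.terminate()
    server.wait()
```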

Some commands, like `ilab model chat` and `ilab data generate`, serve models in the background as part of the command. If automatic serving of a new backend is implemented for a command, testing of that command will also be included in the new end-to-end test.

## Handling existing backend-specific commands

The existing `ilab model serve` command has flags that are specific to the `llama-cpp` backend. The current list of flags is:

- `--num-gpu-layers`
- `--max-ctx-size`
- `--num-threads`

These flags will be moved to `--backend-args` and will be used as the default arguments for the
`llama-cpp` backend. This will allow for a more consistent experience across backends. The flags will
continue to be supported for up to two releases after the release of the new backend; after that, they will be
removed. During those two releases, a warning will be printed to the user whenever one of the deprecated flags is used.

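
For illustration only, the deprecation path could translate the old flags into equivalent `--backend-args` entries and warn on each use; the mapping and warning text below are assumptions rather than the committed implementation:

```python
import warnings

# Hypothetical mapping from deprecated `ilab serve` flags to llama-cpp backend arguments.
DEPRECATED_FLAG_MAP = {
    "--num-gpu-layers": "--num-gpu-layers",
    "--max-ctx-size": "--max-ctx-size",
    "--num-threads": "--num-threads",
}


def translate_deprecated_flags(cli_flags: dict[str, str]) -> list[str]:
    """Convert deprecated top-level flags into --backend-args entries, warning on each use."""
    backend_args: list[str] = []
    for flag, value in cli_flags.items():
        if flag in DEPRECATED_FLAG_MAP:
            warnings.warn(
                f"{flag} is deprecated and will be removed; pass it via --backend-args instead",
                DeprecationWarning,
            )
            backend_args += [DEPRECATED_FLAG_MAP[flag], value]
    return backend_args


# e.g. translate_deprecated_flags({"--num-gpu-layers": "4"})
# -> ["--num-gpu-layers", "4"] plus a DeprecationWarning
```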