Releases: mostlygeek/llama-swap
v220
Another release? Who needs to touch grass?
This release includes a small patch by first-time contributor @Luiszzzor. A new load and concurrency testing tool has also been added to the Playground. It's easier to show than tell, so check out this demo video with a swap matrix example:
llama-swap-concurrency.mp4
Side note: I forgot to mention in the video that to support more than 5 or 6 concurrent requests (depending on your browser) you will need a valid TLS certificate. llama-swap supports HTTP/2, but that requires HTTPS. In the demo I run it on my tailnet with tailscale serve 8080, which generates the Let's Encrypt cert for me.
Changelog
v219 (fixes v218)
Notes
Including details for v218 (broken) and PR #790.
llama-swap has a new routing backend. What started as a small experiment to improve the concurrency handling exploded into a full refactor of the backend. For users, the biggest change is that swapping is more efficient. Requests are collated so that requests for models that are already loaded take precedence over those that are awaiting loading.
It looks like:
new router: A B A B A B -> A A A B B B
old router: A B A B A B -> A B A B A B
However, just doing that wouldn't require a 12,009-line PR. There were a lot of architectural changes that make developer quality of life a bit easier. Redundant code was removed, repo organization is centralized around the internal/ packages, new funny loading remarks were added, etc.
Also, a new concurrency tester sneaked in under cmd/concurrency-tester.
Changelog
v218 (broken DO NOT USE)
v217
Changelog
v216
v215
Adds ROCm support to the new experimental performance monitor.
Thank you to @knguyen298 for this patch.
Changelog
v214
This release fixes a couple of small bugs in the UI and the new performance monitor.
Contributors
- @krzychdre (#760) for finding and fixing the negative counting in the UI
- @cdwaage (#759) for fixing the bug in the nvidia-smi fallback for the performance monitor
Changelog
v213
v212
This release packs a lot into it. It introduces a new experimental performance monitor, for Linux machines first. In the UI there is a new tab that shows up to the last hour of statistics:
Additionally, a /metrics endpoint was added for the common Prometheus and Grafana combo. A Grafana dashboard example is provided to get you started. It looks like this:
Other small changes
- versionless API endpoints were added that do not require the v1/ prefix. These help with upstream peers like z.ai that do not follow the v1 versioning convention
- the -watch-config system has been refactored. It now supports mounting the config file directly into a Docker container. This removes the requirement to mount a directory with the config in it.
Contributions from the community
Much thanks to @bankjaneo (#741), @rhtenhove (#746), @sousekd (#753).
Changelog
- aac7b87 ci: set go-version-file in release workflow
- 4e606fe ci: fix workflow bugs in release and go-ci
- a4b91e0 Changes and fixes before the release (docs/small tweaks) (#750)
- 3e3646f perf: ignore LACT devices reporting zero VRAM (#753)
- a01afe2 ci: use manifest-aware cleanup action for multi-arch :cpu (#751)
- 174e856 Multi arch cpu (#746)
- 085b54b proxy: fix data race in /running endpoint and typo in error message (#748)
- 2be3416 ui: add auto theme switch mode based on system theme (#741)
- 7e3e94a proxy,ui: add performance monitoring with Prometheus metrics (#743)
- e261745 proxy: add versionless API endpoint (#733)
- 11b7913 llama-swap.go: remove debounce, replace fmt.Printlns (#731)