Releases: mostlygeek/llama-swap

v220

31 May 00:07
03d58e5

Another release? Who needs to touch grass?

This release includes a small patch by first-time contributor @Luiszzzor. In addition, a new load and concurrency testing tool has been added to the Playground. It's easier to show than tell, so check out this demo video with a swap matrix example:

llama-swap-concurrency.mp4

Side note: I forgot to mention in the video that to support more than 5 or 6 concurrent requests (depending on your browser) you will need a valid TLS certificate. llama-swap supports HTTP/2, but browsers only use HTTP/2 over HTTPS. In the demo I run it on my tailnet with tailscale serve 8080, which generates the Let's Encrypt certificate for me.

Changelog

  • 03d58e5 Add load testing tool to the UI (#805)
  • c790d0e fix: update the concurrency middleware to respond with a JSON payload (#798)

v219 (fixes v218)

29 May 22:31
4ca9c47

Notes

These notes include the details for v218 (broken) and PR #790.

llama-swap has a new routing backend. What started as a small experiment to improve concurrency handling exploded into a full refactor of the backend. For users, the biggest change is that swapping is more efficient. Requests are collated so that requests for models that are already loaded take precedence over those that are awaiting loading.

For a queue of alternating requests for models A and B, it looks like:

new router: A B A B A B -> A A A B B B
old router: A B A B A B -> A B A B A B 

However, that alone wouldn't require a 12,009-line PR. There were many architectural changes that make life a bit easier for developers. Redundant code was removed, the repo organization is now centered on the internal/ packages, new funny loading remarks were added, etc.

Also, a new concurrency tester sneaked in under cmd/concurrency-tester.

Changelog

  • 4ca9c47 Makefile,internal/server: various release tweaks
  • 146a9ea ui-svelte: update build directory (#801)

v218 (broken DO NOT USE)

29 May 16:57
02e015f

This one has a bug where the release did not include the UI. Use v219 instead.

Changelog

  • 02e015f Introduce new routing backend (#790)
  • 63bc266 Add new power draw column header for rocm-smi monitoring (#788)

v217

22 May 07:20
636b53e

Changelog

  • 636b53e Improve rocm-smi performance monitoring (#775)
  • 59cd3b6 Added Windows performance monitoring using nvidia-smi (#773)
  • 5d1e62d Disable auto review feature in coderabbit config
  • dbb869d Increase inactivity thresholds for stale issues
  • 26bb17e config.example.yaml: Improve matrix vs groups info

v216

17 May 18:48
2982dd3

Changelog

  • 2982dd3 ui-svelte: update link to performance discussion thread

v215

17 May 17:28
79dc87f

Adds ROCm support to the new experimental performance monitor.
Thank you to @knguyen298 for this patch.

Changelog

v214

15 May 23:49
b2fcc2d

This release fixes a couple of small bugs in the UI and in the new performance monitor.

Contributors

  • @krzychdre (#760) for finding and fixing the negative token counts in the UI
  • @cdwaage (#759) for fixing the bug in the nvidia-smi fallback for the performance monitor

Changelog

  • b2fcc2d ui-svelte: fix cached tokens total counting -1 sentinel (#760)
  • 6a9c4ef fix: use --loop instead of -loop for nvidia-smi (driver 540+ compat) (#759)

v213

15 May 04:59
0c813e4

Changelog

  • 0c813e4 ui-svelte: package updates
  • fe71e8a proxy,ui-svelte: improve support for v1/messages and v1/responses (#758)

v212

14 May 05:14
aac7b87

This release packs a lot in. It introduces a new experimental performance monitor, available for Linux machines first. In the UI there is a new tab that shows up to the last hour of statistics:

[Screenshot: the new performance monitor tab in the UI]

Additionally, there is a /metrics endpoint for the common Prometheus and Grafana combo. An example Grafana dashboard is provided to get you started. It looks like this:

[Screenshot: the example Grafana dashboard]

Other small changes

  • Versionless API endpoints that do not require the v1/ prefix were added. These help with upstream peers, like z.ai, that do not follow the v1 versioning convention.
  • The -watch-config system has been refactored. It now supports mounting the config file directly into a Docker container, which removes the requirement to mount a directory containing the config.

Contributions from the community

Many thanks to @bankjaneo (#741), @rhtenhove (#746), and @sousekd (#753).

Changelog

  • aac7b87 ci: set go-version-file in release workflow
  • 4e606fe ci: fix workflow bugs in release and go-ci
  • a4b91e0 Changes and fixes before the release (docs/small tweaks) (#750)
  • 3e3646f perf: ignore LACT devices reporting zero VRAM (#753)
  • a01afe2 ci: use manifest-aware cleanup action for multi-arch :cpu (#751)
  • 174e856 Multi arch cpu (#746)
  • 085b54b proxy: fix data race in /running endpoint and typo in error message (#748)
  • 2be3416 ui: add auto theme switch mode based on system theme (#741)
  • 7e3e94a proxy,ui: add performance monitoring with Prometheus metrics (#743)
  • e261745 proxy: add versionless API endpoint (#733)
  • 11b7913 llama-swap.go: remove debounce, replace fmt.Printlns (#731)

v211

02 May 19:22
c79114d

Changelog

  • c79114d proxy: fix logger not checking matrix for processes