Punica: Serving multiple LoRA finetuned LLM as one

Demo

python examples/tui-multi-lora.py

Overview

Low-rank adaptation (LoRA) is a parameter-efficient way to add new knowledge to a pretrained LLM. Although the pretrained LLM takes hundreds of GB of storage, a LoRA finetuned model adds only about 1% storage and memory overhead. Punica enables running multiple LoRA finetuned models at the cost of running one.

How?

Assuming W of shape [H1, H2] is the weight of the pretrained model, LoRA adds two small matrices: A of shape [H1, r] and B of shape [r, H2]. Running an input x on the finetuned model gives y := x @ (W + A@B), which is the same as y := x@W + x@A@B.
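The identity above can be checked on a toy example. This is a minimal sketch in plain Python, with lists standing in for tensors so it runs without dependencies; the sizes (H1 = 3, H2 = 2, r = 1), the values, and the helper names `matmul` and `matadd` are assumptions made for illustration and are not part of Punica.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def matadd(a, b):
    """Add two matrices of the same shape element by element."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy shapes (assumed): H1 = 3, H2 = 2, rank r = 1. Integers keep the check exact.
W = [[1, 2], [3, 4], [5, 6]]  # pretrained weight, [H1, H2]
A = [[1], [0], [2]]           # LoRA matrix A, [H1, r]
B = [[3, -1]]                 # LoRA matrix B, [r, H2]
x = [[1, 2, 3]]               # one input row, [1, H1]

# Merged form: y = x @ (W + A@B)
merged = matmul(x, matadd(W, matmul(A, B)))

# Split form: y = x@W + (x@A)@B, which never materializes the [H1, H2] product A@B
split = matadd(matmul(x, W), matmul(matmul(x, A), B))

assert merged == split
```

The split form is the one that matters for serving: x@A has only r columns, so the LoRA term stays cheap and W can be shared across models.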

When there are n LoRA models, there will be A1, B1, A2, B2, ..., An, Bn. Given an input batch X := (x1,x2,...,xn) in which each xi maps to its own LoRA model, the output is Y := X@W + (x1@A1@B1, x2@A2@B2, ..., xn@An@Bn). The left-hand term, X@W, computes the input batch on the pretrained model. It is quite efficient: the latency is almost the same as when there is only one input, thanks to the strong batching effect.
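A naive way to evaluate this formula is one shared matmul for X@W followed by a per-request loop for the LoRA term. The sketch below does exactly that in plain Python; the toy values, the `loras` list, and the `matmul` helper are assumptions for illustration, and the loop is the slow baseline, not Punica's kernel.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


# Toy setup (assumed): H1 = H2 = 2, rank r = 1, n = 2 requests and 2 LoRA models.
W = [[1, 0], [0, 1]]              # shared pretrained weight
loras = [
    ([[1], [1]], [[1, 2]]),       # (A1, B1)
    ([[2], [0]], [[0, 1]]),       # (A2, B2)
]
X = [[1, 2], [3, 4]]              # request i uses LoRA model i

# Left-hand term: a single batched matmul shared by every request.
base = matmul(X, W)

# Right-hand term, computed naively: one pair of tiny matmuls per request.
delta = [matmul(matmul([x], A), B)[0] for x, (A, B) in zip(X, loras)]

Y = [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]
```

Each row of `Y` equals what the corresponding finetuned model would produce on its own, yet W is read only once for the whole batch.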

We figured out an efficient way to compute the right-hand term (the LoRA add-on). We encapsulate this operation in a CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV), as illustrated below.
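Independently of the CUDA implementation, the semantics that the name SGMV suggests can be sketched as a reference function in plain Python. Everything here is an assumption for illustration: the function name `sgmv_reference`, its arguments, and the premise that requests for the same LoRA model sit next to each other in the batch so that each model covers one contiguous segment. It is not the kernel's actual interface.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def sgmv_reference(rows, weights, segments, lora_ids):
    """Hypothetical reference: multiply each segment of rows by one model's weight.

    segments holds boundaries, so segment k covers rows[segments[k]:segments[k + 1]].
    lora_ids[k] selects ("gathers") the weight matrix used for segment k.
    """
    out = []
    for (start, end), lora in zip(zip(segments, segments[1:]), lora_ids):
        out.extend(matmul(rows[start:end], weights[lora]))
    return out


# Toy setup (assumed): H1 = H2 = 2, rank r = 1, two LoRA models, three requests.
A = [[[1], [1]], [[2], [0]]]      # A[i] has shape [H1, r]
B = [[[1, 2]], [[0, 1]]]          # B[i] has shape [r, H2]
X = [[1, 2], [3, 4], [5, 6]]
segments = [0, 2, 3]              # rows 0-1 form one segment, row 2 another
lora_ids = [0, 1]                 # first segment uses model 0, second uses model 1

shrunk = sgmv_reference(X, A, segments, lora_ids)      # [n, r]
delta = sgmv_reference(shrunk, B, segments, lora_ids)  # [n, H2]
```

Applying the function twice, first with the A matrices and then with the B matrices, yields the LoRA term for the whole batch. A fused GPU kernel would process all segments in one launch instead of looping in Python.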

In the following microbenchmark figure, we can observe the strong batching effect of the pretrained model. A naive implementation of LoRA is slow, as depicted by the orange line. LoRA implemented via SGMV is efficient and preserves the strong batching effect.

The following figure compares the text generation throughput of Punica with that of other systems: HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM. The benchmark considers different settings of LoRA model popularity. Distinct means that each request is for a different LoRA model. Identical means that all requests are for the same LoRA model. Uniform and Skewed are in between. Punica achieves 12x throughput compared to state-of-the-art systems.
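To make the four popularity settings concrete, the sketch below assigns a LoRA model index to each request in a batch. It is an illustration only: the function name `assign_lora_models`, the `num_models` parameter, and the Zipf-like weighting with exponent `alpha` used for the Skewed case are assumptions, not the benchmark's actual workload generator.

```python
import random


def assign_lora_models(num_requests, popularity, num_models=4, alpha=1.5, seed=0):
    """Return one LoRA model index per request (hypothetical helper)."""
    rng = random.Random(seed)
    if popularity == "distinct":
        # Every request targets its own model.
        return list(range(num_requests))
    if popularity == "identical":
        # Every request targets the same model.
        return [0] * num_requests
    if popularity == "uniform":
        # All models are equally likely.
        return [rng.randrange(num_models) for _ in range(num_requests)]
    if popularity == "skewed":
        # Assumed Zipf-like weights: the model at rank k gets weight 1 / k**alpha.
        weights = [1.0 / (rank ** alpha) for rank in range(1, num_models + 1)]
        return rng.choices(range(num_models), weights=weights, k=num_requests)
    raise ValueError(f"unknown popularity setting: {popularity}")


for setting in ("distinct", "uniform", "skewed", "identical"):
    print(setting, assign_lora_models(8, setting))
```

Distinct is the hardest case for batching because no two requests share LoRA weights, while Identical reduces to serving a single finetuned model.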

Read our paper to understand more: Punica: Multi-Tenant LoRA Serving.

Install

git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init

pip install ninja torch
pip install -v --no-build-isolation .

Examples

Serving multiple LoRA models

See the demo above.

Finetune & convert to Punica format & serve with Punica

See examples/finetune/

Benchmark text generation

python -m benchmarks.bench_textgen_lora --system punica --batch-size 32

Citation

@misc{punica,
    title={Punica: Multi-Tenant LoRA Serving},
    author={Lequn Chen and Zihao Ye and Yongji Wu and Danyang Zhuo and Luis Ceze and Arvind Krishnamurthy},
    year={2023},
    eprint={2310.18547},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
