feat: Fail requests if server does not return expected number of tokens. Add tests. Update README
Hugoch committed Oct 8, 2024
1 parent 0437f2b commit 1e9a57c
Showing 6 changed files with 382 additions and 48 deletions.
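The headline change described in the commit message — failing a request when the server does not stream back the expected number of tokens — lives in backend code that is not among the hunks rendered below. Purely as a hedged illustration of that idea (all names here are hypothetical, not the crate's actual API):

```rust
// Hypothetical sketch of the check the commit message describes: a request is
// counted as failed when the server returns a different number of tokens than
// was asked for. Struct and field names are illustrative only.
#[derive(Debug)]
struct CompletedRequest {
    requested_tokens: u64,
    returned_tokens: u64,
}

fn validate_token_count(req: &CompletedRequest) -> Result<(), String> {
    if req.returned_tokens != req.requested_tokens {
        return Err(format!(
            "expected {} tokens, server returned {}",
            req.requested_tokens, req.returned_tokens
        ));
    }
    Ok(())
}

fn main() {
    let req = CompletedRequest { requested_tokens: 50, returned_tokens: 42 };
    // Reports the mismatch instead of silently counting the request as a success.
    println!("{:?}", validate_token_count(&req));
}
```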
3 changes: 2 additions & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ reqwest-eventsource = "0.6.0"
log = "0.4.22"
serde_json = "1.0.127"
serde = { version = "1.0.209", features = ["derive"] }
-tokio = { version = "1.40.0", features = ["rt", "rt-multi-thread", "macros","signal"] }
+tokio = { version = "1.40.0", features = ["rt", "rt-multi-thread", "macros", "signal"] }
anyhow = "1.0.86"
tokenizers = { version = "0.20.0", features = ["http"] }
rand_distr = "0.4.3"
@@ -32,3 +32,4 @@ indicatif = "0.17.8"
rayon = "1.10.0"
serde_with = "3.9.0"
sysinfo = "0.31.4"
+mockito = "1.5.0"
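The new `mockito` entry (presumably used by the tests this commit adds) makes it possible to exercise the HTTP client against a stubbed server. The tests themselves are not shown in this excerpt; as a hedged sketch only — assuming `reqwest` and tokio's `macros`/`rt` features are available, and with an illustrative path and payload — a mocked OpenAI-style endpoint could look like this:

```rust
// Hedged sketch: stubbing an OpenAI-style completions endpoint with mockito.
// Not code from this commit; the path and JSON body are illustrative only.
#[tokio::test]
async fn completion_endpoint_can_be_mocked() {
    // Spin up a local mock HTTP server.
    let mut server = mockito::Server::new_async().await;
    let mock = server
        .mock("POST", "/v1/chat/completions")
        .with_status(200)
        .with_header("content-type", "application/json")
        .with_body(r#"{"choices":[{"message":{"content":"hi"}}]}"#)
        .create_async()
        .await;

    // Point a plain reqwest client (assumed available) at the mock URL.
    let resp = reqwest::Client::new()
        .post(format!("{}/v1/chat/completions", server.url()))
        .body("{}")
        .send()
        .await
        .expect("request should reach the mock server");

    assert_eq!(resp.status(), 200);
    mock.assert_async().await; // verify the endpoint was actually hit
}
```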
87 changes: 68 additions & 19 deletions README.md
@@ -1,56 +1,89 @@
# Text Generation Inference benchmarking tool
# TGI Benchmark: A High-Performance Tool for Text Generation Model Benchmarking

A lightweight benchmarking tool for LLM inference servers.
Benchmarks using constant arrival rate or constant virtual user count.
Benchmarking inference servers for text generation models presents unique challenges.
The performance of these models can vary greatly depending on factors like input prompts,
decoding strategies, hardware specifications, and server configurations.

**TGI Benchmark** is designed to streamline this process by providing a comprehensive benchmarking tool
that evaluates the real-world performance of text generation models and servers.
With **TGI Benchmark**, you can easily test your model's throughput and efficiency under various workloads,
identify performance bottlenecks, and optimize your deployment for production environments.

It can be used to benchmark any text generation server that exposes an OpenAI-compliant API.

## Features
* Broad Compatibility: Benchmarks any text generation server with an OpenAI-compliant API.
* Automatic Sweep Mode: Detects the maximum throughput and benchmarks a sweep of request rates up to it.
* Open-Loop Benchmarking: Uses constant arrival rates to simulate real-world workloads.
* High-Performance: Built with Rust 🦀 for high-performance benchmarking.
* JSON Output: Delivers performance results in a structured, easy-to-analyze format.

![ui.png](assets/ui.png)

## Table of contents

<!-- TOC -->
* [Text Generation Inference benchmarking tool](#text-generation-inference-benchmarking-tool)
* [TGI Benchmark: A High-Performance Tool for Text Generation Model Benchmarking](#tgi-benchmark-a-high-performance-tool-for-text-generation-model-benchmarking)
* [Features](#features)
* [Table of contents](#table-of-contents)
* [TODO](#todo)
* [Get started](#get-started)
* [Run a benchmark](#run-a-benchmark)
* [1. Start an inference server](#1-start-an-inference-server)
* [2. Run a benchmark using Docker image](#2-run-a-benchmark-using-docker-image)
* [Configure your benchmark](#configure-your-benchmark)
* [Benchmark mode](#benchmark-mode)
* [Dataset configuration](#dataset-configuration)
* [Prompt configuration](#prompt-configuration)
* [Decode options](#decode-options)
* [Development](#development)
* [Frequently Asked Questions](#frequently-asked-questions)
* [TODO](#todo)
<!-- TOC -->

## TODO

- [X] Customizable token count and variance
- [ ] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [ ] Push results to Optimum benchmark backend
- [X] Script to generate plots from results
- [X] Add support for multiple tokens in stream chunks (when speculation is active)

## Get started

### Run a benchmark

Run a benchmark using Docker image:
#### 1. Start an inference server
**TGI**
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>

docker run --gpus all --shm-size 1g -p 8080:80 -e "HF_TOKEN=$HF_TOKEN" \
ghcr.io/huggingface/text-generation-inference:2.3.1 --model-id $MODEL
```

**vLLM**
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
docker run --runtime nvidia --gpus all \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p 8080:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model $MODEL
```

#### 2. Run a benchmark using Docker image

```shell
# start a TGI/vLLM server somewhere, then run benchmark...
# ... we mount results to the current directory
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# we mount results to the current directory
$ docker run \
--rm \
-it \
--net host \
-v $(pwd):/opt/text-generation-inference-benchmark/results \
-e "HF_TOKEN=$HF_TOKEN" \
ghcr.io/huggingface/text-generation-inference-benchmark:latest \
text-generation-inference-benchmark \
--tokenizer-name "Qwen/Qwen2-7B" \
--tokenizer-name "$MODEL" \
--max-vus 800 \
--url http:/localhost:8080 \
--url http://localhost:8080 \
--warmup 20s \
--num-rates 10 \
--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10" \
@@ -152,8 +185,24 @@ $ make build
iterations. **Constant arrival rate** is an open-loop model more representative of real-life workloads.


* **Why do I get a high error rate when running the `throughput` benchmark?**
The throughput benchmark tries to saturate the server with a high request rate. The error rate is high because the
server cannot handle the request rate or is rate-limiting the requests.
In the case of TGI, this is controlled by the `--max-concurrent-requests` option.


* **What is the influence of CUDA graphs?**
CUDA graphs optimize GPU usage by minimizing the overhead of launching kernels. This can lead to
better performance in some cases, but can also lead to worse performance in others.
If your CUDA graph batch sizes are not evenly distributed, you may see a performance drop at some request rates: the
effective batch size may fall into a larger CUDA graph bucket, leading to a loss of compute due to excessive padding.
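The constant-arrival-rate answers above describe an open-loop model: requests are issued on a fixed schedule regardless of how quickly the server responds. Purely as an illustration (not the tool's actual implementation), and assuming the `rand`/`rand_distr` crates and tokio with its `time` feature, such a scheduler can fire requests on a Poisson process:

```rust
use std::time::Duration;

use rand_distr::{Distribution, Exp};

// Hedged sketch of an open-loop, constant-arrival-rate scheduler: requests are
// fired with exponential inter-arrival times (a Poisson process) and never
// wait for earlier responses. Illustrative only, not the tool's actual code.
async fn fire_requests_at_rate(rate_per_sec: f64, total_requests: usize) {
    let inter_arrival = Exp::new(rate_per_sec).expect("rate must be positive");
    for i in 0..total_requests {
        // Spawn without awaiting: the arrival schedule is independent of
        // response latency, which is what makes the loop "open".
        tokio::spawn(async move {
            // send_request(i).await; // placeholder for the real request
            let _ = i;
        });
        let wait_s = inter_arrival.sample(&mut rand::thread_rng());
        tokio::time::sleep(Duration::from_secs_f64(wait_s)).await;
    }
}

#[tokio::main]
async fn main() {
    // Roughly 10 requests per second on average, 100 requests total.
    fire_requests_at_rate(10.0, 100).await;
}
```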


## TODO

- [X] Customizable token count and variance
- [X] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [X] Script to generate plots from results
- [X] Add support for multiple tokens in stream chunks (when speculation is active)
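Earlier, the Docker example passes `--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"`, i.e. prompts of roughly 50 tokens with some spread, bounded between 40 and 60. One plausible way to sample such lengths — an assumption for illustration, not necessarily how the tool interprets these options — is a normal draw clamped to the bounds, using the `rand_distr` crate from `Cargo.toml`:

```rust
use rand_distr::{Distribution, Normal};

// Hedged sketch of sampling token counts consistent with
// `num_tokens=50,max_tokens=60,min_tokens=40,variance=10`: a normal draw
// around the mean, clamped to [min, max]. Illustrative only.
fn sample_num_tokens(mean: f64, variance: f64, min: i64, max: i64) -> u64 {
    let normal = Normal::new(mean, variance.sqrt()).expect("valid parameters");
    let drawn = normal.sample(&mut rand::thread_rng()).round() as i64;
    drawn.clamp(min, max) as u64
}

fn main() {
    for _ in 0..5 {
        println!("prompt length: {} tokens", sample_num_tokens(50.0, 10.0, 40, 60));
    }
}
```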
8 changes: 4 additions & 4 deletions src/benchmark.rs
@@ -45,8 +45,8 @@ pub enum Event {
}

pub struct Benchmark {
-start_time: Option<std::time::Instant>,
-end_time: Option<std::time::Instant>,
+start_time: Option<tokio::time::Instant>,
+end_time: Option<tokio::time::Instant>,
backend: Box<dyn TextGenerationBackend + Send + Sync>,
requests: Arc<Mutex<dyn TextRequestGenerator + Send>>,
report: BenchmarkReport,
@@ -131,7 +131,7 @@ impl Benchmark {
}

pub async fn run(&mut self) -> anyhow::Result<BenchmarkReport> {
-self.start_time = Some(std::time::Instant::now());
+self.start_time = Some(tokio::time::Instant::now());
self.report.start();
info!("Prewarming backend");
self.warmup().await?;
@@ -147,7 +147,7 @@ impl Benchmark {
self.run_rates().await?;
}
}
-self.end_time = Some(std::time::Instant::now());
+self.end_time = Some(tokio::time::Instant::now());
self.event_bus.send(Event::Message(MessageEvent {
message: format!("Benchmark complete in {:?}", self.duration().expect("duration exists")),
timestamp: chrono::Utc::now(),
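The switch from `std::time::Instant` to `tokio::time::Instant` in this file pairs naturally with the tests the commit adds: under Tokio's paused test clock, time measured with `tokio::time::Instant` advances deterministically. A hedged sketch of that behavior (assuming tokio's `test-util` feature is enabled for tests; this is not code from the commit):

```rust
// Hedged sketch: with a paused Tokio clock, sleeps auto-advance virtual time,
// so durations measured with tokio::time::Instant are deterministic in tests.
// Assumes tokio's `test-util` feature; not code from this commit.
#[tokio::test(start_paused = true)]
async fn measures_duration_on_a_paused_clock() {
    let start = tokio::time::Instant::now();
    // No real five-second wait happens: the paused clock jumps forward instead.
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
    let elapsed = tokio::time::Instant::now() - start;
    assert!(elapsed >= std::time::Duration::from_secs(5));
}
```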