feat: Fail requests if server does not return expected number of tokens. Add tests. Update README
Hugoch committed Oct 8, 2024
1 parent 0437f2b commit 1e9a57c
Showing 6 changed files with 382 additions and 48 deletions.
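The headline change described in the commit message — failing a request when the server does not stream back the expected number of tokens — lives in backend code that is not among the hunks rendered below. Purely as a hedged illustration of that idea (all names here are hypothetical, not the crate's actual API):

```rust
// Hypothetical sketch of the check the commit message describes: a request is
// counted as failed when the server returns a different number of tokens than
// was asked for. Struct and field names are illustrative only.
#[derive(Debug)]
struct CompletedRequest {
    requested_tokens: u64,
    returned_tokens: u64,
}

fn validate_token_count(req: &CompletedRequest) -> Result<(), String> {
    if req.returned_tokens != req.requested_tokens {
        return Err(format!(
            "expected {} tokens, server returned {}",
            req.requested_tokens, req.returned_tokens
        ));
    }
    Ok(())
}

fn main() {
    let req = CompletedRequest { requested_tokens: 50, returned_tokens: 42 };
    // Reports the mismatch instead of silently counting the request as a success.
    println!("{:?}", validate_token_count(&req));
}
```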
3 changes: 2 additions & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ reqwest-eventsource = "0.6.0"
log = "0.4.22"
serde_json = "1.0.127"
serde = { version = "1.0.209", features = ["derive"] }
-tokio = { version = "1.40.0", features = ["rt", "rt-multi-thread", "macros","signal"] }
+tokio = { version = "1.40.0", features = ["rt", "rt-multi-thread", "macros", "signal"] }
anyhow = "1.0.86"
tokenizers = { version = "0.20.0", features = ["http"] }
rand_distr = "0.4.3"
@@ -32,3 +32,4 @@ indicatif = "0.17.8"
rayon = "1.10.0"
serde_with = "3.9.0"
sysinfo = "0.31.4"
+mockito = "1.5.0"
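The new `mockito` entry (presumably used by the tests this commit adds) makes it possible to exercise the HTTP client against a stubbed server. The tests themselves are not shown in this excerpt; as a hedged sketch only — assuming `reqwest` and tokio's `macros`/`rt` features are available, and with an illustrative path and payload — a mocked OpenAI-style endpoint could look like this:

```rust
// Hedged sketch: stubbing an OpenAI-style completions endpoint with mockito.
// Not code from this commit; the path and JSON body are illustrative only.
#[tokio::test]
async fn completion_endpoint_can_be_mocked() {
    // Spin up a local mock HTTP server.
    let mut server = mockito::Server::new_async().await;
    let mock = server
        .mock("POST", "/v1/chat/completions")
        .with_status(200)
        .with_header("content-type", "application/json")
        .with_body(r#"{"choices":[{"message":{"content":"hi"}}]}"#)
        .create_async()
        .await;

    // Point a plain reqwest client (assumed available) at the mock URL.
    let resp = reqwest::Client::new()
        .post(format!("{}/v1/chat/completions", server.url()))
        .body("{}")
        .send()
        .await
        .expect("request should reach the mock server");

    assert_eq!(resp.status(), 200);
    mock.assert_async().await; // verify the endpoint was actually hit
}
```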
87 changes: 68 additions & 19 deletions README.md
@@ -1,56 +1,89 @@
# Text Generation Inference benchmarking tool
# TGI Benchmark: A High-Performance Tool for Text Generation Model Benchmarking

A lightweight benchmarking tool for LLM inference servers.
Benchmarks using constant arrival rate or constant virtual user count.
Benchmarking inference servers for text generation models presents unique challenges.
The performance of these models can vary greatly depending on factors like input prompts,
decoding strategies, hardware specifications, and server configurations.

**TGI Benchmark** is designed to streamline this process by providing a comprehensive benchmarking tool
that evaluates the real-world performance of text generation models and servers.
With **TGI Benchmark**, you can easily test your model's throughput and efficiency under various workloads,
identify performance bottlenecks, and optimize your deployment for production environments.

It can be used to benchmark any text generation server that exposes an OpenAI-compliant API.

## Features
* Broad Compatibility: Benchmarks any text generation server with an OpenAI-compliant API.
* Automatic Sweep Mode: Detects the maximum throughput and benchmarks a sweep of request rates up to it.
* Open-Loop Benchmarking: Uses constant arrival rates to simulate real-world workloads.
* High-Performance: Built with Rust 🦀 for high-performance benchmarking.
* JSON Output: Delivers performance results in a structured, easy-to-analyze format.

![ui.png](assets/ui.png)

## Table of contents

<!-- TOC -->
* [Text Generation Inference benchmarking tool](#text-generation-inference-benchmarking-tool)
* [TGI Benchmark: A High-Performance Tool for Text Generation Model Benchmarking](#tgi-benchmark-a-high-performance-tool-for-text-generation-model-benchmarking)
* [Features](#features)
* [Table of contents](#table-of-contents)
* [TODO](#todo)
* [Get started](#get-started)
* [Run a benchmark](#run-a-benchmark)
* [1. Start an inference server](#1-start-an-inference-server)
* [2. Run a benchmark using Docker image](#2-run-a-benchmark-using-docker-image)
* [Configure your benchmark](#configure-your-benchmark)
* [Benchmark mode](#benchmark-mode)
* [Dataset configuration](#dataset-configuration)
* [Prompt configuration](#prompt-configuration)
* [Decode options](#decode-options)
* [Development](#development)
* [Frequently Asked Questions](#frequently-asked-questions)
* [TODO](#todo)
<!-- TOC -->

## TODO

- [X] Customizable token count and variance
- [ ] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [ ] Push results to Optimum benchmark backend
- [X] Script to generate plots from results
- [X] Add support for multiple tokens in stream chunks (when speculation is active)

## Get started

### Run a benchmark

Run a benchmark using Docker image:
#### 1. Start an inference server
**TGI**
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>

docker run --gpus all --shm-size 1g -p 8080:80 -e "HF_TOKEN=$HF_TOKEN" \
ghcr.io/huggingface/text-generation-inference:2.3.1 --model-id $MODEL
```

**vLLM**
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
docker run --runtime nvidia --gpus all \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p 8080:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model $MODEL
```

#### 2. Run a benchmark using Docker image

```shell
# start a TGI/vLLM server somewhere, then run benchmark...
# ... we mount results to the current directory
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# we mount results to the current directory
$ docker run \
--rm \
-it \
--net host \
-v $(pwd):/opt/text-generation-inference-benchmark/results \
-e "HF_TOKEN=$HF_TOKEN" \
ghcr.io/huggingface/text-generation-inference-benchmark:latest \
text-generation-inference-benchmark \
--tokenizer-name "Qwen/Qwen2-7B" \
--tokenizer-name "$MODEL" \
--max-vus 800 \
--url http:/localhost:8080 \
--url http://localhost:8080 \
--warmup 20s \
--num-rates 10 \
--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10" \
@@ -152,8 +185,24 @@ $ make build
iterations. **Constant arrival rate** is an open-loop model more representative of real-life workloads.


* **Why do I get a high error rate when running the `throughput` benchmark?**
The throughput benchmark tries to saturate the server with a high request rate. The error rate is high because the
server cannot handle the request rate or is rate-limiting the requests.
In the case of TGI, this is controlled by the `--max-concurrent-requests` option.


* **What is the influence of CUDA graphs?**
CUDA graphs optimize GPU usage by minimizing the overhead of launching kernels. This can lead to
better performance in some cases, but can also lead to worse performance in others.
If your CUDA graph batch sizes are not evenly distributed, you may see a performance drop at some request rates: the
effective batch size may fall into a larger CUDA graph bucket, leading to a loss of compute due to excessive padding.
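The constant-arrival-rate answers above describe an open-loop model: requests are issued on a fixed schedule regardless of how quickly the server responds. Purely as an illustration (not the tool's actual implementation), and assuming the `rand`/`rand_distr` crates and tokio with its `time` feature, such a scheduler can fire requests on a Poisson process:

```rust
use std::time::Duration;

use rand_distr::{Distribution, Exp};

// Hedged sketch of an open-loop, constant-arrival-rate scheduler: requests are
// fired with exponential inter-arrival times (a Poisson process) and never
// wait for earlier responses. Illustrative only, not the tool's actual code.
async fn fire_requests_at_rate(rate_per_sec: f64, total_requests: usize) {
    let inter_arrival = Exp::new(rate_per_sec).expect("rate must be positive");
    for i in 0..total_requests {
        // Spawn without awaiting: the arrival schedule is independent of
        // response latency, which is what makes the loop "open".
        tokio::spawn(async move {
            // send_request(i).await; // placeholder for the real request
            let _ = i;
        });
        let wait_s = inter_arrival.sample(&mut rand::thread_rng());
        tokio::time::sleep(Duration::from_secs_f64(wait_s)).await;
    }
}

#[tokio::main]
async fn main() {
    // Roughly 10 requests per second on average, 100 requests total.
    fire_requests_at_rate(10.0, 100).await;
}
```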


## TODO

- [X] Customizable token count and variance
- [X] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [X] Script to generate plots from results
- [X] Add support for multiple tokens in stream chunks (when speculation is active)
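Earlier, the Docker example passes `--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"`, i.e. prompts of roughly 50 tokens with some spread, bounded between 40 and 60. One plausible way to sample such lengths — an assumption for illustration, not necessarily how the tool interprets these options — is a normal draw clamped to the bounds, using the `rand_distr` crate from `Cargo.toml`:

```rust
use rand_distr::{Distribution, Normal};

// Hedged sketch of sampling token counts consistent with
// `num_tokens=50,max_tokens=60,min_tokens=40,variance=10`: a normal draw
// around the mean, clamped to [min, max]. Illustrative only.
fn sample_num_tokens(mean: f64, variance: f64, min: i64, max: i64) -> u64 {
    let normal = Normal::new(mean, variance.sqrt()).expect("valid parameters");
    let drawn = normal.sample(&mut rand::thread_rng()).round() as i64;
    drawn.clamp(min, max) as u64
}

fn main() {
    for _ in 0..5 {
        println!("prompt length: {} tokens", sample_num_tokens(50.0, 10.0, 40, 60));
    }
}
```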
8 changes: 4 additions & 4 deletions src/benchmark.rs
@@ -45,8 +45,8 @@ pub enum Event {
}

pub struct Benchmark {
-start_time: Option<std::time::Instant>,
-end_time: Option<std::time::Instant>,
+start_time: Option<tokio::time::Instant>,
+end_time: Option<tokio::time::Instant>,
backend: Box<dyn TextGenerationBackend + Send + Sync>,
requests: Arc<Mutex<dyn TextRequestGenerator + Send>>,
report: BenchmarkReport,
@@ -131,7 +131,7 @@ impl Benchmark {
}

pub async fn run(&mut self) -> anyhow::Result<BenchmarkReport> {
-self.start_time = Some(std::time::Instant::now());
+self.start_time = Some(tokio::time::Instant::now());
self.report.start();
info!("Prewarming backend");
self.warmup().await?;
@@ -147,7 +147,7 @@ impl Benchmark {
self.run_rates().await?;
}
}
-self.end_time = Some(std::time::Instant::now());
+self.end_time = Some(tokio::time::Instant::now());
self.event_bus.send(Event::Message(MessageEvent {
message: format!("Benchmark complete in {:?}", self.duration().expect("duration exists")),
timestamp: chrono::Utc::now(),
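The switch from `std::time::Instant` to `tokio::time::Instant` in this file pairs naturally with the tests the commit adds: under Tokio's paused test clock, time measured with `tokio::time::Instant` advances deterministically. A hedged sketch of that behavior (assuming tokio's `test-util` feature is enabled for tests; this is not code from the commit):

```rust
// Hedged sketch: with a paused Tokio clock, sleeps auto-advance virtual time,
// so durations measured with tokio::time::Instant are deterministic in tests.
// Assumes tokio's `test-util` feature; not code from this commit.
#[tokio::test(start_paused = true)]
async fn measures_duration_on_a_paused_clock() {
    let start = tokio::time::Instant::now();
    // No real five-second wait happens: the paused clock jumps forward instead.
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
    let elapsed = tokio::time::Instant::now() - start;
    assert!(elapsed >= std::time::Duration::from_secs(5));
}
```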