
LLM (Large Language Model) serving has quickly become an important workload. The efficiency of the operators inside Transformers – namely GEMM, Self-Attention, GEMV, and elementwise computations – is critical to the overall performance of LLM serving. While optimization efforts have extensively targeted GEMM and GEMV, performance studies of Self-Attention in the context of LLM serving are lacking. In this blog post, we break Self-Attention down into three stages: prefill, decode, and append; analyze its performance bottlenecks in both single-request and batching scenarios across these three stages; and propose a solution to these challenges. These ideas have been integrated into [FlashInfer](https://github.com/flashinfer-ai/flashinfer/), an open-source library for accelerating LLM serving, released under the Apache 2.0 license.

FlashInfer has been developed by researchers from the University of Washington, Carnegie Mellon University, and OctoAI since summer 2023. FlashInfer provides PyTorch APIs for quick prototyping and dependency-free, header-only C++ APIs for integration with LLM serving systems. Compared to existing libraries, FlashInfer has several unique advantages:

1. **Comprehensive Attention Kernels**: FlashInfer implements attention kernels that cover all the common use cases of LLM serving with state-of-the-art performance, including single-request and batching versions of Prefill, Decode, and Append kernels, on various formats of KV-Cache (Padded Tensor, Ragged Tensor, and Page Table).
2. **Optimized Shared-Prefix Batch Decoding**: FlashInfer accelerates shared-prefix batch decoding through cascading, yielding up to a 31x speedup over the baseline vLLM PageAttention implementation (for a long prompt of 32768 tokens and a large batch size of 256); see [another blog post](./2024-01-08-cascade-inference.md) for more details.
Much recent work proposes KV-Cache compression techniques to reduce memory traffic. In light of this, FlashInfer optimizes kernels for *Grouped-Query Attention*, *Fused-RoPE Attention*, and *Quantized Attention* for efficient serving with a compressed KV-Cache:
- **Grouped Query Attention**: [Grouped Query Attention](https://arxiv.org/abs/2305.13245) uses a smaller number of heads for keys and values, thus saving memory traffic. The operational intensity of Grouped Query Attention grows from $O(1)$ to $O\left(\frac{H_{qo}}{H_{kv}}\right)$, where $H_{qo}$ is the number of query heads and $H_{kv}$ is the number of key/value heads (a back-of-the-envelope sketch after this list makes the arithmetic concrete). GPUs such as the A100/H100 have low non-Tensor-Core performance, so a traditional implementation of Grouped Query Attention is compute-bound. FlashInfer proposes to use prefill kernels (which utilize Tensor Cores) for decode attention in GQA, achieving up to a 2-3x speedup compared to the vLLM implementation.
- **Fused-RoPE Attention**: [RoPE (Rotary Positional Embeddings)](https://arxiv.org/abs/2104.09864) has become a standard component of Transformers, and most existing serving systems store post-RoPE keys (the keys after applying rotary embeddings) in the KV-Cache. However, recent work such as [StreamingLLM](https://arxiv.org/abs/2309.17453) prunes tokens in the KV-Cache, and since the positions of tokens change after pruning, the post-RoPE keys in the KV-Cache become meaningless. For this case, FlashInfer stores pre-RoPE keys in the KV-Cache and fuses RoPE into the attention kernel. Experiments on various platforms and settings show that FlashInfer's Fused-RoPE Attention kernel can apply RoPE on the fly with negligible overhead.
- **Quantized Attention**: Another way to compress the KV-Cache is through quantization; [FlexGen](https://arxiv.org/abs/2303.06865) and [Atom](https://arxiv.org/abs/2310.19102) show that the KV-Cache can be quantized to 4 bits with negligible accuracy loss. FlashInfer implements low-precision attention kernels that achieve a nearly linear speedup with the compression ratio (~4x for 4-bit, ~2x for 8-bit).
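
To make the operational-intensity argument for GQA concrete, here is a back-of-the-envelope sketch in plain Python. The head counts, head dimension, KV length, and fp16 storage below are illustrative assumptions, not measured configurations:

```python
# Rough operational intensity of decode attention for a single request.
# All parameter values below are illustrative assumptions.

def decode_attention_intensity(num_qo_heads, num_kv_heads, head_dim, kv_len,
                               bytes_per_elem=2):  # fp16 KV-Cache
    # FLOPs: q @ K^T and softmax(...) @ V each cost ~2 * num_qo_heads * kv_len * head_dim
    flops = 4 * num_qo_heads * kv_len * head_dim
    # Dominant memory traffic: streaming K and V out of the KV-Cache
    kv_bytes = 2 * kv_len * num_kv_heads * head_dim * bytes_per_elem
    return flops / kv_bytes  # ~= num_qo_heads / num_kv_heads FLOPs per byte

print(decode_attention_intensity(32, 32, 128, 4096))  # MHA: ~1 FLOP/byte
print(decode_attention_intensity(32, 8, 128, 4096))   # GQA: ~4 FLOPs/byte
```

The ratio grows with $H_{qo}/H_{kv}$, which is why decode attention under GQA is no longer purely limited by memory bandwidth on CUDA Cores and benefits from the Tensor-Core prefill kernels.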

Recent work such as [LightLLM](https://github.com/ModelTC/lightllm) and [sglang](https://github.com/sgl-project/sglang) uses a special form of PageAttention where the page size equals one, for easy management of the KV-Cache in complicated serving scenarios such as structured generation. FlashInfer optimizes PageAttention kernels by prefetching page indices into GPU shared memory, so that kernel performance is not affected by the page size.
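
As a rough illustration of this layout (a toy PyTorch sketch under our own naming, not FlashInfer's actual data structures or API), a paged KV-Cache with page size one reduces to a per-token slot list, and the gather a PageAttention kernel performs looks like this:

```python
import torch

# Toy paged KV-Cache with page_size == 1: every "page" holds a single token, so
# the page table is a CSR-style list of slot indices per request.
num_slots, num_kv_heads, head_dim = 1024, 32, 128
kv_pool = torch.randn(num_slots, 2, num_kv_heads, head_dim)  # [slot, K/V, head, dim]

# Two requests with KV lengths 3 and 2, stored at scattered slots in the pool.
kv_indices = torch.tensor([17, 5, 900, 42, 7])  # per-token page (slot) indices
kv_indptr = torch.tensor([0, 3, 5])             # request i owns kv_indices[indptr[i]:indptr[i+1]]

def gather_kv(request_id):
    slots = kv_indices[kv_indptr[request_id]:kv_indptr[request_id + 1]]
    pages = kv_pool[slots]           # [kv_len, 2, num_kv_heads, head_dim]
    return pages[:, 0], pages[:, 1]  # keys, values in logical token order

keys, values = gather_kv(0)  # KV of request 0; a real kernel never materializes this copy
```

A real kernel performs this gather inside the attention loop; prefetching the relevant page indices into shared memory is what keeps the extra indirection from hurting performance, even at page size one.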
<p align="center">
Figure 5: Single-request decode kernel performance (Llama2-7B setting).
</p>

FlashInfer achieves the best performance on all four GPUs, and its GPU bandwidth utilization is close to 100% for long sequences.
An interesting observation is that split-KV does not improve performance on GPUs such as the RTX Ada 6000 and RTX 4090, because they have relatively lower memory bandwidth and stronger CUDA Core performance (decode attention has low operational intensity, and we use CUDA Cores in the non-GQA setting). Unlike compute units, which are local to each SM, global memory traffic on GPUs is shared, so using 32 of the 108 SMs (matching the 32 heads in the Llama2-7B setting) can still fully utilize the memory bandwidth as long as the operator is not compute-bound. The A100 has low CUDA Core performance (20 TFLOPs/s), and using 32 of its 108 SMs (5.9 TFLOPs/s) makes the kernel compute-bound (besides multiplies and adds, attention also involves time-consuming computations such as `exp`), so split-KV is helpful in this case.
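
This reasoning can be expressed as a rough roofline check. The sketch below uses the 20 TFLOPs/s CUDA-Core figure quoted above and an assumed ~2 TB/s of HBM bandwidth (A100 80GB class); both numbers are for illustration only:

```python
# Rough roofline check for the split-KV decision on an A100 (illustrative numbers).
cuda_core_tflops = 20.0        # whole-GPU CUDA-Core throughput quoted above
hbm_tb_per_s = 2.0             # assumed HBM bandwidth (A100 80GB class)
sms_used, sms_total = 32, 108  # one SM per KV head in the Llama2-7B setting

available_tflops = cuda_core_tflops * sms_used / sms_total  # ~5.9 TFLOPs/s
ridge_flops_per_byte = available_tflops / hbm_tb_per_s      # ~3 FLOPs/byte

# Decode attention is ~1 FLOP/byte of multiply-adds, but exp/max/sum in the
# softmax cost far more cycles per element than a fused multiply-add, so the
# effective intensity creeps toward this reduced ridge point and the 32-SM
# kernel can become compute-bound; split-KV spreads the work across more SMs.
print(f"{available_tflops:.1f} TFLOPs/s on 32 SMs, "
      f"ridge point {ridge_flops_per_byte:.1f} FLOPs/byte")
```

On the RTX 4090 and RTX Ada 6000, the same check with their stronger CUDA Cores and lower bandwidth shows the 32-SM kernel staying memory-bound, consistent with split-KV bringing no benefit there.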

For batch decoding attention, FlashInfer implements an optimized version of PageAttention; below is a performance comparison of the FlashInfer and vLLM PageAttention kernels:

For batch GQA decoding attention, FlashInfer w/ Tensor Cores is 3x faster than vLLM.
### Fused-RoPE Attention

KV-Cache compression techniques such as [H2O](https://arxiv.org/abs/2306.14048) and [Streaming-LLM](https://github.com/mit-han-lab/streaming-llm) prune the KV-Cache by removing tokens, which pollutes the original relative positions of tokens in the KV-Cache and makes storing post-RoPE keys meaningless. FlashInfer implements high-performance Fused-RoPE attention kernels that apply RoPE on the fly. Below is a performance comparison of FlashInfer decoding attention with and without RoPE:

<p align="center">
<img src="/assets/imgs/fused-rope-attention.png" alt="fused rope attention" width="800"/>