diff --git a/_posts/2024-01-03-introduce-flashinfer.md b/_posts/2024-01-03-introduce-flashinfer.md
index 136a55b..edb854b 100644
--- a/_posts/2024-01-03-introduce-flashinfer.md
+++ b/_posts/2024-01-03-introduce-flashinfer.md
@@ -161,7 +161,7 @@ FlashInfer also implemented batch append attention kernel where key/value is sto

single gqa benchmarks
-Figure 8: Single request GQA decode performance, use llama2-70b setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 8: Single-request GQA decode performance, using the Llama2-70B setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
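For readers less familiar with the setting, the workload behind Figure 8 is grouped-query attention at decode time: 32 query heads share 4 KV heads, so each KV head is reused by a group of 8 query heads. A minimal PyTorch sketch of the reference computation being benchmarked (illustrative only, not FlashInfer's kernel; the kv_len value and tensor names are made up):

```python
import torch

num_qo_heads, num_kv_heads, head_dim = 32, 4, 128   # Figure 8 setting (llama2-70b, tp=2)
group_size = num_qo_heads // num_kv_heads            # 8 query heads per KV head
kv_len = 8192                                        # one point on the swept sequence length

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Group the query heads by the KV head they share, then run standard decode attention
# per group: each KV head is loaded once and reused by its 8 query heads.
q_grouped = q.view(num_kv_heads, group_size, head_dim)               # [4, 8, 128]
scores = torch.bmm(q_grouped, k.permute(1, 2, 0)) / head_dim**0.5    # [4, 8, kv_len]
probs = torch.softmax(scores.float(), dim=-1).half()
out = torch.bmm(probs, v.permute(1, 0, 2)).reshape(num_qo_heads, head_dim)
```

Because every KV tile is shared by an 8-row block of queries, the score computation is a small matrix-matrix product rather than a matrix-vector product, which is why a Tensor Core code path can pay off even for a single decode request, as the results below show.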

For single-request GQA decoding attention, FlashInfer (Tensor Cores) achieves better performance than FlashAttention 2.4.2 on both A100 & H100, and FlashInfer (CUDA Cores) can only achieve 40%+ bandwidth utilization because of limited CUDA Cores performance.
@@ -196,7 +196,7 @@ FlashInfer implements high-performance fp8 decode decode kernels, which could ac

fp8 attention
-Figure 11: FP8 decode attention performance, use llama-7b setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 11: FP8 decode attention performance, using the Llama2-7B setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
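The diff does not show how the fp8 path works internally; as a rough, self-contained illustration of the general idea (the KV-Cache held in 8-bit floating point and dequantized on the fly, here with a naive per-tensor scale that is not necessarily the scaling scheme FlashInfer uses):

```python
import torch

num_heads, head_dim, kv_len = 32, 128, 8192    # Figure 11 setting, one kv length
q = torch.randn(num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# Store KV in 8-bit floating point (e4m3) with a per-tensor scale; this halves the
# bytes read from the KV-Cache per decode step compared with fp16.
k_scale = k.abs().amax().float() / 448.0       # 448 = largest normal e4m3 value
v_scale = v.abs().amax().float() / 448.0
k_fp8 = (k.float() / k_scale).to(torch.float8_e4m3fn)
v_fp8 = (v.float() / v_scale).to(torch.float8_e4m3fn)

def decode(q, k, v):
    scores = torch.einsum("hd,nhd->hn", q.float(), k.float()) / head_dim**0.5
    return torch.einsum("hn,nhd->hd", torch.softmax(scores, dim=-1), v.float())

out_fp16 = decode(q, k, v)
out_fp8 = decode(q, k_fp8.float() * k_scale, v_fp8.float() * v_scale)  # dequantize on the fly
print((out_fp8 - out_fp16).abs().max())        # quantization error of the fp8 path
```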

There is some gap between bandwidth utilization of fp8 and fp16 kernels, however the gap is getting closer as the query length grows.
diff --git a/_posts/2024-01-08-cascade-inference.md b/_posts/2024-01-08-cascade-inference.md
index 2d156f0..bb8fd0f 100644
--- a/_posts/2024-01-08-cascade-inference.md
+++ b/_posts/2024-01-08-cascade-inference.md
@@ -64,7 +64,7 @@ The above n-ary merge operator is consistent with the binary merge operator, and

recursive-attention
-Figure 3. Different order to merge attention states are mathematically equivalent.
+Figure 2. Different orders of merging attention states are mathematically equivalent.
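An attention state here is just the pair (partial attention output, log-sum-exp of the scores), so the merge takes only a few lines. The PyTorch sketch below (reference math only, not FlashInfer's kernels or API; all shapes and chunk sizes are made up) checks the claim in Figure 2 that the merge order does not matter and that the merged result equals attention over the full KV:

```python
import torch

def attention_state(q, k, v):
    # Attention of one query against one chunk of KV, returning the pair
    # (partial output, log-sum-exp) that the post calls an attention state.
    logits = (k @ q) / q.shape[-1] ** 0.5          # [chunk_len]
    return torch.softmax(logits, dim=0) @ v, torch.logsumexp(logits, dim=0)

def merge(a, b):
    # Binary merge operator: re-weight each partial output by its share of the
    # combined softmax normalizer, and accumulate the normalizer in log space.
    (va, sa), (vb, sb) = a, b
    s = torch.logaddexp(sa, sb)
    return torch.exp(sa - s) * va + torch.exp(sb - s) * vb, s

head_dim = 128
q = torch.randn(head_dim)
ks = [torch.randn(n, head_dim) for n in (1000, 300, 77)]
vs = [torch.randn(n, head_dim) for n in (1000, 300, 77)]
states = [attention_state(q, k, v) for k, v in zip(ks, vs)]

left = merge(merge(states[0], states[1]), states[2])      # merge left to right
right = merge(states[0], merge(states[1], states[2]))     # merge right to left
full = attention_state(q, torch.cat(ks), torch.cat(vs))   # attention over all KV at once
assert torch.allclose(left[0], right[0], atol=1e-4)
assert torch.allclose(left[0], full[0], atol=1e-4)
```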

Recursive Attention allow us to decompose attention computation into multiple stages, different stages
@@ -78,12 +78,12 @@ we propose the following Divide-and-Conquer algorithm:
2. Use batch decode attention kernel to compute the attention state between queries and KV-Cache of unique suffixes.
3. Use merge operator to combine two attention states to get the final attention output.

-The overall workflow is explained on the left side of Figure 2, different color of rectangles are processed in different thread blocks in GPU. Note that for multi-query attention kernels, we access KV-Cache through SMEM or registers and for decode kernels we can only access KV-Cache through L2 Cache or Global Memory. Cascade Inference allow us to maximize memory reuse for common prefix, thus making the attention computation much more memory efficient.
+The overall workflow is explained on the left side of Figure 3; rectangles of different colors are processed in different thread blocks on the GPU. Note that multi-query attention kernels access the KV-Cache through SMEM or registers, while decode kernels can only access the KV-Cache through L2 Cache or global memory. Cascade Inference allows us to maximize memory reuse for the common prefix, making the attention computation much more memory efficient.
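Putting the steps above and the merge operator together for a toy batch, here is a self-contained dense-tensor sketch of Cascade Inference (reference code only; the real implementation uses FlashInfer's batched prefill/decode kernels over a paged KV-Cache, and every shape and name below is illustrative):

```python
import torch

batch, head_dim, prefix_len, suffix_len = 8, 128, 1024, 64
scale = head_dim ** -0.5

q = torch.randn(batch, head_dim)                      # one decode query per request
k_shared = torch.randn(prefix_len, head_dim)          # KV of the shared prefix, stored once
v_shared = torch.randn(prefix_len, head_dim)
k_unique = torch.randn(batch, suffix_len, head_dim)   # per-request unique suffixes
v_unique = torch.randn(batch, suffix_len, head_dim)

def state(q, k, v):
    # (partial output, log-sum-exp) of q attending to k/v
    logits = q @ k.transpose(-1, -2) * scale
    return torch.softmax(logits, -1) @ v, torch.logsumexp(logits, -1)

# Step 1: every query attends to the single shared prefix (the multi-query attention kernel).
v1, s1 = state(q, k_shared, v_shared)                 # [batch, head_dim], [batch]
# Step 2: each query attends to its own unique suffix (the batch decode kernel).
v2, s2 = state(q.unsqueeze(1), k_unique, v_unique)
v2, s2 = v2.squeeze(1), s2.squeeze(1)
# Step 3: merge the two attention states into the final output.
s = torch.logaddexp(s1, s2)
out = torch.exp(s1 - s)[:, None] * v1 + torch.exp(s2 - s)[:, None] * v2

# Reference: ordinary attention over the concatenated prefix + suffix KV of each request.
k_full = torch.cat([k_shared.expand(batch, -1, -1), k_unique], dim=1)
v_full = torch.cat([v_shared.expand(batch, -1, -1), v_unique], dim=1)
ref, _ = state(q.unsqueeze(1), k_full, v_full)
assert torch.allclose(out, ref.squeeze(1), atol=1e-4)
```

Even in this toy version the point of the decomposition is visible: k_shared and v_shared appear once, not once per request.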

Cascade Inference
-Figure 2. Workflow of Cascade Inference, throughput values adapted from blog: TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware
+Figure 3. Workflow of Cascade Inference; throughput values adapted from the blog: TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware

We call the divide-and-conquer approach for shared-prefix attention the "Cascade Inference".
@@ -95,16 +95,16 @@ We evaluate Cascade Inference on H100 SXM 80GB and A100 PCIE 80GB GPUs. The inpu

speedup-h100
-Figure 3. Speedup over vLLM PageAttention on H100 SXM 80GB
+Figure 4. Speedup over vLLM PagedAttention on H100 SXM 80GB

speedup-a100
-Figure 4. Speedup over vLLM PageAttention on A100 PCIe 80GB
+Figure 5. Speedup over vLLM PagedAttention on A100 PCIe 80GB

-Figure 3 and 4 show the normalized performance on FlashInfer kernels in cascading and non-cascading setting
+Figures 4 and 5 show the normalized performance of FlashInfer kernels in cascading and non-cascading settings
over vLLM implementation. FlashInfer kernels in both settings outperforms vLLM kernels, and cascading kernels significant speedup over non-Cascade Inference kernels in most cases. The benefit of cascade inference increases as shared prefix length and batch size grows (where the prefill kernel dominates execution time) and decreases as we increase unique suffix length (where the batch decode kernel dominates execution time). For very long shared prompt (32768), the decode kernel can get up to 31x speedup on H100 SXM 80GB with large batch size(≥128) and short unique kv-length (≤256).
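As a back-of-envelope illustration of why the benefit scales this way (an estimate, not a measurement from the post): in that last configuration, non-cascading batch decode streams the KV-Cache of roughly 128 × (32768 + 256) ≈ 4.2M tokens per layer from global memory, while the cascading kernels read the shared 32768-token prefix once plus 128 × 256 ≈ 32.8K unique tokens, about 65.5K tokens in total. That is roughly a 64x reduction in KV traffic, which is what makes a 31x kernel speedup plausible for a memory-bound workload.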