Commit

fix typos
yzh119 committed Feb 9, 2024
1 parent 6cf8629 commit b558807
Showing 2 changed files with 8 additions and 8 deletions.
4 changes: 2 additions & 2 deletions _posts/2024-01-03-introduce-flashinfer.md
@@ -161,7 +161,7 @@ FlashInfer also implemented batch append attention kernel where key/value is sto
<p align="center">
<img src="/assets/imgs/single-gqa-benchmark.png" alt="single gqa benchmarks" width="800"/>
<br>
-Figure 8: Single request GQA decode performance, use llama2-70b setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 8: Single request GQA decode performance, use llama2-70b setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
</p>

For single-request GQA decoding attention, FlashInfer (Tensor Cores) achieves better performance than FlashAttention 2.4.2 on both A100 & H100, and FlashInfer (CUDA Cores) can only achieve 40%+ bandwidth utilization because of limited CUDA Cores performance.
@@ -196,7 +196,7 @@ FlashInfer implements high-performance fp8 decode kernels, which could ac
<p align="center">
<img src="/assets/imgs/fp8-attention.png" alt="fp8 attention" width="800"/>
<br>
-Figure 11: FP8 decode attention performance, use llama-7b setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 11: FP8 decode attention performance, use Llama2-7B setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
</p>

There is some gap between the bandwidth utilization of fp8 and fp16 kernels; however, the gap narrows as the query length grows.
12 changes: 6 additions & 6 deletions _posts/2024-01-08-cascade-inference.md
@@ -64,7 +64,7 @@ The above n-ary merge operator is consistent with the binary merge operator, and
<p align="center">
<img src="/assets/imgs/recursive-attention.png" alt="recursive-attention" width="800"/>
<br>
-Figure 3. Different order to merge attention states are mathematically equivalent.
+Figure 2. Different order to merge attention states are mathematically equivalent.
</p>
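
As a rough illustration of the attention state and the binary merge operator referenced in this hunk, here is a minimal PyTorch sketch; the function names, shapes, and plain-PyTorch math are illustrative assumptions, not FlashInfer's actual API or kernels.

```python
import torch

def attention_state(q, k, v):
    """Attention state of query q over one KV segment: (weighted value, log-sum-exp).

    q: [num_heads, head_dim]; k, v: [seq_len, num_heads, head_dim].
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,ihd->hi", q, k) * scale   # [num_heads, seq_len]
    s = torch.logsumexp(scores, dim=-1)                 # [num_heads]
    p = torch.exp(scores - s[:, None])                  # softmax over this segment only
    return torch.einsum("hi,ihd->hd", p, v), s          # ([num_heads, head_dim], [num_heads])

def merge_state(v_a, s_a, v_b, s_b):
    """Binary merge operator: combine attention states of two disjoint KV segments."""
    s = torch.logaddexp(s_a, s_b)
    v = v_a * torch.exp(s_a - s)[:, None] + v_b * torch.exp(s_b - s)[:, None]
    return v, s
```

Because this merge is commutative and associative, merging per-segment states in any order reproduces attention over the full KV-Cache, which is what the figure caption above states.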

Recursive Attention allows us to decompose attention computation into multiple stages, different stages
@@ -78,12 +78,12 @@ we propose the following Divide-and-Conquer algorithm:
2. Use batch decode attention kernel to compute the attention state between queries and KV-Cache of unique suffixes.
3. Use merge operator to combine two attention states to get the final attention output.
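
Reusing `attention_state` and `merge_state` from the sketch above, a toy version of these three steps could look like the following; the Python loop stands in for the batched prefill/decode kernels, and all names are illustrative.

```python
import torch

def cascade_decode(queries, shared_k, shared_v, suffix_k, suffix_v):
    """Toy cascade decode for a batch of requests sharing one common prefix.

    queries:           [batch, num_heads, head_dim]
    shared_k/shared_v: [prefix_len, num_heads, head_dim] (stored once for the batch)
    suffix_k/suffix_v: per-request lists of [suffix_len_i, num_heads, head_dim]
    """
    outputs = []
    for i, q in enumerate(queries):
        # Step 1: attention state against the shared prefix (the real kernel batches
        # this as multi-query attention so the prefix stays in SMEM/registers).
        v_p, s_p = attention_state(q, shared_k, shared_v)
        # Step 2: attention state against this request's unique suffix (batch decode).
        v_s, s_s = attention_state(q, suffix_k[i], suffix_v[i])
        # Step 3: merge the two states to recover attention over prefix + suffix.
        v, _ = merge_state(v_p, s_p, v_s, s_s)
        outputs.append(v)
    return torch.stack(outputs)   # [batch, num_heads, head_dim]
```

The result matches attention computed over each request's concatenated prefix-plus-suffix KV-Cache; only the memory-access pattern changes.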

-The overall workflow is explained on the left side of Figure 2, different color of rectangles are processed in different thread blocks in GPU. Note that for multi-query attention kernels, we access KV-Cache through SMEM or registers and for decode kernels we can only access KV-Cache through L2 Cache or Global Memory. Cascade Inference allow us to maximize memory reuse for common prefix, thus making the attention computation much more memory efficient.
+The overall workflow is explained on the left side of Figure 3, different color of rectangles are processed in different thread blocks in GPU. Note that for multi-query attention kernels, we access KV-Cache through SMEM or registers and for decode kernels we can only access KV-Cache through L2 Cache or Global Memory. Cascade Inference allow us to maximize memory reuse for common prefix, thus making the attention computation much more memory efficient.

<p align="center">
<img src="/assets/imgs/cascade-inference.png" alt="Cascade Inference" width="800"/>
<br>
-Figure 2. Workflow of Cascade Inference, throughput values adapted from blog: <a href="https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38">TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware</a>
+Figure 3. Workflow of Cascade Inference, throughput values adapted from blog: <a href="https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38">TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware</a>
</p>

We call the divide-and-conquer approach for shared-prefix attention the "Cascade Inference".
@@ -95,16 +95,16 @@ We evaluate Cascade Inference on H100 SXM 80GB and A100 PCIE 80GB GPUs. The inpu
<p align="center">
<img src="/assets/imgs/cascade-inference-performance-h100.png" alt="speedup-h100" width="800"/>
<br>
-Figure 3. Speedup over vLLM PageAttention on H100 SXM 80GB
+Figure 4. Speedup over vLLM PageAttention on H100 SXM 80GB
</p>

<p align="center">
<img src="/assets/imgs/cascade-inference-performance-a100.png" alt="speedup-a100" width="800"/>
<br>
-Figure 4. Speedup over vLLM PageAttention on A100 PCIe 80GB
+Figure 5. Speedup over vLLM PageAttention on A100 PCIe 80GB
</p>

-Figure 3 and 4 show the normalized performance on FlashInfer kernels in cascading and non-cascading setting
+Figure 4 and 5 show the normalized performance on FlashInfer kernels in cascading and non-cascading setting
over the vLLM implementation. FlashInfer kernels in both settings outperform vLLM kernels, and cascading kernels achieve significant speedup over non-Cascade Inference kernels in most cases.
The benefit of cascade inference increases as the shared prefix length and batch size grow (where the prefill kernel dominates execution time) and decreases as we increase the unique suffix length (where the batch decode kernel dominates execution time). For very long shared prompts (32768), the decode kernel can get up to 31x speedup on H100 SXM 80GB with large batch size (≥128) and short unique kv-length (≤256).

