
Commit b558807

fix typos
1 parent 6cf8629 commit b558807

2 files changed: +8 -8 lines


_posts/2024-01-03-introduce-flashinfer.md

Lines changed: 2 additions & 2 deletions
@@ -161,7 +161,7 @@ FlashInfer also implemented batch append attention kernel where key/value is sto
 <p align="center">
 <img src="/assets/imgs/single-gqa-benchmark.png" alt="single gqa benchmarks" width="800"/>
 <br>
-Figure 8: Single request GQA decode performance, use llama2-70b setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 8: Single request GQA decode performance, use llama2-70b setting: tp=2, num_kv_heads=4, num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
 </p>
 
 For single-request GQA decoding attention, FlashInfer (Tensor Cores) achieves better performance than FlashAttention 2.4.2 on both A100 & H100, and FlashInfer (CUDA Cores) can only achieve 40%+ bandwidth utilization because of limited CUDA Cores performance.
@@ -196,7 +196,7 @@ FlashInfer implements high-performance fp8 decode decode kernels, which could ac
 <p align="center">
 <img src="/assets/imgs/fp8-attention.png" alt="fp8 attention" width="800"/>
 <br>
-Figure 11: FP8 decode attention performance, use llama-7b setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 8192.
+Figure 11: FP8 decode attention performance, use Llama2-7B setting: num_kv_heads=num_qo_heads=32, head_dim=128. Sequence length varies from 32 to 65536.
 </p>
 
 There is some gap between bandwidth utilization of fp8 and fp16 kernels, however the gap is getting closer as the query length grows.
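
For context on the corrected GQA caption: with num_qo_heads=32 and num_kv_heads=4, each KV head is shared by 32 / 4 = 8 query heads, so the per-KV-head decode step is a small matrix-matrix product rather than a matrix-vector product, which is what lets the Tensor-Core path outpace the CUDA-Core path mentioned in the diff. Below is a minimal PyTorch reference of that grouped-query decode step; the function and variable names are illustrative, not FlashInfer's API.

```python
import torch

def gqa_decode_reference(q, k, v):
    # q: [num_qo_heads, head_dim]             single decode-step query
    # k, v: [num_kv_heads, kv_len, head_dim]  KV-Cache of one request
    num_qo_heads, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group = num_qo_heads // num_kv_heads      # 32 / 4 = 8 query heads per KV head
    q = q.view(num_kv_heads, group, head_dim)
    # Per KV head: an (8 x head_dim) @ (head_dim x kv_len) matmul, not a GEMV.
    scores = torch.einsum("hgd,hnd->hgn", q, k) / head_dim ** 0.5
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("hgn,hnd->hgd", probs, v)
    return out.reshape(num_qo_heads, head_dim)

# Shapes from the caption: num_qo_heads=32, num_kv_heads=4, head_dim=128.
q = torch.randn(32, 128)
k = torch.randn(4, 4096, 128)
v = torch.randn(4, 4096, 128)
print(gqa_decode_reference(q, k, v).shape)  # torch.Size([32, 128])
```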

_posts/2024-01-08-cascade-inference.md

Lines changed: 6 additions & 6 deletions
@@ -64,7 +64,7 @@ The above n-ary merge operator is consistent with the binary merge operator, and
 <p align="center">
 <img src="/assets/imgs/recursive-attention.png" alt="recursive-attention" width="800"/>
 <br>
-Figure 3. Different order to merge attention states are mathematically equivalent.
+Figure 2. Different order to merge attention states are mathematically equivalent.
 </p>
 
 Recursive Attention allow us to decompose attention computation into multiple stages, different stages
@@ -78,12 +78,12 @@ we propose the following Divide-and-Conquer algorithm:
 2. Use batch decode attention kernel to compute the attention state between queries and KV-Cache of unique suffixes.
 3. Use merge operator to combine two attention states to get the final attention output.
 
-The overall workflow is explained on the left side of Figure 2, different color of rectangles are processed in different thread blocks in GPU. Note that for multi-query attention kernels, we access KV-Cache through SMEM or registers and for decode kernels we can only access KV-Cache through L2 Cache or Global Memory. Cascade Inference allow us to maximize memory reuse for common prefix, thus making the attention computation much more memory efficient.
+The overall workflow is explained on the left side of Figure 3, different color of rectangles are processed in different thread blocks in GPU. Note that for multi-query attention kernels, we access KV-Cache through SMEM or registers and for decode kernels we can only access KV-Cache through L2 Cache or Global Memory. Cascade Inference allow us to maximize memory reuse for common prefix, thus making the attention computation much more memory efficient.
 
 <p align="center">
 <img src="/assets/imgs/cascade-inference.png" alt="Cascade Inference" width="800"/>
 <br>
-Figure 2. Workflow of Cascade Inference, throughput values adapted from blog: <a href="https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38">TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware</a>
+Figure 3. Workflow of Cascade Inference, throughput values adapted from blog: <a href="https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38">TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison between ML Hardware</a>
 </p>
 
 We call the divide-and-conquer approach for shared-prefix attention the "Cascade Inference".
@@ -95,16 +95,16 @@ We evaluate Cascade Inference on H100 SXM 80GB and A100 PCIE 80GB GPUs. The inpu
 <p align="center">
 <img src="/assets/imgs/cascade-inference-performance-h100.png" alt="speedup-h100" width="800"/>
 <br>
-Figure 3. Speedup over vLLM PageAttention on H100 SXM 80GB
+Figure 4. Speedup over vLLM PageAttention on H100 SXM 80GB
 </p>
 
 <p align="center">
 <img src="/assets/imgs/cascade-inference-performance-a100.png" alt="speedup-a100" width="800"/>
 <br>
-Figure 4. Speedup over vLLM PageAttention on A100 PCIe 80GB
+Figure 5. Speedup over vLLM PageAttention on A100 PCIe 80GB
 </p>
 
-Figure 3 and 4 show the normalized performance on FlashInfer kernels in cascading and non-cascading setting
+Figure 4 and 5 show the normalized performance on FlashInfer kernels in cascading and non-cascading setting
 over vLLM implementation. FlashInfer kernels in both settings outperforms vLLM kernels, and cascading kernels significant speedup over non-Cascade Inference kernels in most cases.
 The benefit of cascade inference increases as shared prefix length and batch size grows (where the prefill kernel dominates execution time) and decreases as we increase unique suffix length (where the batch decode kernel dominates execution time). For very long shared prompt (32768), the decode kernel can get up to 31x speedup on H100 SXM 80GB with large batch size(≥128) and short unique kv-length (≤256).
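
For context on step 3 in the diff above: the merge operator combines two attention states, each an output vector paired with the log-sum-exp of its attention scores over a subset of the KV-Cache, and recovers exactly the attention output over the union of the two subsets. That is what lets the shared-prefix state and the unique-suffix states be computed by separate kernels and merged afterwards. A minimal PyTorch sketch of the idea, with illustrative names rather than FlashInfer's API:

```python
import torch

def attn_state(q, k, v):
    # Attention state over one KV subset: q: [head_dim], k/v: [kv_len, head_dim].
    s = (k @ q) / q.shape[-1] ** 0.5          # attention scores, [kv_len]
    return torch.softmax(s, dim=0) @ v, torch.logsumexp(s, dim=0)

def merge_state(o1, lse1, o2, lse2):
    # Binary merge of two attention states (o: [head_dim], lse: scalar).
    m = torch.maximum(lse1, lse2)
    w1, w2 = torch.exp(lse1 - m), torch.exp(lse2 - m)
    return (o1 * w1 + o2 * w2) / (w1 + w2), m + torch.log(w1 + w2)

# Attention over [shared prefix | unique suffix] equals the merge of the parts.
q = torch.randn(128)
k, v = torch.randn(1024, 128), torch.randn(1024, 128)
o_full, _ = attn_state(q, k, v)
o_p, lse_p = attn_state(q, k[:768], v[:768])    # shared-prefix state
o_s, lse_s = attn_state(q, k[768:], v[768:])    # unique-suffix state
o_merged, _ = merge_state(o_p, lse_p, o_s, lse_s)
print(torch.allclose(o_full, o_merged, atol=1e-5))  # True
```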
