FlashAttention for vLLM

This is a fork of https://github.com/Dao-AILab/flash-attention customized for vLLM.

We have the following customizations:

  • Build: CMake and a Torch library target (this package is bundled into vLLM; see the usage sketch after this list).
  • Size: Reduced templating and removal of the training-only kernels.
  • Features: Small page size support (FA2), DCP support (FA3).
  • Performance: Decode-specific optimizations for the sizes we care about, as well as mixed-batch performance optimizations. (Upstream is understandably hesitant to specialize for inference, since it also has to support training; we, on the other hand, compile out the backward-pass kernels and do not test whether our optimizations break them.)
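
As context for how the bundled library is consumed, below is a minimal sketch of calling the packaged varlen attention kernel directly from Python. The package and function names (`vllm_flash_attn`, `flash_attn_varlen_func`) and the keyword arguments are assumptions based on the upstream FlashAttention interface; this fork's actual entry points and signatures may differ, so check its flash_attn_interface module for the exact API.

```python
# Minimal sketch (assumptions: the fork is importable as vllm_flash_attn and
# mirrors upstream's flash_attn_varlen_func; verify against this repo).
import torch
from vllm_flash_attn import flash_attn_varlen_func

# Two variable-length sequences (lengths 3 and 5) packed into one token stream,
# the layout vLLM uses for mixed prefill/decode batches.
seqlens = torch.tensor([3, 5], dtype=torch.int32, device="cuda")
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(dim=0, dtype=torch.int32), (1, 0))
total_tokens, num_heads, head_dim = int(seqlens.sum()), 8, 128

q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()),
    max_seqlen_k=int(seqlens.max()),
    causal=True,
)
print(out.shape)  # (total_tokens, num_heads, head_dim)
```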
