This repo contains the artifact for our PPoPP paper *Dynamic N:M Fine-grained Structured Sparse Attention Mechanism*.
The accuracy evaluation script requires two A100 GPUs; the speedup evaluation requires one A100 GPU. Pre-Ampere GPUs are not supported, as DFSS relies on the Ampere sparse tensor core. The remaining requirements are covered by the Dockerfile.
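If you want to confirm that your machine qualifies before building anything, a quick check like the one below can help. It is only a sketch and not part of the artifact; it verifies the visible GPU count and that each device has compute capability 8.0 or higher (Ampere).

```python
# A sketch (not part of the artifact) that checks the visible GPUs meet the
# requirements: compute capability >= 8.0 (Ampere) for the sparse tensor core,
# and two GPUs for the accuracy evaluation.
import torch

num_gpus = torch.cuda.device_count()
print(f"visible GPUs: {num_gpus}")
for i in range(num_gpus):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")
    assert major >= 8, "DFSS requires an Ampere-class GPU with sparse tensor cores"
assert num_gpus >= 2, "the accuracy evaluation expects two A100 GPUs"
```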
Get the source code with

```bash
git clone https://github.com/apuaaChen/DFSS.git
```
Then fetch the submodules:

```bash
git submodule update --init --recursive
```
We use the NGC PyTorch container 21.06. To build the container, run

```bash
cd docker && bash build.sh
```
To launch the container, run

```bash
cd .. && bash docker/launch.sh
```
The code is mounted at `/workspace/dfss`.
Our package `pydfss` can be installed with

```bash
cd /workspace/dfss && bash install.sh
```
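To verify that the installation succeeded, a minimal import check can be run. The module name used below is an assumption based on the package name; substitute whatever name `install.sh` actually registers if it differs.

```python
# Minimal post-install sanity check. The import name `pydfss` is an assumption
# based on the package name; adjust it if install.sh registers a different module.
import pydfss  # noqa: F401
print("pydfss imported successfully")
```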
We provide a script to reproduce the attention speedup under different sequence lengths with the bfloat16 data type:

```bash
python benchmark.py
```
As mentioned in the paper, this script compares only the QK^T, Softmax, and AV stages, as the optimizations in other parts are orthogonal to DFSS. The expected result is around

```
attention speedup: 1.38 ~ 1.86
```
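For readers who want to see what these three stages and the 2:4 pattern look like in plain PyTorch, the sketch below shows a dense attention baseline (QK^T, Softmax, AV) and magnitude-based 2:4 pruning of the score matrix. It illustrates the idea only and is not the kernels used by `pydfss`; the tensor shapes and the pruning helper are assumptions.

```python
# Minimal sketch of the dense attention baseline (QK^T, Softmax, AV) and of
# magnitude-based 2:4 pruning of the score matrix. Illustration only; not the
# kernels shipped in pydfss. Shapes and helper names are assumptions.
import torch

def dense_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = torch.matmul(q, k.transpose(-1, -2)) / q.shape[-1] ** 0.5  # QK^T (scaled)
    probs = torch.softmax(scores, dim=-1)                               # Softmax
    return torch.matmul(probs, v)                                       # AV

def prune_2_to_4(scores):
    # Keep the 2 largest-magnitude entries in every group of 4 along the last
    # dimension and zero out the rest (the N:M = 2:4 structured pattern).
    b, h, m, n = scores.shape
    groups = scores.reshape(b, h, m, n // 4, 4)
    # Indices of the 2 smallest-magnitude entries per group of 4.
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool).scatter(-1, drop, False)
    return (groups * mask).reshape(b, h, m, n)

# Requires a CUDA GPU; bfloat16 matches the data type used in the benchmark.
q = torch.randn(1, 16, 512, 64, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = dense_attention(q, k, v)
sparse_scores = prune_2_to_4(torch.matmul(q, k.transpose(-1, -2)))
```

In the paper's design, the 2:4-pruned score matrix is what the Ampere sparse tensor core consumes in the subsequent stages, which is where the measured speedup comes from.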
We provide training and inference scripts for BERT-large on SQuAD v1.1 with DFSS 2:4 under the bfloat16 data type (Table 2 in the paper). The script requires two A100 GPUs and takes about 1.5 hours to finish:

```bash
mkdir ckpt && python bert_squad_finetuning.py
```
The expected result is

```
F1 score on BERT-large SQuAD v1.1
Transformer: 93.10, DFSS 2:4: 93.19
```