-
There are some fused kernels in the repository, such as fused_bias_geglu, which fuses the bias addition, GELU activation, and gating into a single kernel. However, it looks like some of the more advanced fused kernels have not been released to the public. Looking at the …
If I didn't miss anything and they are indeed not released, do you think it would be possible to open-source them?
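For readers following along, here is a minimal sketch of what a bias + GeGLU fusion like fused_bias_geglu computes in one pass. Everything in it (the kernel name, the float32-only `[rows, 2*cols]` layout with the gate in the second half, and the tanh GELU approximation) is my own assumption for illustration, not the repository's actual code:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// tanh approximation of GELU, as commonly used inside fused kernels
__device__ __forceinline__ float gelu_tanh(float x) {
    const float c = 0.7978845608028654f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

// Hypothetical layout: in [rows, 2*cols] = (activation half | gate half),
// bias [2*cols], out [rows, cols]. One thread block per row.
__global__ void fused_bias_geglu_kernel(const float* in, const float* bias,
                                        float* out, int rows, int cols) {
    int row = blockIdx.x;
    for (int col = threadIdx.x; col < cols; col += blockDim.x) {
        float a = in[row * 2 * cols + col]        + bias[col];         // activation path
        float g = in[row * 2 * cols + cols + col] + bias[cols + col];  // gating path
        out[row * cols + col] = gelu_tanh(a) * g;
    }
}
```

The point of the fusion is memory traffic: done as separate bias-add, GELU, and multiply ops, the `[rows, 2*cols]` intermediate would be read and written several times; fused, it makes a single trip through global memory.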
-
According to the post DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, DeepSpeed uses deep fusion and inference-customized GeMM kernels to improve inference performance. However, from the source code (e.g. pt_bindings.cpp and ds_transformer_cuda.cpp), the kernels do not seem to be fused to the extent described in DeepSpeed Inference: Multi-GPU inference with customized inference kernels and quantization support (e.g. "Input Layer-Norm plus Query, Key, and Value GeMMs and their bias adds." and "Intermediate FF, Layer-Norm, Bias-add, Residual, and Gaussian Error Linear Unit (GELU)"), and the GeMM is delegated to cuBLAS. Does anyone know where to find the implementations of the deeply fused kernels and the customized GeMM kernels? Thanks~
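One plausible reading, pending an answer from the maintainers: "deep fusion" may refer to fusing the elementwise operations between GeMMs (bias adds, residual adds, GELU, layer norm) into custom kernels, while the GeMM itself stays in cuBLAS. A minimal sketch of that pattern for a bias-add + residual + GELU epilogue is below; the names ff_layer and fused_bias_residual_gelu, the row-major float32 layout, and the tanh GELU approximation are all illustrative assumptions, not DeepSpeed's actual API:

```cuda
#include <cublas_v2.h>
#include <math.h>

// Fused epilogue: bias add + residual add + GELU in one kernel, so the
// GeMM output makes a single round trip through global memory.
__global__ void fused_bias_residual_gelu(const float* gemm_out,
                                         const float* bias,
                                         const float* residual,
                                         float* out, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows * cols) return;
    float v = gemm_out[i] + bias[i % cols] + residual[i];
    const float c = 0.7978845608028654f;  // sqrt(2/pi), tanh GELU approximation
    out[i] = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
}

// Hypothetical feed-forward step: out = GELU(x @ w + bias + residual),
// x [m, k], w [k, n] row-major; tmp is a preallocated [m, n] scratch buffer.
void ff_layer(cublasHandle_t h, const float* x, const float* w,
              const float* bias, const float* residual,
              float* tmp, float* out, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // The GeMM itself is delegated to cuBLAS; row-major C = A*B is
    // expressed as the column-major product C^T = B^T * A^T.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k,
                &alpha, w, n, x, k, &beta, tmp, n);
    int total = m * n, block = 256;
    fused_bias_residual_gelu<<<(total + block - 1) / block, block>>>(
        tmp, bias, residual, out, m, n);
}
```

Under this reading there is no contradiction: cuBLAS supplies the matrix multiply, and the fusion is in collapsing everything between consecutive GeMMs into single kernels. Whether the unreleased kernels go further than this is exactly the open question.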