I have implemented a module using OpenMP and CUDA that runs faster while maintaining the memory efficiency of your CuPy implementation: [shikishima-TasakiLab/Involution-PyTorch](https://github.com/shikishima-TasakiLab/Involution-PyTorch). It also supports TorchScript and 16-bit floating point. See also shikishima-TasakiLab/Involution-PyTorch#1.