In the source file https://github.com/ROCm/aiter/blob/main/csrc/kernels/topk_plain_kernels.cu, a comment gives the condition under which the topk_per_row kernel is launched:
// Use topk_per_row kernel when:
// n + K log²K ≥ 3 × Factor(n) × n
// where Factor(n) = 1/3 + 1.6/(log₂(n) - 9.5)
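
For reference, here is how I read that condition, written as a minimal C++ sketch. The function names `factor` and `use_topk_per_row` are my own and do not come from the source file, and I am assuming that log²K means (log₂ K)², which the comment does not state. Note that 3 × Factor(n) × n expands to n + 4.8 n / (log₂(n) - 9.5), so under this reading the test reduces to K log²K ≥ 4.8 n / (log₂(n) - 9.5).

```cpp
#include <cmath>

// Factor(n) = 1/3 + 1.6 / (log2(n) - 9.5), copied from the quoted comment.
// The expression is only meaningful for log2(n) > 9.5 (n above roughly 724);
// how the real code treats smaller n is not covered by this sketch.
double factor(double n) {
    return 1.0 / 3.0 + 1.6 / (std::log2(n) - 9.5);
}

// Hypothetical helper: returns true when the quoted inequality holds.
// Assumption: "log²K" is interpreted as (log2 K)^2.
bool use_topk_per_row(double n, double k) {
    const double log_k = std::log2(k);
    return n + k * log_k * log_k >= 3.0 * factor(n) * n;
}
```

With this reading, `use_topk_per_row(65536.0, 64.0)` evaluates the test for a row of 65536 elements and K = 64, and varying `k` shows where the selection flips for a given `n`.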
Could you please explain these magic numbers (3, 1/3, 1.6, and 9.5)?
If this HIP kernel is converted to a CUDA kernel, can you suggest which parts users would need to tune for performance on an NVIDIA GPU?
Thanks