Hongik University Undergraduate
Pinned
- flash-attention-cuda: Custom FlashAttention CUDA kernel for Llama 3 8B inference. 2.26x speedup over naive attention, targeting RTX 4060 Ti → Jetson AGX Orin deployment. (Python, 1 star)
- flashattn-cuda-metal: FlashAttention CUDA kernel implementation and Metal port (RTX 4060 Ti, Apple M4 Pro). (Python, 1 star)
- sdpa-attention-benchmark: Benchmarks PyTorch SDPA backends (math vs. flash) on an RTX 4060 Ti with Nsight Systems profiling; see the sketch after this list. (Python, 1 star)
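
A minimal sketch of the kind of backend comparison the sdpa-attention-benchmark entry describes, assuming PyTorch 2.3+ (for torch.nn.attention.sdpa_kernel) and a CUDA GPU; the shapes, dtype, and iteration counts are illustrative and not taken from the repo:

```python
# Hypothetical sketch: time the math vs. flash SDPA backends.
# Shapes, dtype, and iteration counts are illustrative, not the repo's actual settings.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def time_backend(backend, q, k, v, iters=50):
    # Warm up, then time with CUDA events so we measure GPU work, not launch overhead.
    with sdpa_kernel(backend):
        for _ in range(5):
            F.scaled_dot_product_attention(q, k, v)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

# Llama-3-8B-like attention shape (batch=1, heads=32, seq=2048, head_dim=128) in fp16.
q, k, v = (torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16) for _ in range(3))
math_ms = time_backend(SDPBackend.MATH, q, k, v)
flash_ms = time_backend(SDPBackend.FLASH_ATTENTION, q, k, v)
print(f"math: {math_ms:.3f} ms  flash: {flash_ms:.3f} ms  speedup: {math_ms / flash_ms:.2f}x")
```

For deeper profiling, the same loop can be run under Nsight Systems (nsys profile python bench.py) to inspect kernel launches per backend.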
