Hongik University Undergraduate
Pinned
- flash-attention-cuda: Custom FlashAttention CUDA kernel for Llama 3 8B inference. 2.26x speedup over naive attention, targeting RTX 4060 Ti → Jetson AGX Orin deployment. (Python, 1 star)
- flashattn-cuda-metal: FlashAttention CUDA kernel implementation and Metal port (RTX 4060 Ti, Apple M4 Pro). (Python, 1 star)
- sdpa-attention-benchmark: Benchmarks PyTorch SDPA backends (math vs. flash) on an RTX 4060 Ti with Nsight Systems profiling; see the sketch after this list. (Python, 1 star)
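
A minimal sketch of the kind of backend comparison the sdpa-attention-benchmark entry describes, assuming PyTorch 2.3+ (for torch.nn.attention.sdpa_kernel) and a CUDA GPU; the shapes, dtype, and iteration counts are illustrative and not taken from the repo:

```python
# Hypothetical sketch: time the math vs. flash SDPA backends.
# Shapes, dtype, and iteration counts are illustrative, not the repo's actual settings.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def time_backend(backend, q, k, v, iters=50):
    # Warm up, then time with CUDA events so we measure GPU work, not launch overhead.
    with sdpa_kernel(backend):
        for _ in range(5):
            F.scaled_dot_product_attention(q, k, v)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

# Llama-3-8B-like attention shape (batch=1, heads=32, seq=2048, head_dim=128) in fp16.
q, k, v = (torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16) for _ in range(3))
math_ms = time_backend(SDPBackend.MATH, q, k, v)
flash_ms = time_backend(SDPBackend.FLASH_ATTENTION, q, k, v)
print(f"math: {math_ms:.3f} ms  flash: {flash_ms:.3f} ms  speedup: {math_ms / flash_ms:.2f}x")
```

For deeper profiling, the same loop can be run under Nsight Systems (nsys profile python bench.py) to inspect kernel launches per backend.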
