Describe the bug
When training with multi-latent attention and FlashAttention forced on the TransformerEngine side (NVTE_FLASH_ATTN=1), DotProductAttention reports that no attention backend is available:
DEBUG:DotProductAttention:Available backends = {FlashAttention=False, FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG:DotProductAttention:Selected backend = NoBackend
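For anyone reproducing this, a minimal sketch of how to get more detail on why each backend is filtered out, assuming the TransformerEngine attention debug variables NVTE_DEBUG and NVTE_DEBUG_LEVEL behave as documented:

# Hedged debugging step: ask TransformerEngine to log the per-backend
# filtering reasons before launching the training script.
export NVTE_DEBUG=1        # enable attention debug logging
export NVTE_DEBUG_LEVEL=2  # higher level also prints why each backend was rejected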
To Reproduce
KV_LORA_RANK=512
QK_NOPE_HEAD_DIM=128
QK_ROPE_HEAD_DIM=64
V_HEAD_DIM=128
NUM_EXPERTS=32
ROUTER_TOPK=3
NUM_SHARED_EXPERTS=1
MOE_LAYER_FREQ=1
MOE_FIRST_K_DENSE_REPLACE=2
RMS_NORM_EPS=1e-6
MOE_ARGS=(
--moe-ffn-hidden-size ${MOE_INTERMEDIATE_SIZE}
--moe-router-topk ${ROUTER_TOPK}
--num-experts ${NUM_EXPERTS}
--moe-layer-freq ${MOE_LAYER_FREQ}
--moe-aux-loss-coeff 0.001
--moe-shared-expert-intermediate-size
--kv-lora-rank ${KV_LORA_RANK}
--qk-head-dim ${QK_NOPE_HEAD_DIM}
--qk-pos-emb-head-dim ${QK_ROPE_HEAD_DIM}
--v-head-dim ${V_HEAD_DIM}
--moe-grouped-gemm
)
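For context, the head geometry these settings imply (my own arithmetic, hedged: it assumes Megatron/TE concatenate the no-RoPE and RoPE parts for Q/K in MLA): Q/K end up with a different head dim than V, which is the asymmetric shape FlashAttention has to support here.

# Sketch of the head-dim arithmetic implied by the MLA settings above
# (assumption: Q/K head dim = no-RoPE part + RoPE part, V head dim stays separate)
QK_TOTAL_HEAD_DIM=$(( QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM ))            # 128 + 64 = 192
echo "Q/K head dim = ${QK_TOTAL_HEAD_DIM}, V head dim = ${V_HEAD_DIM}"  # 192 vs 128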
export NVTE_FLASH_ATTN=1 NVTE_FUSED_ATTN=0
fl_options=" --attention-backend flash "
--multi-latent-attention
--moe-router-dtype fp32   # moe
--moe-permute-fusion
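Reading the settings above together: the env vars and the Megatron flag restrict selection to FlashAttention only while explicitly disabling the cuDNN fused backend, so once FlashAttention rejects the MLA head-dim combination (192 for Q/K vs 128 for V) nothing is left and TE falls back to NoBackend. A hedged counter-test I would run (assuming "fused" is an accepted value of --attention-backend in this Megatron-LM version):

# Counter-test sketch (assumption: cuDNN fused attention may accept the
# 192/128 MLA head dims even where this flash-attn build does not):
export NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=1
fl_options=" --attention-backend fused "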
Expected behavior
MLA should be able to run with FlashAttention; without it, the GPU runs out of memory once the sequence length exceeds 20K.
Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.
Environment (please complete the following information):
- Megatron-LM commit ID
- nvcr.io/nvidia/pytorch:24.05-py3
- A100 80GB, CUDA 12.4, PyTorch 2.7.1, TransformerEngine 2.4.0, flash-attn 2.4.2
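A quick way to confirm the versions TE actually sees inside the container (a sketch; the __version__ attributes are assumed to be present on these packages):

# Hedged sanity check of the versions inside the container
python -c "import torch, transformer_engine, flash_attn; \
  print(torch.__version__, torch.version.cuda, \
        transformer_engine.__version__, flash_attn.__version__)"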
Proposed fix
If you have a proposal for how to fix the issue state it here or link to a PR.
Additional context
Add any other context about the problem here.