Closed as not planned
Labels
bug (Something isn't working)
Description
I'm running my code with:
env CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train
and getting errors like:
[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
I'm using a pretty standard DotProductAttention:
self.te_attn = te.DotProductAttention(
    num_attention_heads=24,
    kv_channels=128,
    qkv_format="thd",  # tokens, head, dim
    attn_mask_type="padding",
)
and I'm also calling it in a pretty standard way (all the assertions pass):
assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
q, k, v = torch.unbind(qkv, dim=1)
assert q.shape == k.shape == v.shape
assert q.shape == (total, self.num_heads, self.head_dim)
assert cu_seqlens.shape[0] == B + 1
xy: torch.Tensor = self.te_attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=max_seqlen_in_batch,
    max_seqlen_kv=max_seqlen_in_batch,
)
I'm kind of stuck on how to debug this. It seems like something goes wrong when reading the inputs, but I'm not sure. How should I proceed?
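One possible first step, since the shape assertions above do not look at the values inside cu_seqlens or at max_seqlen_in_batch, is to check that this metadata is self-consistent before the attention call. The helper below is a minimal sketch, not part of Transformer Engine: the name check_cu_seqlens and its rules (offsets start at 0, end at the total token count, never decrease, and no sequence is longer than the declared maximum) are assumptions about what a "thd" varlen layout needs. It does not claim to identify the cause of this particular cuDNN error.

```python
from typing import List, Sequence


def check_cu_seqlens(
    cu_seqlens: Sequence[int], total_tokens: int, max_seqlen: int
) -> List[str]:
    """Hypothetical helper: return a list of problems found in varlen metadata.

    An empty list means the cumulative offsets look consistent with the
    packed (total, heads, dim) tensors and the declared max sequence length.
    """
    if len(cu_seqlens) < 2:
        return ["cu_seqlens needs at least two entries (batch size + 1)"]

    problems: List[str] = []
    if cu_seqlens[0] != 0:
        problems.append(f"cu_seqlens[0] is {cu_seqlens[0]}, expected 0")
    if cu_seqlens[-1] != total_tokens:
        problems.append(
            f"cu_seqlens[-1] is {cu_seqlens[-1]}, expected total tokens {total_tokens}"
        )

    # Per-sequence lengths are the differences between consecutive offsets.
    lengths = [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]
    if any(n < 0 for n in lengths):
        problems.append("cu_seqlens decreases somewhere (negative sequence length)")
    longest = max(lengths)
    if longest > max_seqlen:
        problems.append(
            f"longest sequence has {longest} tokens but max_seqlen is {max_seqlen}"
        )
    return problems
```

In the forward pass this could be called as check_cu_seqlens(cu_seqlens.tolist(), q.shape[0], max_seqlen_in_batch) on each rank, printing any problems before self.te_attn runs. It is probably also worth checking separately that cu_seqlens is a torch.int32 tensor on the same CUDA device as q, which is my reading of the DotProductAttention documentation rather than something confirmed in this thread.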