How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116

@vedantroy

I'm running my code with:

env CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train

and getting errors like:

[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

I'm using a pretty standard DotProductAttention:

        self.te_attn = te.DotProductAttention(
            num_attention_heads=24,
            kv_channels=128,
            qkv_format="thd",  # tokens, head, dim
            attn_mask_type="padding",
        )

and I'm also calling it in a pretty standard way (all the assertions pass):

                assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
                q, k, v = torch.unbind(qkv, dim=1)

                assert q.shape == k.shape == v.shape
                assert q.shape == (total, self.num_heads, self.head_dim)
                assert cu_seqlens.shape[0] == B + 1

                xy: torch.Tensor = self.te_attn(
                    q, k, v,
                    cu_seqlens_q=cu_seqlens,
                    cu_seqlens_kv=cu_seqlens,
                    max_seqlen_q=max_seqlen_in_batch,
                    max_seqlen_kv=max_seqlen_in_batch,
                )
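
The assertions above only cover shapes. A minimal sketch of value-level checks on `cu_seqlens` might look like the following. `check_cu_seqlens` is a hypothetical helper, not part of Transformer Engine, and it takes a plain Python list so it can run without a GPU. The conditions it checks (offsets start at 0, end at the total token count, never decrease, and no sequence exceeds the declared maximum) are my assumptions about what a packed `thd` layout needs.

```python
from typing import List, Sequence


def check_cu_seqlens(
    cu_seqlens: Sequence[int],
    total_tokens: int,
    batch_size: int,
    max_seqlen: int,
) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []

    # One offset per sequence boundary: batch_size + 1 entries.
    if len(cu_seqlens) != batch_size + 1:
        problems.append(
            f"expected {batch_size + 1} entries, got {len(cu_seqlens)}"
        )
        return problems  # the remaining checks assume the right length

    if cu_seqlens[0] != 0:
        problems.append(f"first offset is {cu_seqlens[0]}, expected 0")

    if cu_seqlens[-1] != total_tokens:
        problems.append(
            f"last offset is {cu_seqlens[-1]}, expected total_tokens={total_tokens}"
        )

    # Per-sequence lengths are the differences of consecutive offsets.
    lengths = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]

    if any(n < 0 for n in lengths):
        problems.append("offsets decrease somewhere (negative sequence length)")

    if lengths and max(lengths) > max_seqlen:
        problems.append(
            f"longest sequence is {max(lengths)}, but max_seqlen={max_seqlen}"
        )

    return problems
```

With the names from the snippet above, this would be called as `check_cu_seqlens(cu_seqlens.tolist(), total, B, max_seqlen_in_batch)` right before `self.te_attn(...)`. As far as I can tell from the Transformer Engine documentation, `cu_seqlens` is also expected to have dtype `torch.int32`, so `cu_seqlens.dtype` and `cu_seqlens.device` are worth printing alongside it.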

I'm stuck on how to debug this. It seems like something goes wrong when the kernel reads its inputs, but I'm not sure. How should I proceed?
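
One possible next step is to rerun on a single process with more verbose settings, so the failure is reported at the call that caused it and not interleaved across 8 ranks. The sketch below only builds the environment. `debug_env` and `DEBUG_VARS` are hypothetical names. `CUDA_LAUNCH_BLOCKING` is a standard CUDA variable and the two `CUDNN_*` variables come from the error message itself. `NVTE_DEBUG` and `NVTE_DEBUG_LEVEL` are assumed from the Transformer Engine documentation and may depend on the installed version.

```python
import os
from typing import Dict, Mapping, Optional

# Variables to layer on top of the existing environment.
DEBUG_VARS: Dict[str, str] = {
    # Make kernel launches synchronous so errors surface at the failing call.
    "CUDA_LAUNCH_BLOCKING": "1",
    # cuDNN error logging, as suggested by the error message.
    "CUDNN_LOGERR_DBG": "1",
    "CUDNN_LOGDEST_DBG": "stderr",
    # Assumption: Transformer Engine debug output (version dependent).
    "NVTE_DEBUG": "1",
    "NVTE_DEBUG_LEVEL": "2",
}


def debug_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with DEBUG_VARS applied."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_VARS)
    return env
```

The result can be passed to `subprocess.run([...], env=debug_env())` with the same `torchrun` command and `--nproc_per_node=1`. Setting `NVTE_FUSED_ATTN=0` as a separate experiment (again assuming the installed version honors it) would show whether the failure is specific to the cuDNN fused attention backend.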
