How to debug CUDNN_STATUS_EXECUTION_FAILED?

I'm running my code with:
```
env CUDNN_LOGERR_DBG=1  CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train
```

and getting errors like:
```
[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
```

I'm using a pretty standard `DotProductAttention`:
```
        self.te_attn = te.DotProductAttention(
                num_attention_heads=24,
                kv_channels=self.128,
                qkv_format="thd", # tokens, head, dim
                attn_mask_type="padding",        
       )
```

and I'm also calling it in a pretty standard way (all the assertions pass):
```
                assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
                q, k, v = torch.unbind(qkv, dim=1)

                assert q.shape == k.shape == v.shape
                assert q.shape == (total, self.num_heads, self.head_dim)
                assert cu_seqlens.shape[0] == B + 1

                xy: torch.Tensor = self.te_attn(
                    q, k, v,
                    cu_seqlens_q=cu_seqlens,
                    cu_seqlens_kv=cu_seqlens,
                    max_seqlen_q=max_seqlen_in_batch,
                    max_seqlen_kv=max_seqlen_in_batch,
                )
```

I'm kind of stuck on how to debug this. Seems like something is wrong with reading the inputs? Not sure. How should I proceed in debugging this?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions