frugally-deep 0.16.0 appears to break kernel/model files #3588

Open
MCJack123 opened this issue Mar 8, 2025 · 3 comments

@MCJack123

I recently updated my AI workflow to ROCm 6.3.2 on Arch Linux and found that some PyTorch operations crash with "MIOpen Error: tensor_shape_variable needs to be an array". After some debugging, I narrowed it down to fdeep::internal::create_tensor_shape_variable_offset receiving an incorrect parameter. Looking through the frugally-deep source and the model file it was loading, I noticed that fdeep looks for batch_shape, while the model file uses batch_input_shape.

This change in 0.16.0 appears to be causing this specific issue: Dobiasd/frugally-deep@a60717c#diff-a674970aa0b9e26d68cc8783ce1aa3f82425780a062969020febb6fda1371701L500-R507 The change modified the expected key from batch_input_shape to batch_shape. However, after fixing that, I found that inbound_nodes now has a significantly different structure as well, also shown in the above commit. I'm not well versed in the inner workings of this stuff, but I'm guessing there's a new file format with TensorFlow 2.16.1 that breaks the old files, and fdeep's update changes it to use that format instead.

The files src/kernels/gfx9[08|0a|42].tn.model will need to be updated to this new format to support frugally-deep 0.16.0 when MIOpen is built with MIOPEN_ENABLE_AI_KERNEL_TUNING (which is the default). I'd submit a PR myself if I knew the format and were confident the change fixed the issue without causing other problems, but that is not the case.
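
For anyone who wants to check which key style a given model file uses, here is a minimal sketch. It assumes the .tn.model files are plain JSON (as frugally-deep model files normally are); the helper names count_keys and shape_key_style and the sample dictionary are made up for illustration and are not part of MIOpen or frugally-deep.

```python
import json
from collections import Counter

# The two key names discussed in this issue.
SHAPE_KEYS = {"batch_input_shape", "batch_shape"}


def count_keys(node, names, counts=None):
    """Recursively count how often each key in `names` occurs in parsed JSON."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in names:
                counts[key] += 1
            count_keys(value, names, counts)
    elif isinstance(node, list):
        for item in node:
            count_keys(item, names, counts)
    return counts


def shape_key_style(path):
    """Load a model file (assumed to be JSON) and count both key spellings."""
    with open(path, encoding="utf-8") as f:
        return count_keys(json.load(f), SHAPE_KEYS)


# Hypothetical, hand-written fragment in the older style (not a real model file).
old_style_sample = {
    "config": {"layers": [{"config": {"batch_input_shape": [None, 8]}}]}
}
sample_counts = count_keys(old_style_sample, SHAPE_KEYS)
print(dict(sample_counts))
```

Running shape_key_style on one of the files named above would show whether it still carries batch_input_shape. It says nothing about the inbound_nodes layout, which would need a separate check.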

@sakura-nyaa

I'm also getting this error:
"MIOpen Error: tensor_shape_variable needs to be an array"

I get the error whenever using torch.nn.functional.conv2d.
Here's an output with MIOpen logging turned on when running a minimal program to trigger the error:

MIOpen(HIP): Info [get_device_name] Raw device name: gfx1102
MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1102
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7ffc6f739908
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 1
MIOpen(HIP):    nbDims = 4
MIOpen(HIP):    dim.values = { 32 4 32 32 }
MIOpen(HIP):    stride.values = { 4096 1024 32 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x56a9c37153c0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 1
MIOpen(HIP):    nbDims = 4
MIOpen(HIP):    dim.values = { 32 4 3 3 }
MIOpen(HIP):    stride.values = { 36 9 3 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7e3015452807
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 1
MIOpen(HIP):    nbDims = 4
MIOpen(HIP):    dim.values = { 32 32 30 30 }
MIOpen(HIP):    stride.values = { 28800 900 30 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){
MIOpen(HIP):    convDesc = 0x100
MIOpen(HIP): }
MIOpen(HIP): Info [] MIOPEN_FIND_MODE = DYNAMIC_HYBRID(5)
MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, const int *, const int *, const int *, miopenConvolutionMode_t){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    spatialDim = 2
MIOpen(HIP):    pads = { 0 0 }
MIOpen(HIP):    strides = { 1 1 }
MIOpen(HIP):    dilations = { 1 1 }
MIOpen(HIP):    c_mode = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    groupCount = 1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionAttribute(miopenConvolutionDescriptor_t, const miopenConvolutionAttrib_t, const int){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    attr = 1
MIOpen(HIP):    value = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    wDesc = {32, 4, 3, 3}, {36, 9, 3, 1}, packed,
MIOpen(HIP):    xDesc = {32, 4, 32, 32}, {4096, 1024, 32, 1}, packed,
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    yDesc = {32, 32, 30, 30}, {28800, 900, 30, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 6.3.42134, MIOpen version 3.3.0.d22d5a13f-dirty
MIOpen(HIP): Info2 [GetWorkSpaceSize]
MIOpen(HIP): Info [GetSolutions]
MIOpen(HIP): Info [IsNetworkedFilesystem] Filesystem type at '"/home/neil//.config/miopen/"' is: 0xef53 'EXT2/3/4_SUPER_MAGIC'
MIOpen(HIP): Info2 [GetLibPath] Lib Path: "/opt/rocm/lib/libMIOpen.so.1.0"
MIOpen(HIP): Info2 [GetInstalledPathFile] inexact find database search
MIOpen(HIP): Info2 [GetInstalledPathFile] Iterating over find db directory "/opt/rocm/share/miopen/db"
MIOpen(HIP): Info [Measure] ReadonlyRamDb::Prefetch time: 5e-05 ms
MIOpen(HIP): Info [Prefetch] File is unreadable: "/home/neil//.config/miopen/gfx1102_16.HIP.3_3_0_d22d5a13f-dirty.ufdb.txt"
MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.00856 ms
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 4-32-32-3x3-32-30-30-32-0x0-1x1-1x1-0-NCHW-FP32-F in cache for file "/home/neil//.config/miopen/gfx1102_16.HIP.3_3_0_d22d5a13f-dirty.ufdb.txt"
MIOpen(HIP): Info2 [FindRecord] Looking for key 4-32-32-3x3-32-30-30-32-0x0-1x1-1x1-0-NCHW-FP32-F in file ""
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.02485 ms
MIOpen Error: tensor_shape_variable needs to be an array
MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    xDesc = {32, 4, 32, 32}, {4096, 1024, 32, 1}, packed,
MIOpen(HIP):    x = 0x7e2e66801200
MIOpen(HIP):    wDesc = {32, 4, 3, 3}, {36, 9, 3, 1}, packed,
MIOpen(HIP):    w = 0x7e2e66800000
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    yDesc = {32, 32, 30, 30}, {28800, 900, 30, 1}, packed,
MIOpen(HIP):    y = 0x7e2d5d800000
MIOpen(HIP):    requestAlgoCount = 1
MIOpen(HIP):    returnedAlgoCount = 32764
MIOpen(HIP):    perfResults =
MIOpen(HIP):    workSpace = nullptr
MIOpen(HIP):    workSpaceSize = 0
MIOpen(HIP):    exhaustiveSearch = 0
MIOpen(HIP): }
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetSolutions]
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 4-32-32-3x3-32-30-30-32-0x0-1x1-1x1-0-NCHW-FP32-F in cache for file "/home/neil//.config/miopen/gfx1102_16.HIP.3_3_0_d22d5a13f-dirty.ufdb.txt"
MIOpen(HIP): Info2 [FindRecord] Looking for key 4-32-32-3x3-32-30-30-32-0x0-1x1-1x1-0-NCHW-FP32-F in file ""
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.025221 ms
MIOpen Error: tensor_shape_variable needs to be an array
MIOpen(HIP): miopenStatus_t miopenDestroyConvolutionDescriptor(miopenConvolutionDescriptor_t){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(HIP):    tensorDesc = {32, 4, 3, 3}, {36, 9, 3, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(HIP):    tensorDesc = {32, 32, 30, 30}, {28800, 900, 30, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(HIP):    tensorDesc = {32, 4, 32, 32}, {4096, 1024, 32, 1}, packed,
MIOpen(HIP): }
Traceback (most recent call last):
  File "/home/neil/trigger_error.py", line 7, in <module>
    result = F.conv2d(input, weight)
RuntimeError: miopenStatusUnknownError

Here's the minimal program:

import torch
from torch.nn import functional as F

weight = torch.randn(32, 4, 3, 3).cuda()
input = torch.randn(32, 4, 32, 32).cuda()

result = F.conv2d(input, weight)
print(f"{result.shape=} {result.dtype=} {result.device=}")
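
The comment does not say how logging was turned on for the output above. One way to get a comparable log, assuming MIOpen's MIOPEN_ENABLE_LOGGING and MIOPEN_LOG_LEVEL environment variables (level 6 corresponds to the Info2 lines), is to launch the script from a small wrapper. The helper names below are hypothetical.

```python
import os
import subprocess
import sys


def miopen_logging_env(level=6, base=None):
    """Return a copy of the environment with MIOpen logging switched on.

    Assumption: MIOPEN_ENABLE_LOGGING=1 prints the API call traces and
    MIOPEN_LOG_LEVEL=6 includes the detailed "Info2" messages.
    """
    env = dict(os.environ if base is None else base)
    env["MIOPEN_ENABLE_LOGGING"] = "1"
    env["MIOPEN_LOG_LEVEL"] = str(level)
    return env


def run_with_logging(script, level=6):
    """Run `script` in a child Python process with MIOpen logging enabled."""
    return subprocess.run(
        [sys.executable, script],
        env=miopen_logging_env(level),
        check=False,  # the script is expected to fail while the bug is present
    )


# Example (not executed here): run_with_logging("trigger_error.py")
```

Setting the variables in a child process keeps them out of the current shell and ensures they are in place before the library is loaded.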

I'll try parsing the findings of the OP and see if I can get it working. If I do I'll report back.

@IMbackK

IMbackK commented Mar 11, 2025

I can reproduce this issue.

MIOpenDriver might be a more convenient way to reproduce this issue, as shown in #3597

@IMbackK

IMbackK commented Mar 11, 2025

I can confirm this issue stems from frugally-deep 0.16+; building MIOpen against frugally-deep 0.15.20 avoids it.
