MIOpen(HIP): Error using BatchNorm #3943

@johnlockejrr

Description

pytorch-lightning    2.6.0
torch                2.9.1+rocm7.1.1.lw.git351ff442
torchaudio           2.9.0+rocm7.1.1.gite3c6ee2b
torchmetrics         1.8.2
torchvision          0.24.0+rocm7.1.1.gitb919bd0c
GPU: gfx1151
Python: 3.12.12
OS: Ubuntu 22.04 LTS
ROCm version: rocm-7.1.1

Everything else works fine on my box: llm.cpp, Ollama, vLLM, lemonade... I have trained YOLO and RF-DETR models without problems.

However, when I train a model that uses BatchNorm (torch.nn.BatchNorm2d), it crashes. If I use torch.nn.GroupNorm instead of torch.nn.BatchNorm2d, training works, but it runs about as slowly as on CPU (even though I'm on the GPU).
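
For reference, the substitution I tried looks roughly like this (a minimal sketch; the num_groups value is an arbitrary choice on my side, not something PyLaia prescribes):

import torch.nn as nn

out_channels = 16  # example value

# GroupNorm avoids the failing MIOpen batch-norm kernel, but training then runs
# at roughly CPU speed for me. num_groups must divide out_channels.
norm = nn.GroupNorm(num_groups=8, num_channels=out_channels)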

Here is my error:

Using Python 3.12.12 environment at: /home/incognito/AI/pylaia-reloaded/.venv
Seed set to 74565
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
You are using a CUDA device ('AMD Radeon Graphics') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃   ┃ Name      ┃ Type     ┃ Params ┃ Mode  ┃ FLOPs ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ 0 │ model     │ LaiaCRNN │  5.3 M │ train │     0 │
│ 1 │ criterion │ CTCLoss  │      0 │ train │     0 │
└───┴───────────┴──────────┴────────┴───────┴───────┘
Trainable params: 5.3 M
Non-trainable params: 0
Total params: 5.3 M
Total estimated model params size (MB): 21
Modules in train mode: 25
Modules in eval mode: 0
Total FLOPs: 0
TR - E0:   0%|                                                                                                                                                                                          | 0/2430 [00:00<?, ?it/s]<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/hipoc/hipoc_program.cpp:299: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
[2025-12-05 10:46:54,430 CRITICAL laia] Uncaught exception:
Traceback (most recent call last):
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_exception.py", line 27, in exception_catcher
    yield
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_module.py", line 130, in training_step
    batch_y_hat = self.model(batch_x)
                  ^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/models/htr/laia_crnn.py", line 116, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/models/htr/conv_block.py", line 81, in forward
    x = self.batchnorm(x)
        ^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/batchnorm.py", line 193, in forward
    return F.batch_norm(
           ^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/functional.py", line 2813, in batch_norm
    return torch.batch_norm(
           ^^^^^^^^^^^^^^^^^
RuntimeError: miopenStatusUnknownError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/incognito/AI/pylaia-reloaded/.venv/bin/pylaia-htr-train-ctc", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/scripts/htr/train_ctc.py", line 255, in main
    run(**args)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/scripts/htr/train_ctc.py", line 181, in run
    trainer.fit(engine_module, datamodule=data_module)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 584, in fit
    call._call_and_handle_interrupt(
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in _fit_impl
    self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1123, in _run_stage
    self.fit_loop.run()
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 217, in run
    self.advance()
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 465, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 153, in run
    self.advance(data_fetcher)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 352, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 192, in run
    self._optimizer_step(batch_idx, closure)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 270, in _optimizer_step
    call._call_lightning_module_hook(
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 177, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/core/module.py", line 1368, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/plugins/precision/precision.py", line 123, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/optimizer.py", line 517, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/optimizer.py", line 82, in _use_grad
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/rmsprop.py", line 157, in step
    loss = closure()
           ^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/plugins/precision/precision.py", line 109, in _wrap_closure
    closure_result = closure()
                     ^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 131, in closure
    step_output = self._step_fn()
                  ^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 329, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/htr_engine_module.py", line 39, in training_step
    result = super().training_step(batch, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_module.py", line 129, in training_step
    with self.exception_catcher(batch):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_exception.py", line 29, in exception_catcher
    raise EngineException(
laia.engine.engine_exception.EngineException: Exception "RuntimeError('miopenStatusUnknownError')" raised during epoch=0, global_step=0 with batch=['train/BL_Add_MS_21581_113_line_023', 'train/Hebrew_Samaritan_wu1y3qt9c6dgp.13_line_017', 'train/BL_Or_6461_109_line_013', 'train/CUL_MS_Add._713_0067_b_line_012', 'train/HCL_Ms._Sam._1_0097_a_line_009', 'train/BL_Or_6461_104_line_000', 'train/UBL_Vollers_1120_252_line_011', 'train/CTC_R.15.54_030_line_020', 'train/BNF_Sam.2_221_line_026', 'train/CBL_Ms._Heb_752_217_line_002', 'train/BL_Or_12269_059_line_025', 'train/HCL_Ms._Sam._1_0168_b_line_006', 'train/CUL_MS_Add._713_0192_a_line_005', 'train/NLI_Ms._Sam._86_382_line_024', 'train/NLR_Sam._IIA_64_027a_line_011', 'train/CUL_MS_Add._713_0185_a_line_007']
TR - E0:   0%|          | 0/2430 [00:00<?, ?it/s]

Problematic code (from laia/models/htr/conv_block.py):

        # Add Batch normalization
        self.batchnorm = nn.BatchNorm2d(out_channels) if batchnorm else None
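
If I read PyTorch's dispatch correctly, batch norm only goes through MIOpen when the cuDNN/MIOpen backend is enabled, so a possible (unverified on my side) way to keep BatchNorm2d while bypassing the failing kernel would be to disable that backend, at the cost of performance:

import torch

# Unverified assumption: on ROCm builds, torch.backends.cudnn gates the MIOpen backend,
# so disabling it should make torch.batch_norm fall back to the native implementation
# instead of the MIOpenBatchNormFwdTrainSpatial.cl kernel that fails to compile.
torch.backends.cudnn.enabled = False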

Simple code to reproduce the error:

import torch
import torch.nn as nn

# Make sure you're on ROCm and have a GPU available
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))

# Define a simple BatchNorm2d layer
bn = nn.BatchNorm2d(16).cuda()

# Create a random input tensor on GPU
x = torch.randn(8, 16, 32, 32, device="cuda")  # (N, C, H, W)

# Forward pass (this is where MIOpen may crash)
y = bn(x)

print("Output shape:", y.shape)
