pytorch-lightning 2.6.0
torch 2.9.1+rocm7.1.1.lw.git351ff442
torchaudio 2.9.0+rocm7.1.1.gite3c6ee2b
torchmetrics 1.8.2
torchvision 0.24.0+rocm7.1.1.gitb919bd0c
GPU: gfx1151
Python: 3.12.12
OS: Ubuntu 22.04 LTS
ROCm version: rocm-7.1.1
Everything else works fine on my box: llm.cpp, Ollama, vLLM, lemonade... I have trained YOLO and RF-DETR models without issues.
Now, when I train a model that uses BatchNorm (torch.nn.BatchNorm2d), it crashes. If I use torch.nn.GroupNorm instead of torch.nn.BatchNorm2d, training runs, but at CPU-like speed even though it is on the GPU.
Here is my error:
Using Python 3.12.12 environment at: /home/incognito/AI/pylaia-reloaded/.venv
Seed set to 74565
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
You are using a CUDA device ('AMD Radeon Graphics') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃ Mode ┃ FLOPs ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ 0 │ model │ LaiaCRNN │ 5.3 M │ train │ 0 │
│ 1 │ criterion │ CTCLoss │ 0 │ train │ 0 │
└───┴───────────┴──────────┴────────┴───────┴───────┘
Trainable params: 5.3 M
Non-trainable params: 0
Total params: 5.3 M
Total estimated model params size (MB): 21
Modules in train mode: 25
Modules in eval mode: 0
Total FLOPs: 0
TR - E0: 0%| | 0/2430 [00:00<?, ?it/s]<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.
MIOpen Error: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/hipoc/hipoc_program.cpp:299: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
[2025-12-05 10:46:54,430 CRITICAL laia] Uncaught exception:
Traceback (most recent call last):
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_exception.py", line 27, in exception_catcher
yield
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_module.py", line 130, in training_step
batch_y_hat = self.model(batch_x)
^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/models/htr/laia_crnn.py", line 116, in forward
x = self.conv(x)
^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/container.py", line 250, in forward
input = module(input)
^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/models/htr/conv_block.py", line 81, in forward
x = self.batchnorm(x)
^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/modules/batchnorm.py", line 193, in forward
return F.batch_norm(
^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/nn/functional.py", line 2813, in batch_norm
return torch.batch_norm(
^^^^^^^^^^^^^^^^^
RuntimeError: miopenStatusUnknownError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/incognito/AI/pylaia-reloaded/.venv/bin/pylaia-htr-train-ctc", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/scripts/htr/train_ctc.py", line 255, in main
run(**args)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/scripts/htr/train_ctc.py", line 181, in run
trainer.fit(engine_module, datamodule=data_module)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 584, in fit
call._call_and_handle_interrupt(
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in _fit_impl
self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1123, in _run_stage
self.fit_loop.run()
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 217, in run
self.advance()
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 465, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 153, in run
self.advance(data_fetcher)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 352, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 192, in run
self._optimizer_step(batch_idx, closure)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 270, in _optimizer_step
call._call_lightning_module_hook(
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 177, in _call_lightning_module_hook
output = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/core/module.py", line 1368, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/plugins/precision/precision.py", line 123, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/optimizer.py", line 517, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/optimizer.py", line 82, in _use_grad
ret = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/optim/rmsprop.py", line 157, in step
loss = closure()
^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/plugins/precision/precision.py", line 109, in _wrap_closure
closure_result = closure()
^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 131, in closure
step_output = self._step_fn()
^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 329, in _call_strategy_hook
output = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
return self.lightning_module.training_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/htr_engine_module.py", line 39, in training_step
result = super().training_step(batch, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_module.py", line 129, in training_step
with self.exception_catcher(batch):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/incognito/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
File "/home/incognito/AI/pylaia-reloaded/.venv/lib/python3.12/site-packages/laia/engine/engine_exception.py", line 29, in exception_catcher
raise EngineException(
laia.engine.engine_exception.EngineException: Exception "RuntimeError('miopenStatusUnknownError')" raised during epoch=0, global_step=0 with batch=['train/BL_Add_MS_21581_113_line_023', 'train/Hebrew_Samaritan_wu1y3qt9c6dgp.13_line_017', 'train/BL_Or_6461_109_line_013', 'train/CUL_MS_Add._713_0067_b_line_012', 'train/HCL_Ms._Sam._1_0097_a_line_009', 'train/BL_Or_6461_104_line_000', 'train/UBL_Vollers_1120_252_line_011', 'train/CTC_R.15.54_030_line_020', 'train/BNF_Sam.2_221_line_026', 'train/CBL_Ms._Heb_752_217_line_002', 'train/BL_Or_12269_059_line_025', 'train/HCL_Ms._Sam._1_0168_b_line_006', 'train/CUL_MS_Add._713_0192_a_line_005', 'train/NLI_Ms._Sam._86_382_line_024', 'train/NLR_Sam._IIA_64_027a_line_011', 'train/CUL_MS_Add._713_0185_a_line_007']
TR - E0: 0%| | 0/2430 [00:00<?, ?it/s]
Problematic code:
# Add Batch normalization
self.batchnorm = nn.BatchNorm2d(out_channels) if batchnorm else None
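For reference, the GroupNorm substitution I mentioned above looks roughly like the sketch below. This is not the actual PyLaia code, just a minimal standalone comparison; num_groups=8 is an arbitrary illustrative value, and my understanding (an assumption, not confirmed by the traceback) is that GroupNorm avoids the crash because it uses PyTorch's native kernels rather than the MIOpen batch-norm kernel that fails to compile:
import torch
import torch.nn as nn

out_channels = 16
# BatchNorm2d goes through MIOpenBatchNormFwdTrainSpatial.cl and crashes on this setup.
bn = nn.BatchNorm2d(out_channels).cuda()
# GroupNorm works here, but training throughput drops to CPU-like speed.
# num_groups is an arbitrary choice for illustration; it must divide out_channels.
gn = nn.GroupNorm(num_groups=8, num_channels=out_channels).cuda()

x = torch.randn(8, out_channels, 32, 32, device="cuda")
y = gn(x)
print("GroupNorm output shape:", y.shape)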
Simple code to reproduce the error:
import torch
import torch.nn as nn
# Make sure you're on ROCm and have a GPU available
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
# Define a simple BatchNorm2d layer
bn = nn.BatchNorm2d(16).cuda()
# Create a random input tensor on GPU
x = torch.randn(8, 16, 32, 32, device="cuda") # (N, C, H, W)
# Forward pass (this is where MIOpen may crash)
y = bn(x)
print("Output shape:", y.shape)