[bug]: Segmentation fault on image generation start (AMD) #3967

Open
redhelling21 opened this issue Jul 24, 2023 · 25 comments
Labels
bug Something isn't working

Comments

@redhelling21

Is there an existing issue for this?

  • I have searched the existing issues

OS

Linux

GPU

AMD

VRAM

8GB

What version did you experience this issue on?

3.0.0

What happened?

I tried installing via the automated installer and via manual installation. No matter what I try, when I click the "Invoke" button in the web GUI, I get a segmentation fault:

$ invokeai --web
[2023-07-24 23:32:06,280]::[InvokeAI]::INFO --> Patchmatch initialized
/home/hellong/.venv/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
warnings.warn(
INFO: Started server process [18287]
INFO: Waiting for application startup.
[2023-07-24 23:32:06,661]::[InvokeAI]::INFO --> InvokeAI version 3.0.0
[2023-07-24 23:32:06,661]::[InvokeAI]::INFO --> Root directory = /home/hellong/invokeai
[2023-07-24 23:32:06,662]::[InvokeAI]::INFO --> GPU device = cuda AMD Radeon RX 6700 XT
[2023-07-24 23:32:06,664]::[InvokeAI]::INFO --> Scanning /home/hellong/invokeai/models for new models
[2023-07-24 23:32:06,857]::[InvokeAI]::INFO --> Scanned 5 files and directories, imported 0 models
[2023-07-24 23:32:06,859]::[InvokeAI]::INFO --> Model manager service initialized
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:9090 (Press CTRL+C to quit)
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwH HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /socket.io/?EIO=4&transport=polling&t=Oc9qHwJ&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwK&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: ('127.0.0.1', 35066) - "WebSocket /socket.io/?EIO=4&transport=websocket&sid=ZXwRuIab-6GgOo1cAAAA" [accepted]
INFO: connection open
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwM&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /socket.io/?EIO=4&transport=polling&t=Oc9qHwW&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHx3&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHx5&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /api/v1/sessions/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "PUT /api/v1/sessions/50d99cec-2fc6-4e59-9219-f7e9d0dbf159/invoke?all=true HTTP/1.1" 202 Accepted
[2023-07-24 23:32:13,517]::[InvokeAI]::INFO --> Loading model /home/hellong/invokeai/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:tokenizer
[2023-07-24 23:32:13,747]::[InvokeAI]::INFO --> Loading model /home/hellong/invokeai/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:text_encoder
Segmentation fault (core dumped)

Screenshots

No response

Additional context

Using ROCm 5.4.2, as recommended by the official PyTorch website.
GPU : AMD Radeon 6700 XT

Contact Details

No response

@redhelling21 redhelling21 added the bug Something isn't working label Jul 24, 2023
@redhelling21 redhelling21 changed the title [bug]: Segmentation fault on starting image generation [bug]: Segmentation fault on image generation start (AMD) Jul 24, 2023
@puresick

puresick commented Jul 26, 2023

Same happening to me with an AMD Radeon 5500 XT with 8GB of VRAM.

Something similar also happened to me pre-3.0, but that issue was closed when the open issues were reset for the 3.0 release: #2894 (comment)

@tokenwizard

I'm also having this issue. About 5-10 seconds after clicking the Invoke button, the console shows the segfault.

Freshly installed using the install script on Linux and using the Analog-Diffusion model.

System Specs are below.

Here is potentially relevant dmesg output:

[Wed Jul 26 08:23:30 2023] invokeai-web[1009479]: segfault at 20 ip 00007f4e27ab40a7 sp 00007f4b5dff9290 error 4 in libamdhip64.so[7f4e27a00000+3f3000] likely on CPU 12 (core 4, socket 0)
[Wed Jul 26 08:23:30 2023] Code: 8d 15 5d 6d 25 00 48 8d 3d f6 6c 25 00 be 32 00 00 00 e8 dc ed 1f 00 e8 c7 ed 1f 00 48 8b 45 b8 48 8b 50 28 4c 8b 24 da 31 c0 <41> 80 7c 24 20 00 74 11 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d
[Wed Jul 26 08:59:06 2023] invokeai-web[1012831]: segfault at 20 ip 00007fbf1fcb40a7 sp 00007fbc55ff9290 error 4 in libamdhip64.so[7fbf1fc00000+3f3000] likely on CPU 9 (core 1, socket 0)
[Wed Jul 26 08:59:06 2023] Code: 8d 15 5d 6d 25 00 48 8d 3d f6 6c 25 00 be 32 00 00 00 e8 dc ed 1f 00 e8 c7 ed 1f 00 48 8b 45 b8 48 8b 50 28 4c 8b 24 da 31 c0 <41> 80 7c 24 20 00 74 11 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d

(screenshot of system specs)

@arvenig

arvenig commented Jul 26, 2023

I appear to have been experiencing this issue too: Linux, Radeon 6900 XT. A hopefully relevant detail: I was able to work around it by using torch 1.13.1+rocm5.2 and the corresponding torchvision 0.14.1+rocm5.2 that I still had from my working Invoke 2.3.5 install. After replacing torch 2.0 and torchvision with those older versions, Invoke 3.0 now seems to work as expected for me.
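For reference, a downgrade along the lines described above can be installed from PyTorch's ROCm 5.2 wheel index. This is a sketch based only on the versions named in the comment; adjust for your own environment and run it inside the InvokeAI virtualenv:

```shell
# Replace the torch 2.0 packages with the older ROCm 5.2 builds mentioned above.
# The index URL is PyTorch's official wheel repository for ROCm 5.2 builds.
pip install --force-reinstall \
    torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 \
    --extra-index-url https://download.pytorch.org/whl/rocm5.2
```

Note that downgrading torch under a newer InvokeAI release may hit version pins in its own requirements, so this is a workaround rather than a supported configuration.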

@Alex9001

I have the same problem.

OS: Artix Linux x86_64
GPU: AMD ATI Radeon RX 6600/6600 XT/6600M
CPU: AMD Ryzen 7 5800H

@arvenig

arvenig commented Aug 2, 2023

Was experiencing this issue on my Ryzen 7950X / Radeon 6900XT desktop system running Arch Linux. I seem to have worked around it by disabling the 7950X's iGPU in the BIOS. The GPU device reported by invokeai-web at startup, both with and without the iGPU enabled, is 'cuda AMD Radeon RX 6900 XT', but for whatever reason having the iGPU enabled seems to have been causing an issue. This issue has been present for me in all versions of Invoke since the update to torch 2.0. Tested on a fresh InvokeAI 3.0.1post3 install.

@Godd67

Godd67 commented Aug 5, 2023

Yep, same issue for me: the 2.3 version worked perfectly, 3.0.1post3 (fresh install) fails with a segfault.
RX 6600, Ubuntu 22.04, ROCm 5.6
[2023-08-05 19:00:28,561]::[uvicorn.access]::INFO --> 127.0.0.1:35828 - "PUT /api/v1/sessions/f3756076-a290-4c92-af83-28ccd8e881d4/invoke?all=true HTTP/1.1" 202
[2023-08-05 19:00:28,575]::[InvokeAI]::INFO --> Loading model /media/olegus/Extra/InvokeAi/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:tokenizer
[2023-08-05 19:00:28,873]::[InvokeAI]::INFO --> Loading model /media/olegus/Extra/InvokeAi/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:text_encoder
./invoke.sh: line 51: 99206 Segmentation fault (core dumped) invokeai-web $PARAMS

@Millu
Contributor

Millu commented Aug 7, 2023

Hey! Another person had similar issues with torch, and a fix seems to be rebuilding the environment with a lower torch version (similar to what @arvenig said!):

#4041 (comment)

@Godd67

Godd67 commented Aug 10, 2023

> Hey! Another person had similar issues with torch and a fix seems to be building a version of python with a different lower torch version (similar to what @arvenig said!):
>
> #4041 (comment)

Can someone explain in simple words how to achieve this? BTW, I use Python 3.10, as it was suggested for the previous InvokeAI version.

@YabbaYabbaYabba

I have the same issue: invoke.sh: line 51: 8792 Segmentation fault (core dumped) invokeai-web $PARAMS

@Jeremi360

Jeremi360 commented Aug 21, 2023

I have the same issue: ./invoke.sh: line 51: 4167 Segmentation fault (core dumped) invokeai-web $PARAMS

@Godd67

Godd67 commented Aug 25, 2023

Made it work with ROCm 5.4.2, an RX 6600, and kernel 5.19.
Followed this guide, starting from the Other Requirements section: https://phazertech.com/tutorials/rocm.html. I already had ROCm installed, so I can't comment on that part.
It seems the only difference from my previous attempts was this:
sudo apt install nvidia-cuda-toolkit

@YabbaYabbaYabba

Thank you!

@archer31

archer31 commented Oct 13, 2023

Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:

  • Downgrading torch and torchvision
    • This just results in the GPU not being detected anymore
  • Upgrading torch and torchvision
    • Same as above
  • Applying HSA_OVERRIDE_GFX_VERSION=10.3.0 to my profile
    • No appreciable changes

ROCM version 5.4.3
GPU: Radeon RX 7900 XTX
InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)

Edit: This appears to be an issue with ROCm support for the 7000 series of AMD GPUs. Not sure why these are still unsupported 9 months after they came out. Guess I'll just return this card and get an NVIDIA GPU :(.

@adeliktas

adeliktas commented Oct 15, 2023

I just installed InvokeAI 3.3.0 with ROCm in a Python 3.11 venv for an AMD 6600 XT and encountered the same issue when pressing the "Invoke" button on the web UI.

segfault at 20 ip 00007fd2142b40a7 sp 00007fcecfe91470 error 4 in libamdhip64.so[7fd214200000+3f3000]

pytorch-triton-rocm 2.0.2
torch 2.0.1+rocm5.4.2
torchvision 0.15.2+rocm5.4.2

.../InvokeAI/.venv/lib/python3.11/site-packages/triton/third_party/rocm/lib/libamdhip64.so
.../InvokeAI/.venv/lib/python3.11/site-packages/torch/lib/libamdhip64.so

Last frames from gdb:


[#6] 0x7fffad3c93e4 → hipLaunchKernel()
[#7] 0x7fffaf7b3a3b → at::native::index_select_out_cuda(at::Tensor const&, long, at::Tensor const&, at::Tensor&)::{lambda()#2}::operator()() const()
[#8] 0x7fffaf791d5a → at::native::index_select_out_cuda(at::Tensor const&, long, at::Tensor const&, at::Tensor&)()
[#9] 0x7fffaf7c947b → at::native::index_select_cuda(at::Tensor const&, long, at::Tensor const&)()

@takov751

> Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:
>
> • Downgrading torch and torchvision
>   • This just results in the gpu not being detected anymore
> • Upgrading torch and torchvision
>   • Same as above
> • Applying HSA_OVERRIDE_GFX_VERSION=10.3.0 to my profile
>   • No appreciable changes
>
> ROCM version 5.4.3
> GPU: Radeon RX 7900 XTX
> InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)
>
> Edit: This appears to be an issue with ROCM support for the 7000 series of AMD GPUs. not sure why these are still unsupported 9 months after they came out. guess ill just return this card and get an nvidia gpu :(.

In your case it should be HSA_OVERRIDE_GFX_VERSION=11.0.0
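To summarize the override values reported in this thread, here is an illustrative sketch. The mapping and the `gfx_override` helper are hypothetical, based only on what commenters here say works, not on an official ROCm support table:

```python
# Illustrative mapping from Radeon RX series to the HSA_OVERRIDE_GFX_VERSION
# value reported to work in this thread (hypothetical helper, not part of ROCm).
OVERRIDES = {
    "6000": "10.3.0",  # RDNA2, e.g. RX 6600 / 6700 XT / 6900 XT (gfx103x)
    "7000": "11.0.0",  # RDNA3, e.g. RX 7800 XT / 7900 XTX (gfx110x)
}

def gfx_override(series: str) -> str:
    """Return the override value this thread suggests for a given RX series."""
    return OVERRIDES[series]

print(gfx_override("7000"))  # → 11.0.0
```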

@adeliktas

adeliktas commented Oct 23, 2023

> Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:
>
> • Downgrading torch and torchvision
>   • This just results in the gpu not being detected anymore
> • Upgrading torch and torchvision
>   • Same as above
> • Applying HSA_OVERRIDE_GFX_VERSION=10.3.0 to my profile
>   • No appreciable changes
>
> ROCM version 5.4.3
> GPU: Radeon RX 7900 XTX
> InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)
> Edit: This appears to be an issue with ROCM support for the 7000 series of AMD GPUs. not sure why these are still unsupported 9 months after they came out. guess ill just return this card and get an nvidia gpu :(.
>
> In your case it should be HSA_OVERRIDE_GFX_VERSION=11.0.0

Setting the gfx override made InvokeAI run on my 6600 XT, but image generation glitches and returns an invalid image.
#4278
#4211
CUDA_VERSION=gfx1030 HSA_OVERRIDE_GFX_VERSION=10.3.0 invokeai-web

https://gist.github.com/adeliktas/669812e64fd356afc4648ba847c61133
torch version = 2.0.1+rocm5.4.2
cuda available = True
cuda version = None
device count = 1
cudart = <module 'torch._C._cudart'>
device = 0
capability = (10, 3)
name = AMD Radeon RX 6600 XT

@hchasens

hchasens commented Mar 7, 2024

I'm seeing this with my 7900 XTX.

@hchasens

hchasens commented Mar 7, 2024

So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.

export HIP_VISIBLE_DEVICES="0"

I found the best place to put it is in invokeai.sh right after the start of the venv.

. .venv/bin/activate

export INVOKEAI_ROOT="$scriptdir"
PARAMS=$@

export HIP_VISIBLE_DEVICES="0"

# Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI

This fixed my issue. I've found other programs that have the same issue: Autogen and text-generation-webui both have the same problem and solution.

Hope this has helped! It's a lot easier than phazertech's guide imo.
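The key detail behind this fix is ordering: ROCm reads HIP_VISIBLE_DEVICES when it initializes, so the variable has to be in the process environment before torch is imported. A minimal sketch (the torch import is only indicated in a comment, since the point here is the ordering, not the import itself):

```python
import os

# Hide every HIP device except device 0 (the discrete GPU) *before* the
# ROCm runtime initializes; setting this after `import torch` is too late,
# because the runtime enumerates devices at initialization time.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

# import torch  # torch.cuda.device_count() would now report only device 0

print(os.environ["HIP_VISIBLE_DEVICES"])
```

Putting the `export` in the launch script, as described above, guarantees this ordering for every run.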

@Alex9001

Alex9001 commented Mar 7, 2024

> So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.
>
> export HIP_VISIBLE_DEVICES="0"
>
> I found the best place to put it is in invokeai.sh right after the start of the venv.
>
> . .venv/bin/activate
>
> export INVOKEAI_ROOT="$scriptdir"
> PARAMS=$@
>
> export HIP_VISIBLE_DEVICES="0"
>
> # Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI
>
> This fixed my issue. I've found a programs that have the same issue. Autogen and Text-gen-webui both have the same problem and solution.
>
> Hope this has helped! It's a lot easier than phazertech's guide imo.

Very based.

@adeliktas

adeliktas commented Mar 17, 2024

After almost half a year, I decided to give it another try and was able to find my issue while writing this up.
I tried working with different env vars like HIP_VISIBLE_DEVICES="0" and ran two test scripts:

https://gist.github.com/adeliktas/669812e64fd356afc4648ba847c61133
https://gist.github.com/damico/484f7b0a148a0c5f707054cf9c0a0533

torch version = 2.2.1+rocm5.7
cuda available = True
cuda version = None
device count = 1
cudart = <module 'torch._C._cudart'>
device = 0
capability = (10, 3)
name = AMD Radeon RX 6600 XT
...
Everything fine! You can run PyTorch code inside of: 
--->  AMD Ryzen 9 3950X 16-Core Processor  
--->  gfx1032

I printed all env vars with the env command and surprisingly found that HSA_OVERRIDE_GFX_VERSION wasn't listed, even though echo $HSA_OVERRIDE_GFX_VERSION prints 10.3.0. I had set it universally with set -U HSA_OVERRIDE_GFX_VERSION 10.3.0 in fish, which doesn't export it to child processes; it is only shared between fish sessions.
A simple export HSA_OVERRIDE_GFX_VERSION=10.3.0 solved that.

PWD=/home/adeliktas/ai/invokeai_projects/InvokeAI
HSA_OVERRIDE_GFX_VERSION=10.3.0
INVOKEAI_ROOT=/home/adeliktas/ai/invokeai_projects/InvokeAI
HIP_VISIBLE_DEVICES=0
VIRTUAL_ENV_PROMPT=(InvokeAI)
_OLD_FISH_PROMPT_OVERRIDE=/home/adeliktas/ai/invokeai_projects/InvokeAI/.venv
VIRTUAL_ENV=/home/adeliktas/ai/invokeai_projects/InvokeAI/.venv
upstream InvokeAI version 4.0.0rc2 faa1ffb06fd4974c43be14a2119a1aab12b63038
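The distinction above can be checked directly: only exported variables show up in a child process's environment, which is what invokeai-web actually sees. A small sketch (the variable name is from this thread; the child process here just stands in for the real launcher):

```python
import os
import subprocess
import sys

# Simulate `export HSA_OVERRIDE_GFX_VERSION=10.3.0` in the parent shell.
# A fish universal variable (`set -U`) would NOT appear in the child below,
# because it is shared between fish sessions without being exported.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

# Spawn a child process (standing in for invokeai-web) and read the variable
# from its environment.
child_sees = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('HSA_OVERRIDE_GFX_VERSION'))"],
    capture_output=True, text=True,
).stdout.strip()

print(child_sees)  # the exported value is inherited by the child
```

In fish, `set -Ux HSA_OVERRIDE_GFX_VERSION 10.3.0` (note the `x` flag) would both persist the variable and export it.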

@Developer-42

> So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.
>
> export HIP_VISIBLE_DEVICES="0"
>
> I found the best place to put it is in invokeai.sh right after the start of the venv.
>
> . .venv/bin/activate
>
> export INVOKEAI_ROOT="$scriptdir"
> PARAMS=$@
>
> export HIP_VISIBLE_DEVICES="0"
>
> # Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI
>
> This fixed my issue. I've found a programs that have the same issue. Autogen and Text-gen-webui both have the same problem and solution.
>
> Hope this has helped! It's a lot easier than phazertech's guide imo.

Sadly, this doesn't work for me with my AMD Radeon RX 7800 XT. Also, the file name is invoke.sh, not invokeai.sh.

@takov751

takov751 commented Mar 28, 2024

> So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.
>
> export HIP_VISIBLE_DEVICES="0"
>
> I found the best place to put it is in invokeai.sh right after the start of the venv.
>
> . .venv/bin/activate
>
> export INVOKEAI_ROOT="$scriptdir"
> PARAMS=$@
>
> export HIP_VISIBLE_DEVICES="0"
>
> # Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI
>
> This fixed my issue. I've found a programs that have the same issue. Autogen and Text-gen-webui both have the same problem and solution.
> Hope this has helped! It's a lot easier than phazertech's guide imo.
>
> Sadly, this doesn't work for me with my AMD Radeon RX 7800 XT. Also, the file name is invoke.sh not invokeai.sh

Have you specified HSA_OVERRIDE_GFX_VERSION=11.0.0, since your GPU is a 7XXX series?

@Alex9001

Alex9001 commented Apr 3, 2024

I finally got around to trying export HIP_VISIBLE_DEVICES="0" ... and nothing happened. Just as before,

::[uvicorn.access]::INFO --> 127.0.0.1:38998 - "GET /api/v1/queue/default/list HTTP/1.1" 200
./invoke.sh: line 56: 9079 Segmentation fault invokeai-web $PARAMS

@hchasens

hchasens commented Apr 3, 2024

@Alex9001 This error message makes me think it might not be a ROCm issue. Nevertheless, it might be worth double-checking that your ROCm HIP runtime is up to date. I'm assuming the ROCm runtime is in your /opt/rocm/ folder? It might be worth checking that, along with your package manager, to see if there are any updates. Use some of the tools AMD ships with the runtime to make sure it's communicating with your hardware properly (maybe using rocminfo or the like). If your GPU is supported, you should see it listed.

@Serpentian

Placing export HSA_OVERRIDE_GFX_VERSION=11.0.0 right after venv activation in invoke.sh fixed the issue with AMD Radeon RX 7800 XT. Here's the source: #4211 (comment)
