Replies: 2 comments
-
This is the error it throws up with any distributed training enabled, with deepspeed installed or not.
-
Managed to build a Windows-installable whl, hopefully with sparse attention support enabled, from the slapped-together triton whl I mentioned in another post. I'm offering it up in hopes that it'll attract the attention of someone with more coding experience. It installs and doesn't break anything else in my setup, except for accelerate, which doesn't seem to be able to process a "yes" answer to the enable-deepspeed question on a Windows machine and crashes the process. I got around that by running the whole setup process in WSL and dropping the resulting config file into the huggingface cache folder, replacing the old one, but I've hit a new problem. Any time I try to launch any kind of distributed training with only multi-GPU enabled, even from a fresh barebones install of automatic1111 stable diffusion with nothing but its basic requirements installed, I get warnings on launch that I've traced back to accelerate still requesting NCCL when launching torch (see the gloo sketch at the end of this comment). Am I missing something here, or is distributed training of any kind just not supported on Windows at all?

Here's the whl for deepspeed: https://transfer.sh/eDLOMJ/deepspeed-0.8.0+cd271a4a-cp310-cp310-win_amd64.whl
and here's the whl for the triton it's using: https://transfer.sh/me0xpC/triton-2.0.0-cp310-cp310-win_amd64.whl
Pip install the triton whl by its full filename, copy the folder it creates from wherever it installs into an xformers folder (hopefully in the same parent folder), replacing any conflicting files. Then pip install the deepspeed whl. The rest is up to y'all 🤷🏻♂️
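
For anyone digging into the NCCL warnings: NCCL is Linux-only, and gloo is the torch.distributed backend that actually works on Windows. Below is a minimal sketch of asking accelerate to initialize the process group with gloo instead. It assumes an accelerate version where `InitProcessGroupKwargs` accepts a `backend` argument (older releases only exposed `init_method` and `timeout`), and I haven't verified it clears the warnings here.

```python
# Minimal sketch: force the gloo backend through accelerate on Windows.
# Assumes InitProcessGroupKwargs exposes a `backend` field (newer accelerate).
import torch

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Tell accelerate to initialize torch.distributed with gloo rather than the
# NCCL backend it defaults to for multi-GPU runs.
gloo_kwargs = InitProcessGroupKwargs(backend="gloo")

accelerator = Accelerator(kwargs_handlers=[gloo_kwargs])

if accelerator.is_main_process and torch.distributed.is_initialized():
    print("distributed backend:", torch.distributed.get_backend())

# The usual accelerate flow follows, e.g.:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Launching would still go through `accelerate launch --multi_gpu your_script.py` (script name is just a placeholder), which handles spawning one process per GPU.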