{2023.06}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + eb_hooks.py #35
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 |
New job on instance
|
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 |
New job on instance
|
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 |
New job on instance
|
e877c9e to 5befb75
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 |
New job on instance
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic |
New job on instance
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic accel:nvidia/cc90 |
New job on instance
|
The failure is:
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic accel:nvidia/cc90 |
New job on instance
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen3 accel:nvidia/cc80 |
New job on instance
|
bot: help |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
bot: help |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
New job on instance
|
Label |
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen3 accel:nvidia/cc80 |
Updates by the bot instance
|
New job on instance
|
Label |
The build process fails because of permission issues, as shown in the log:
|
Oh, crap, that's probably because of this bug that I introduced and which @ocaisa discovered (easybuilders/easybuild-framework#4959). Not sure if we can easily solve that without patching the EasyBuild installation...
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software arch:zen3 accel:nvidia/cc80 |
Updates by the bot instance
|
New job on instance
|
Error on the Ghent bot
|
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
bot: help instance:eessi-bot-vsc-ugent |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software arch:zen3 accel:nvidia/cc80 |
Updates by the bot instance
|
New job on instance
|
@TopRichard fixed the failing test-suite for the UGent bot and you can trigger that bot as well.
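For readers unfamiliar with the trigger comments used throughout this thread, the sketch below splits one into its action and its `key:value` filters. It is only an illustration of the comment format as it appears here; `parse_bot_command` is a hypothetical helper, not the bot's actual parser.

```python
def parse_bot_command(line):
    """Split a 'bot: <action> key:value ...' comment into (action, filters).

    Hypothetical helper based only on the comments visible in this thread.
    """
    prefix = "bot:"
    text = line.strip()
    if not text.startswith(prefix):
        raise ValueError("not a bot command: %r" % line)
    tokens = text[len(prefix):].split()
    if not tokens:
        raise ValueError("missing action in: %r" % line)
    action, filters = tokens[0], {}
    for token in tokens[1:]:
        # Each filter is written as key:value, e.g. arch:zen3
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError("expected key:value, got %r" % token)
        filters[key] = value
    return action, filters


action, filters = parse_bot_command(
    "bot: build instance:eessi-bot-vsc-ugent "
    "repo:eessi.io-2023.06-software arch:zen3 accel:nvidia/cc80"
)
print(action, filters["instance"], filters["accel"])
```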
)

ec['buildopts'] = [
    '--linkopt=-Wl,--disable-new-dtags --host_linkopt=-Wl,--disable-new-dtags --action_env=GCC_HOST_COMPILER_PATH=$EBROOTGCC/bin/gcc --host_action_env=GCC_HOST_COMPILER_PATH=$EBROOTGCC/bin/gcc --linkopt=-Wl,-rpath,$EBROOTCUDA/lib:$EBROOTCUDNN/lib:$EBROOTNCCL/lib --host_linkopt=-Wl,-rpath,$EBROOTCUDA/lib:$EBROOTCUDNN/lib:$EBROOTNCCL/lib',
This doesn't look correct to me. Why aren't you telling it to use our rpath wrappers, e.g., $(which gcc)?
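A minimal sketch of what this suggestion could look like, assuming the rpath wrappers come first in `PATH` and that the build command is run through a shell so `$(which gcc)` is expanded at build time. `gcc_host_compiler_opts` is a hypothetical helper name, not part of the PR.

```python
def gcc_host_compiler_opts(compiler_expr="$(which gcc)"):
    """Return Bazel options pointing GCC_HOST_COMPILER_PATH at compiler_expr.

    Hypothetical helper: the default is a shell expression, so it is resolved
    when the build command runs, not when the hook is evaluated.
    """
    flag = "GCC_HOST_COMPILER_PATH=" + compiler_expr
    return " ".join(["--action_env=" + flag, "--host_action_env=" + flag])


print(gcc_host_compiler_opts())
```

The resulting string would replace the two `$EBROOTGCC/bin/gcc` options in the quoted `buildopts` line; whether the wrappers are actually picked up this way would need to be confirmed in a build.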
)

if has_cuda:
    ec['preconfigopts'] = (
ec['preconfigopts'] = (
ec['preconfigopts'] = (
What if these are already set?
'sed -i \'s|--define=PREFIX=/usr|--define=PREFIX=\\$EESSI_EPREFIX|g\' .bazelrc && '
)

ec['buildopts'] = [
Same here, what if they are already set?
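One way to address both "already set" comments is to extend the existing value instead of overwriting it. The sketch below is an assumption about how that could be done, not the PR's code: `merge_opts` is a hypothetical helper, and a plain dict stands in for the easyconfig object `ec`.

```python
def merge_opts(existing, extra, sep=" "):
    """Combine extra with an option value that may already be set.

    existing may be None or empty, a string, or a list of strings;
    extra is a string that is appended to each existing entry.
    """
    if not existing:
        return extra
    if isinstance(existing, (list, tuple)):
        return [merge_opts(item, extra, sep) for item in existing]
    return existing + sep + extra


# Plain dict as a stand-in for the easyconfig; the values are illustrative only.
ec = {
    "preconfigopts": "export FOO=bar && ",
    "buildopts": ["--config=opt"],
}

# preconfigopts already ends with ' && ', so no extra separator is needed.
ec["preconfigopts"] = merge_opts(ec["preconfigopts"], "sed -i 's|a|b|g' .bazelrc && ", sep="")
ec["buildopts"] = merge_opts(ec["buildopts"], "--linkopt=-Wl,--disable-new-dtags")

print(ec["preconfigopts"])
print(ec["buildopts"])
```

With this approach, values already set by the easyconfig are kept and the hook's additions follow them.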

ec['postinstallcmds'] = [
    'mkdir -p %(installdir)s/bin',
    'ln -s $EBROOTCUDA/bin/cuobjdump %(installdir)s/bin/cuobjdump',
I'm not a fan of this, why is it necessary?
This PR uses a CUDA-ARM patch to work around the previously seen error:
On x86_64 with cc80:
CPU tests:
Executed 847 out of 847 tests: 847 tests pass.
GPU tests:
Executed 189 out of 189 tests: 189 tests pass.
On aarch64 with cc90:
CPU tests:
Executed 847 out of 847 tests: 847 tests pass.
GPU tests:
Executed 189 out of 189 tests: 188 tests pass and 1 fails locally.
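To compare these summaries across architectures, a small parser along the following lines could be used. It is a sketch that assumes the summary line format shown above and treats every executed test that did not pass as a failure; `parse_summary` is a hypothetical helper.

```python
import re

# Matches e.g. "Executed 189 out of 189 tests: 188 tests pass"
SUMMARY = re.compile(r"Executed (\d+) out of (\d+) tests?: (\d+) tests? pass")


def parse_summary(line):
    """Extract executed, total, passed, and derived failed counts from a summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("unrecognised summary line: %r" % line)
    executed, total, passed = (int(group) for group in match.groups())
    # Assumption: anything executed but not passed counts as failed.
    return {"executed": executed, "total": total,
            "passed": passed, "failed": executed - passed}


result = parse_summary("Executed 189 out of 189 tests: 188 tests pass and 1 fails locally")
print(result["passed"], result["failed"])
```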