Regression in supported interconnects #63

Open
ocaisa opened this issue Feb 1, 2021 · 6 comments

ocaisa commented Feb 1, 2021

I was looking at the UCX configuration in 2020.12 and noticed what looks like a regression. According to EESSI/compatibility-layer#49 (comment), we should have a UCX configuration like

configure: =========================================================
configure: UCX build configuration:
configure:       Build prefix:   /home/bob/ucx/inst
configure: Preprocessor flags:   -DCPU_FLAGS="|avx" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:         C compiler:   x86_64-pc-linux-gnu-gcc -O3 -g -Wall -Werror -mavx
configure:       C++ compiler:   x86_64-pc-linux-gnu-g++ -O3 -g -Wall -Werror -mavx
configure:       Multi-thread:   enabled
configure:          MPI tests:   disabled
configure:      Devel headers:   no
configure:           Bindings:   < >
configure:        UCT modules:   < ib rdmacm cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================

but in the build log for UCX (on Zen2) I see

configure: =========================================================
configure: UCX build configuration:
configure:       Build prefix:   /cvmfs/pilot.eessi-hpc.org/2020.12/software/x86_64/amd/zen2/software/UCX/1.8.0-GCCcore-9.3.0
configure: Preprocessor flags:   -DCPU_FLAGS="|avx" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:         C compiler:   gcc -O3 -g -Wall -Werror -mavx
configure:       C++ compiler:   g++ -O3 -g -Wall -Werror -mavx
configure:       Multi-thread:   enabled
configure:          MPI tests:   disabled
configure:      Devel headers:   no
configure:           Bindings:   < >
configure:        UCT modules:   < ib cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================

(note the missing rdmacm)

We should probably explicitly request what we expect from the final build (--with-rdmacm) so that configure fails when the feature is unavailable instead of silently building without it. UCX in particular is critical to the stack, so it could do with additional checks.

ocaisa commented Feb 1, 2021

To get it to use rdma inside the prefix layer, I needed to explicitly provide the path:

configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs --with-rdmacm=/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64/usr --with-sysroot=/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64'

Just setting --with-sysroot is not enough. This is probably why you get the support on some architectures and not on others: it depends on what is on the host.
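
To avoid repeating the long path, the same options could be assembled from a single prefix variable. This is only a sketch of an easyconfig fragment: the variable name `local_compat_prefix` is my own choice (hypothetical), and the resulting string is meant to be identical to the configopts above.

```python
# Hypothetical variable name; the value is the compat-layer prefix used in the comment above.
local_compat_prefix = '/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64'

configopts = ' '.join([
    '--enable-optimizations',
    '--enable-cma',
    '--enable-mt',
    '--with-verbs',
    # Point configure explicitly at the rdmacm installation inside the prefix.
    '--with-rdmacm=%s/usr' % local_compat_prefix,
    # On its own this flag was not enough to pick up rdmacm.
    '--with-sysroot=%s' % local_compat_prefix,
])
```

Keeping the prefix in one place means only one line changes when the pilot version is bumped.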

ocaisa commented Feb 1, 2021

This is not really an issue with the prefix layer, so I'm going to move it.

@ocaisa ocaisa transferred this issue from another repository Feb 1, 2021

ocaisa commented Feb 1, 2021

I also checked libfabric, and in its configure output I see

checking for sysroot... no

which I am also suspicious about.

bedroge commented Feb 10, 2021

I checked our 2020.10 installation, and it has the same issue and the same configuration output. The output in the comment at EESSI/compatibility-layer#49 (comment) is from a manual UCX installation where I did explicitly pass --with-rdmacm to configure, so we should somehow pass this to our UCX installation as well (using a hook?).
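
A rough sketch of what such a hook might look like, assuming EasyBuild's hooks mechanism (a `pre_configure_hook` function in the hooks file, with the easyblock instance passed as `self` and `self.cfg.update` appending to an existing option). The constant `COMPAT_PREFIX` is a name I made up for illustration, and the path is the 2020.12 one mentioned earlier in this thread.

```python
# Hypothetical constant; set this to the compatibility layer prefix in use.
COMPAT_PREFIX = '/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64'


def pre_configure_hook(self, *args, **kwargs):
    """Add --with-rdmacm to the UCX configure options before the configure step runs."""
    if self.name == 'UCX':
        # Leave easyconfigs alone that already request rdmacm themselves.
        if '--with-rdmacm' not in self.cfg['configopts']:
            # Appends the flag to the existing configopts string.
            self.cfg.update('configopts', '--with-rdmacm=%s/usr' % COMPAT_PREFIX)
```

With the flag passed explicitly, configure should error out when rdmacm cannot be found rather than quietly dropping the module.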

bedroge commented Feb 10, 2021

I see that the configure of both libfabric and UCX allow a --with-sysroot flag:

  --with-sysroot=DIR Search for dependent libraries within DIR
                        (or the compiler's sysroot if not specified).

As the compiler has been configured with --with-sysroot set to the prefix, I assume we don't necessarily have to use this flag for these packages.

bedroge commented Apr 7, 2021

This has been fixed in 2021.03:

configure:        UCT modules:   < ib rdmacm cma >

We should still have some (ReFrame?) test for this to make sure that UCX is always correctly configured in future versions, though, so let's leave this issue open so we don't forget about it.
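
The core of such a check could be as small as the sketch below, independent of any test framework. The function name `missing_uct_modules` and the default list of required modules are my assumptions, based on the summary line shown above. It only inspects the configure summary text from a build log.

```python
import re


def missing_uct_modules(configure_output, required=("ib", "rdmacm", "cma")):
    """Return the required UCT modules that the configure summary does not list."""
    match = re.search(r"UCT modules:\s*<([^>]*)>", configure_output)
    if match is None:
        # No summary line found at all: report every required module as missing.
        return set(required)
    found = set(match.group(1).split())
    return set(required) - found
```

For the Zen2 build log quoted at the top of this issue, this would return `{'rdmacm'}`, and an empty set for the 2021.03 line. A ReFrame test could wrap the same logic in its sanity check.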

trz42 referenced this issue in trz42/software-layer Mar 18, 2023
Sync branch for rebuilding `generic` with branch for NESSI/2022.11