
opal_lifo test hangs/crashes on specific riscv64 hardware (P550) #13762

@andreabolognani

Description



Background information

What version of Open MPI are you using?

main branch, commit 9479fd1

Describe how Open MPI was installed

Building from source.

Configured as follows, to closely match the Fedora package:

$ ./configure --prefix=/usr/lib64/openmpi --mandir=/usr/share/man/openmpi-riscv64 --includedir=/usr/include/openmpi-riscv64 --sysconfdir=/etc/openmpi-riscv64 --disable-silent-rules --enable-builtin-atomics --enable-ipv6 --enable-mpi1-compatibility --with-prrte=external --with-sge --with-valgrind --enable-memchecker --with-hwloc=/usr --with-libevent=external --with-pmix=external --disable-sphinx

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 53fce423d5d6b25798ed1f32837671dc55d0230d 3rd-party/openpmix (v5.0.10rc1-59-g53fce423)
 2d9b0aaaeea49a0e7850aed95e5ace9340c7d847 3rd-party/prrte (psrvr-v2.0.0rc1-5190-g2d9b0aaaee)
 6032f68dd9636b48977f59e986acc01a746593a6 3rd-party/pympistandard (remotes/origin/main-23-g6032f68)
 3064f7bd191b49a5a5554170ef7be4762246b5ee config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Fedora 43 (unofficial RISC-V port)
  • Computer hardware: SiFive HiFive Premier P550

Details of the problem

Open MPI builds fine on riscv64 and seems to work for the most part, but one specific test case crashes or hangs fairly reliably on my machine.

Example of a crash with little information displayed:

$ ./test/class/opal_lifo
Single thread test. Time: 0 s 11059 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 54061 us 54 nsec/poppush
[p550:728144] *** Process received signal ***
Segmentation fault         (core dumped) ./test/class/opal_lifo

Another crash, this time the output is more verbose:

$ ./test/class/opal_lifo
Single thread test. Time: 0 s 11031 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 54044 us 54 nsec/poppush
[p550:728100] *** Process received signal ***
[p550:728100] Signal: Segmentation fault (11)
[p550:728100] Signal code: Address not mapped (1)
[p550:728100] Failing at address: 0x20
[p550:728100] [ 0] linux-vdso.so.1(__vdso_rt_sigreturn+0x0) [0x7fff9715e800]
[p550:728100] [ 1] /home/abologna/src/upstream/openmpi/test/class/.libs/opal_lifo() [0x109a8]
[p550:728100] [ 2] /lib64/lp64d/libc.so.6(+0x507bc) [0x7fff96aa47bc]
[p550:728100] [ 3] /lib64/lp64d/libc.so.6(+0xa77a4) [0x7fff96afb7a4]
[p550:728100] *** End of error message ***
Segmentation fault         (core dumped) ./test/class/opal_lifo

Sometimes the test just hangs and I have to kill it after a while:

$ ./test/class/opal_lifo
Single thread test. Time: 0 s 11064 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 54065 us 54 nsec/poppush
Atomics thread finished. Time: 6 s 34713 us 6034 nsec/poppush
Atomics thread finished. Time: 6 s 67842 us 6067 nsec/poppush
Atomics thread finished. Time: 6 s 470528 us 6470 nsec/poppush
Atomics thread finished. Time: 6 s 475069 us 6475 nsec/poppush
Atomics thread finished. Time: 6 s 522750 us 6522 nsec/poppush
Atomics thread finished. Time: 6 s 742323 us 6742 nsec/poppush
Atomics thread finished. Time: 6 s 806098 us 6806 nsec/poppush
Atomics thread finished. Time: 6 s 810931 us 6810 nsec/poppush
^C

Unfortunately, the backtrace obtained by running the test under gdb doesn't seem very useful, at least to my eye:

(gdb) r
Starting program: /home/abologna/src/upstream/openmpi/test/class/.libs/opal_lifo
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/lp64d/libthread_db.so.1".
Single thread test. Time: 0 s 10829 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 53356 us 53 nsec/poppush
[New Thread 0x7ffff765a0a0 (LWP 728324)]
[New Thread 0x7ffff6e590a0 (LWP 728325)]
[New Thread 0x7ffff66580a0 (LWP 728326)]
[New Thread 0x7ffff5e570a0 (LWP 728327)]
[New Thread 0x7ffff56560a0 (LWP 728328)]
[New Thread 0x7ffff4e550a0 (LWP 728329)]
[New Thread 0x7ffff46540a0 (LWP 728330)]
[New Thread 0x7ffff3e530a0 (LWP 728331)]

Thread 5 "opal_lifo" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5e570a0 (LWP 728327)]
0x00000000000109a8 in thread_test ()
(gdb) t a a bt

Thread 9 (Thread 0x7ffff3e530a0 (LWP 728331) "opal_lifo"):
#0  0x00000000000109a2 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 8 (Thread 0x7ffff46540a0 (LWP 728330) "opal_lifo"):
#0  0x00000000000109a8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 7 (Thread 0x7ffff4e550a0 (LWP 728329) "opal_lifo"):
#0  0x00000000000109a8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 6 (Thread 0x7ffff56560a0 (LWP 728328) "opal_lifo"):
#0  0x00000000000109a8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 5 (Thread 0x7ffff5e570a0 (LWP 728327) "opal_lifo"):
#0  0x00000000000109a8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 4 (Thread 0x7ffff66580a0 (LWP 728326) "opal_lifo"):
#0  0x00000000000109a8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 3 (Thread 0x7ffff6e590a0 (LWP 728325) "opal_lifo"):
#0  0x00000000000109d8 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 2 (Thread 0x7ffff765a0a0 (LWP 728324) "opal_lifo"):
#0  0x00000000000109e2 in thread_test ()
#1  0x00007ffff791c7bc in start_thread (arg=<optimized out>) at pthread_create.c:448
#2  0x00007ffff79737a4 in __thread_start_clone3 () at ../sysdeps/unix/sysv/linux/riscv/clone3.S:71

Thread 1 (Thread 0x7ffff765b6e0 (LWP 728322) "opal_lifo"):
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/riscv/syscall_cancel.S:54
#1  0x00007ffff7919df8 in __internal_syscall_cancel (a1=a1@entry=140737344020848, a2=<optimized out>, a3=<optimized out>, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=4294967295, nr=nr@entry=98) at cancellation.c:49
#2  0x00007ffff791a0a0 in __futex_abstimed_wait_common64 (private=128, futex_word=0x7ffff765a170, expected=<optimized out>, op=265, abstime=0x0, cancel=true) at futex-internal.c:57
#3  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff765a170, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128, cancel=true) at futex-internal.c:87
#4  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff765a170, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at futex-internal.c:139
#5  0x00007ffff791e12a in __pthread_clockjoin_ex (threadid=140737344020640, thread_return=0x7fffffffe8f0, clockid=0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:108
#6  0x00007ffff7f05cda in opal_thread_join () from /home/abologna/src/upstream/openmpi/opal/.libs/libopen-pal.so.0
#7  0x0000000000010736 in main ()
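As a next step on my side, I'll try rebuilding with optimization off and full debug info, so that gdb can resolve the frames inside thread_test(). Something along these lines (the flags are just my guess at something reasonable, not the ones used for the Fedora build):

```shell
# Suggested debug rebuild; flags are illustrative, not the Fedora ones.
./configure CFLAGS='-O0 -g3'
make -j"$(nproc)"
gdb --args ./test/class/opal_lifo
```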

This issue occasionally affects the Fedora RISC-V build infrastructure as well: if you look at the list of builds for the package you will see that a couple of them ended up in failure because of this very error.

Notably, the failed builds all ran on P550 hardware like mine, while the successful ones ran on different, slower boards. My own StarFive VisionFive 2 is not set up for building packages right now, but I'll make an attempt there as soon as possible and report back. So it seems possible that the issue only manifests on this specific hardware for some reason, or perhaps the faster hardware simply results in the problem being hit more consistently.

That's all the information I can provide right now, but of course I'm happy to answer any questions and perform any tests that might be useful to track down the problem.

Cheers!
