Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- Ubuntu distribution
- AWS distribution (EFA)
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2023, Ubuntu 24
- Computer hardware: c7gn.16xlarge (and other Graviton2 or Graviton3 instance types)
- Network type: ENI (TCP)
Summary
We're seeing intermittent deadlocks in multi-threaded MPI applications on AWS Graviton that never occur on x86_64. All threads end up stuck in ompi_sync_wait_mt(). After analyzing the code, I believe this is due to missing memory barriers in the ARM64 atomic operations used by the wait_sync mechanism.
Environment
- Open MPI: v5.0.8 (also tested v4.1.6, same issue)
- Platform: AWS Graviton2 or Graviton3 (ARM64/aarch64)
- Compiler: GCC -O2
- Application: Multi-threaded MPI with concurrent operations from multiple threads
What we observe
When the deadlock occurs, backtraces show all ranks stuck in ompi_sync_wait_mt(). Some threads are spinning in opal_progress(), others are blocked on condition variables. The affected MPI operations include MPI_Ssend, MPI_Mprobe, and MPI_Waitall.
Unfortunately, there is no known minimal reproducer that triggers the deadlock deterministically. The issue appears intermittently in production-scale workloads with high thread contention, and it never occurs on x86_64 with the same code.
Analysis
Looking at the code, ompi_sync_wait_mt() implements a counting semaphore pattern:
The completion thread decrements sync->count (wait_sync.h:144):
```c
if (0 != (OPAL_THREAD_ADD_FETCH32(&sync->count, -updates))) {
    return;
}
WAIT_SYNC_SIGNAL(sync);
```
The waiting thread polls it (wait_sync.c:118-120):
```c
OPAL_THREAD_ADD_FETCH32(&num_thread_in_progress, 1);
while (sync->count > 0) {
    opal_progress();
}
```
The problem is in opal/include/opal/sys/arm64/atomic.h. The OPAL_ASM_MAKE_ATOMIC macro (lines 286-295) uses ldxr/stxr, which are atomic but lack acquire/release semantics. Additionally, the while (sync->count > 0) loop performs a plain C read with no acquire semantics.
On ARM64's weak memory model, this means:
- The completion thread's decrement, and the writes that precede it, can become visible to the waiting thread out of order
- The waiting thread can keep reading a stale value of sync->count
Interestingly, the same file's opal_atomic_swap_* implementations (lines 247-258) correctly use ldaxr/stlxr with acquire/release semantics, so there's an inconsistency within the file.
History
This appears to have been introduced in commit 7893248 (Nov 2017) when fetch-and-op atomics were added. It affects all v4.x and v5.x releases. There was a partial fix attempt in commit 5e13f02 (Jan 2021) that added atomic_llsc.h with proper barriers, but the OPAL_ASM_MAKE_ATOMIC macro wasn't updated.
Proposed fix
I think both of these changes are needed:
- Update OPAL_ASM_MAKE_ATOMIC to use ldaxr/stlxr (matching opal_atomic_swap_*)
- Change the plain read of sync->count in the polling loop to an atomic load with acquire semantics
References
- Commit 7893248 (introduced the issue)
- Commit 5e13f02 (partial fix attempt)