
Intermittent deadlock in ompi_sync_wait_mt on ARM64 (Graviton) #13761

@blkqi

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • v4.1.6
  • v5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • Ubuntu distribution
  • AWS distribution (EFA)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2023, Ubuntu 24
  • Computer hardware: c7gn.16xlarge (and other Graviton2 or Graviton3 instance types)
  • Network type: ENI (TCP)

Summary

We're seeing intermittent deadlocks in multi-threaded MPI applications on AWS Graviton that never occur on x86_64. All threads end up stuck in ompi_sync_wait_mt(). After analyzing the code, I believe this is due to missing memory barriers in the ARM64 atomic operations used by the wait_sync mechanism.

Environment

  • Open MPI: v5.0.8 (also tested v4.1.6, same issue)
  • Platform: AWS Graviton2 or Graviton3 (ARM64/aarch64)
  • Compiler: GCC -O2
  • Application: Multi-threaded MPI with concurrent operations from multiple threads

What we observe

When the deadlock occurs, backtraces show all ranks stuck in ompi_sync_wait_mt(). Some threads are spinning in opal_progress(), others are blocked on condition variables. The affected MPI operations include MPI_Ssend, MPI_Mprobe, and MPI_Waitall.

Unfortunately, there is no known minimal reproducer that triggers the deadlock deterministically. The issue appears intermittently in production-scale workloads with high thread contention, and it never happens on x86_64 with the same code.

Analysis

Looking at the code, ompi_sync_wait_mt() implements a counting semaphore pattern:

The completion thread decrements sync->count (wait_sync.h:144):

if (0 != (OPAL_THREAD_ADD_FETCH32(&sync->count, -updates))) {
    return;
}
WAIT_SYNC_SIGNAL(sync);

The waiting thread polls it (wait_sync.c:118-120):

OPAL_THREAD_ADD_FETCH32(&num_thread_in_progress, 1);
while (sync->count > 0) {
    opal_progress();
}

The problem is in opal/include/opal/sys/arm64/atomic.h. The OPAL_ASM_MAKE_ATOMIC macro (lines 286-295) uses ldxr/stxr, which are atomic but carry no acquire/release semantics. Additionally, the while (sync->count > 0) check in the waiting thread is a plain C read with no acquire semantics.

On ARM64's weak memory model, this means:

  • The completion thread's decrement might not be visible to the waiting thread
  • The waiting thread might read a stale cached value

Interestingly, the opal_atomic_swap_* implementations in the same file (lines 247-258) correctly use ldaxr/stlxr with acquire/release semantics, so the file is internally inconsistent.

History

This appears to have been introduced in commit 7893248 (Nov 2017) when fetch-and-op atomics were added. It affects all v4.x and v5.x releases. There was a partial fix attempt in commit 5e13f02 (Jan 2021) that added atomic_llsc.h with proper barriers, but the OPAL_ASM_MAKE_ATOMIC macro wasn't updated.

Proposed fix

I think both of these changes are needed:

  1. Update OPAL_ASM_MAKE_ATOMIC to use ldaxr/stlxr (matching opal_atomic_swap_*)
  2. Change the plain read to use an atomic load with acquire semantics

References

  • Commit 7893248 (introduced the issue)
  • Commit 5e13f02 (partial fix attempt)
