
Conversation

maia-s
Contributor

@maia-s maia-s commented Aug 28, 2025

This PR adds relaxed versions of all the atomic functions to SDL. This is useful for #13806 and also for apps in general for when you want to access an atomic without any other synchronization, which can be faster (e.g. on ARM). Relaxed functions are currently implemented for GCC, Clang and MSVC (ARM only), and fall back to the regular synced version if relaxed atomics aren't available on the current platform.

This hardcodes relaxed memory ordering, similar to how the current functions hardcode seqcst. As an alternative to this, we could add a memory ordering enum and functions that take that as an argument. That'd let users use acquire and release ordering too, which is often desirable.

@maia-s maia-s force-pushed the relaxed-atomics branch 2 times, most recently from 743b6ff to ffc4d29 Compare August 28, 2025 13:27
@maia-s maia-s force-pushed the relaxed-atomics branch 2 times, most recently from ed8d14b to 87aeb75 Compare August 28, 2025 13:57
@slouken
Collaborator

slouken commented Aug 28, 2025

I'm not sure these are worth adding a whole bunch of API entry points to SDL. @icculus, thoughts?

@maia-s
Contributor Author

maia-s commented Aug 28, 2025

Adding the memory ordering as arguments would be more generally useful to apps, but of course we'd have to add the enum too for that, and some logic for MSVC to choose the right functions
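
For illustration only, here is a rough sketch of what an ordering argument could look like. It is written against C11 `<stdatomic.h>` rather than SDL, and the `Demo_` names and the enum are hypothetical; nothing like this exists in SDL or in this PR. Each branch of the `switch` passes a constant ordering to the underlying operation, which is roughly the kind of selection logic a per-compiler backend would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical enum: not part of SDL. */
typedef enum DemoMemoryOrder {
    DEMO_ORDER_RELAXED,
    DEMO_ORDER_ACQUIRE,
    DEMO_ORDER_RELEASE,
    DEMO_ORDER_SEQ_CST
} DemoMemoryOrder;

/* Load with a caller-chosen ordering. */
static int Demo_AtomicLoad(atomic_int *a, DemoMemoryOrder order)
{
    assert(a != NULL);
    switch (order) {
    case DEMO_ORDER_RELAXED:
        return atomic_load_explicit(a, memory_order_relaxed);
    case DEMO_ORDER_ACQUIRE:
        return atomic_load_explicit(a, memory_order_acquire);
    default:
        /* Release is not a valid ordering for a load, so it is treated
           like SeqCst here (an arbitrary choice for this sketch). */
        return atomic_load_explicit(a, memory_order_seq_cst);
    }
}

/* Store with a caller-chosen ordering. */
static void Demo_AtomicStore(atomic_int *a, int value, DemoMemoryOrder order)
{
    assert(a != NULL);
    switch (order) {
    case DEMO_ORDER_RELAXED:
        atomic_store_explicit(a, value, memory_order_relaxed);
        break;
    case DEMO_ORDER_RELEASE:
        atomic_store_explicit(a, value, memory_order_release);
        break;
    default:
        /* Acquire is not a valid ordering for a store, so it is treated
           like SeqCst here. */
        atomic_store_explicit(a, value, memory_order_seq_cst);
        break;
    }
}
```

A caller would then write something like `Demo_AtomicStore(&flag, 1, DEMO_ORDER_RELEASE)` on one thread and `Demo_AtomicLoad(&flag, DEMO_ORDER_ACQUIRE)` on another.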

In general you usually want to use Acquire/Release for atomics, but it doesn't make a difference on x86 since strong ordering comes for free there

@icculus
Collaborator

icculus commented Aug 28, 2025

> @icculus, thoughts?

Isn't a large part of the reason for atomics to guarantee memory ordering? Help me understand the value of this.

@maia-s
Contributor Author

maia-s commented Aug 28, 2025

Yes, but the issue is that SDL currently only allows for one kind of memory ordering, the strongest one (sequentially consistent ordering, or SeqCst for short). This isn't really an issue on X86/X86-64, because those processors do SeqCst automatically, and so using SeqCst atomics doesn't cost anything extra on X86, but it can make a big difference on ARM/ARM64 and other processors. (Apple's M-processors have a feature to enable automatic SeqCst ordering in Rosetta 2 to make X86-64 emulation more efficient, but that can't be enabled for normal code AFAIK.)

Memory ordering is easiest to explain if you think about atomics as synchronizing TWO pieces of data:

  1. The atomic variable itself
  2. Other data. This is further split into:
    2a. Data written to before the atomic is accessed
    2b. Data read from after the atomic is accessed

The atomic variable itself is always synchronized, and you'll never get a partially synchronized value. Once a value is stored to an atomic variable, you'll get that value when you load that atomic variable in the same or another thread. This is the same for all memory orderings. Note that this does not on its own guarantee ordering of anything else with respect to that atomic, not even other atomics.

The SeqCst ordering synchronizes everything. This is the only ordering SDL supports today. When you access an atomic with SeqCst, it acts as a total memory barrier. This costs nothing more than usual on X86 as explained above, but can be expensive on other archs like ARM.

At the other extreme is the Relaxed ordering (I'll capitalize the orderings to distinguish them from the regular English words). Relaxed ordering synchronizes ONLY the atomic variable itself. Other memory is not synchronized at all. On ARM64, this makes a big difference: accessing a Relaxed atomic is exactly as efficient as accessing normal memory (on 32-bit ARM I think it's slightly less efficient, but still better than a full sync). With Relaxed ordering, atomic loads and stores can be reordered and even omitted if one is determined to be redundant, but the atomic variable itself is consistent in all threads. (In particular, suppose you use only Relaxed ordering and set atomic variable A before atomic variable B. If another thread reads B and sees that it has been set, a later read of A in that thread is not guaranteed to return the value that was, in code, written to A before the write to B, because the accesses might have been reordered. A will have either its old value or its new value, and the new value will sync once it has been set.)
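
As a small sketch of a case where Relaxed is enough, here is a counter where only the counter's own value matters. This uses C11 `<stdatomic.h>`, not the functions from this PR, and `counter_add`/`counter_read` are made-up names for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Add to a shared counter. The read-modify-write is atomic, so concurrent
   increments are never lost, but no other memory is ordered around it. */
static void counter_add(atomic_uint *counter, unsigned amount)
{
    assert(counter != NULL);
    atomic_fetch_add_explicit(counter, amount, memory_order_relaxed);
}

/* Read the counter. The result is always a value the counter actually held,
   never a torn or partial one. */
static unsigned counter_read(atomic_uint *counter)
{
    assert(counter != NULL);
    return atomic_load_explicit(counter, memory_order_relaxed);
}
```

Nothing here tells a reader of the counter anything about other data written by the incrementing threads; for that you need Acquire/Release or stronger.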

There are two other memory orderings of note, Acquire and Release, which work together. An atomic load with Acquire ordering synchronizes with an atomic store with Release ordering on the same atomic variable: once the load sees the stored value, anything written to memory before the store (2a) is available after the load (2b). This only holds when Acquire and Release are used like that on the same atomic variable. If you use Acquire on one atomic and Release on another, it doesn't mean anything with regards to synchronization.
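
A minimal sketch of that pairing, assuming C11 `<stdatomic.h>` and a hypothetical `Mailbox` type invented for this example: the payload is ordinary memory, and the flag is the atomic that publishes it. The sketch only covers publishing a single value once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Mailbox {
    int payload;       /* plain, non-atomic data (the "other data") */
    atomic_bool ready; /* the atomic variable that publishes it */
} Mailbox;

static void mailbox_init(Mailbox *m)
{
    assert(m != NULL);
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Producer: write the payload first, then Release-store the flag. */
static void mailbox_publish(Mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: Acquire-load the flag. If it is set, the payload written
   before the Release store is guaranteed to be visible. */
static bool mailbox_try_read(Mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        return false; /* nothing published yet */
    }
    *out = m->payload;
    return true;
}
```

If both operations were Relaxed instead, the consumer could see `ready` set and still read a stale `payload`.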

(There's another ordering called Consume, but it's been deprecated, so I won't talk about that)

Acquire/Release is usually what you want when you want to sync data using atomics. It's faster than SeqCst on non-X86 because it's just one memory read barrier and one memory write barrier instead of two full memory barriers.

Relaxed ordering is not useful for data synchronization (2), but it's still useful for synchronizing the atomic itself (1). In SDL, we could use this e.g. for accessing the main thread id in SetMainReady/IsMainThread, or for reading and initializing the first timestamp in GetTicks/NS. Those only need synchronization of the atomic itself, so the memory barriers don't do anything useful, and skipping them makes it as fast as a regular variable on ARM64 in particular and still much faster than SeqCst on other archs.
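
Here is a sketch of the "first timestamp" idea using only Relaxed operations. It assumes C11 `<stdatomic.h>`, the function name is hypothetical, and it is not how SDL's GetTicks is actually implemented; it also assumes 0 can be reserved to mean "not set yet".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the first timestamp ever stored in *start, installing `now`
   if nothing has been stored yet. 0 is reserved for "not set yet". */
static uint64_t first_timestamp(_Atomic(uint64_t) *start, uint64_t now)
{
    assert(start != NULL && now != 0);

    uint64_t seen = atomic_load_explicit(start, memory_order_relaxed);
    if (seen == 0) {
        /* Try to install our timestamp. If another thread won the race,
           the failed compare-exchange writes that thread's value into
           `seen`, so every caller ends up returning the same value. */
        if (atomic_compare_exchange_strong_explicit(
                start, &seen, now,
                memory_order_relaxed, memory_order_relaxed)) {
            seen = now;
        }
    }
    return seen;
}
```

No other memory depends on the timestamp, so there is nothing for a barrier to order.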

After thinking on it a bit, I think it'd be nice to expose the memory ordering as an argument for these functions instead of restricting it to Relaxed only like the PR does currently. Later on, the implementation of SDL itself could also benefit from this on ARM and other platforms by using Acquire/Release instead of SeqCst.

@maia-s
Contributor Author

maia-s commented Aug 29, 2025

I made a test program to demonstrate. You'll have to run this on ARM or other non-x86 arch to get meaningful results*. Compile with either USE_SEQ_CST, USE_ACQ_REL or USE_RELAXED defined.

On macOS with an M2 Pro, best out of 1000 runs:

  • USE_SEQ_CST: 911 917 ns
  • USE_ACQ_REL: 888 042 ns (not a big difference, but there's nothing to sync here)
  • USE_RELAXED: 485 333 ns (about half of USE_SEQ_CST)

(* Actually SEQ_CST is significantly slower than the other two on my Linux x86-64 laptop, but I'm not sure why)

#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>
#include <stdio.h>

#define ITERATIONS 1000000

#ifdef USE_SEQ_CST
#define LOAD_ORDERING __ATOMIC_SEQ_CST
#define STORE_ORDERING __ATOMIC_SEQ_CST
#elif defined(USE_ACQ_REL)
#define LOAD_ORDERING __ATOMIC_ACQUIRE
#define STORE_ORDERING __ATOMIC_RELEASE
#elif defined(USE_RELAXED)
#define LOAD_ORDERING __ATOMIC_RELAXED
#define STORE_ORDERING __ATOMIC_RELAXED
#else
#error "define one of USE_SEQ_CST, USE_ACQ_REL or USE_RELAXED"
#endif

static int atomic;

static int thread_fn(void* data) {
    (void)data;
    for (int i = 0; i < ITERATIONS; ++i) {
        __atomic_store_n(&atomic, i, STORE_ORDERING);
        __asm__ volatile(""); // prevent optimizing out redundant relaxed stores
    }
    return 0;
}

int main(int argc, char* argv[]) {
    (void)argc;
    (void)argv;

    if (!SDL_Init(0)) {
        fprintf(stderr, "SDL_Init failed: %s\n", SDL_GetError());
        return 1;
    }

    Uint64 t0 = SDL_GetTicksNS();

    SDL_Thread* thread = SDL_CreateThread(thread_fn, "store", NULL);
    if (!thread) {
        fprintf(stderr, "SDL_CreateThread failed: %s\n", SDL_GetError());
        SDL_Quit();
        return 1;
    }

    while (__atomic_load_n(&atomic, LOAD_ORDERING) != ITERATIONS - 1) {}

    Uint64 ns = SDL_GetTicksNS() - t0;
    printf("%llu ns\n", (unsigned long long)ns);

    SDL_DetachThread(thread);
    SDL_Quit();
    return 0;
}

@icculus
Collaborator

icculus commented Aug 29, 2025

Okay, I'm sold, this sounds useful.

@icculus
Collaborator

icculus commented Aug 29, 2025

If we're being honest, most if not all of our own internal uses only need Relaxed atomics, too, I suspect.

@maia-s
Contributor Author

maia-s commented Aug 29, 2025

Thanks! Do you want me to add an argument for the ordering so Acquire/Release can also be supported without more API symbols, or is this good as is? (I accidentally disabled atomic_load support for PS2 so I'll push a fix for that in a bit. update: also rebased to current main)

@icculus
Collaborator

icculus commented Sep 1, 2025

I think I wouldn't complicate it with the extra parameter (and if we want acquire/release later, we should add new symbols at that point too).

@nfries88

nfries88 commented Sep 3, 2025

Acquire and Release memory orders are the minimum required for spinlocks (locking is CAS-Acquire and unlocking is Store-Release) and several commonly used lockfree data structures can also get away without SeqCst but require more than Relaxed. While Relaxed is indeed useful for things like one-time initialization accesses it's pretty limited in utility elsewhere without also having the other memory orders. My recommendation would be to add them all.
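
To make the spinlock case concrete, here is a minimal sketch in C11 `<stdatomic.h>`. The `DemoSpinLock` type and function names are invented for the example and this is not SDL's spinlock; a real one would also want a pause/yield in the spin loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoSpinLock {
    atomic_int locked; /* 0 = free, 1 = held */
} DemoSpinLock;

static void demo_spin_init(DemoSpinLock *lock)
{
    assert(lock != NULL);
    atomic_init(&lock->locked, 0);
}

/* Locking is a CAS with Acquire on success: everything the previous
   holder wrote before unlocking is visible once we own the lock. */
static bool demo_spin_try_lock(DemoSpinLock *lock)
{
    assert(lock != NULL);
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void demo_spin_lock(DemoSpinLock *lock)
{
    while (!demo_spin_try_lock(lock)) {
        /* spin until the lock is released */
    }
}

/* Unlocking is a Store-Release: our writes inside the critical section
   are published to the next thread that acquires the lock. */
static void demo_spin_unlock(DemoSpinLock *lock)
{
    assert(lock != NULL);
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

With Relaxed in place of Acquire and Release, the lock word itself would still be consistent, but writes made inside the critical section could become visible outside it in the wrong order.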

@maia-s
Contributor Author

maia-s commented Sep 3, 2025

I can add those if Sam and Ryan want that.

Compare and swap would be a bit awkward without ordering arguments since it takes two, one for success and one for failure. E.g. you can use Release on success and Relaxed on failure so you don't pay for a sync if it's not needed. Tbf it'd be a bit awkward anyway since the orderings should be compile time constants or they may fall back to SeqCst.
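
As a sketch of the two-ordering case, here is an "atomic maximum" update in C11 `<stdatomic.h>` that uses Release on success and Relaxed on failure. `raise_max` is a made-up helper for illustration, not something from SDL or this PR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Raise *max to `candidate` if candidate is larger.
   Returns true if this call changed the stored value. */
static bool raise_max(atomic_int *max, int candidate)
{
    assert(max != NULL);

    int current = atomic_load_explicit(max, memory_order_relaxed);
    while (candidate > current) {
        /* Success: Release, so data written before this call is published
           along with the new maximum.
           Failure: Relaxed, because we only need the refreshed value of
           `current` to retry; no synchronization is paid for. */
        if (atomic_compare_exchange_weak_explicit(
                max, &current, candidate,
                memory_order_release, memory_order_relaxed)) {
            return true;
        }
        /* On failure (including spurious failure of the weak CAS),
           `current` now holds the latest value and the loop re-checks. */
    }
    return false;
}
```

Both orderings are written as constants at the call site, which is the form compilers handle best.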

@icculus
Collaborator

icculus commented Sep 4, 2025

My opinion is we hold off on that and just do the Relaxed version, but I'll defer to Sam on this.

@slouken slouken added this to the 3.6.0 milestone Sep 4, 2025
@slouken
Collaborator

slouken commented Sep 4, 2025

I'm going to bump this out to the 3.6 milestone where we can think about it in a relaxed manner. ;)

5 participants