Add relaxed atomics #13822
I'm not sure these are worth adding a whole bunch of API entry points to SDL. @icculus, thoughts?
Adding the memory ordering as arguments would be more generally useful to apps, but of course we'd have to add the enum too for that, and some logic for MSVC to choose the right functions. In general you usually want to use Acquire/Release for atomics, but it doesn't make a difference on x86 since strong ordering comes for free there.
Isn't a large part of the reason for atomics to guarantee memory ordering? Help me understand the value of this.
Yes, but the issue is that SDL currently only allows for one kind of memory ordering, the strongest one (sequentially consistent ordering, or SeqCst for short). This matters much less on x86/x86-64, because those processors have a strongly ordered memory model (TSO): ordinary loads and stores already behave like Acquire and Release there, so SeqCst costs little extra (only SeqCst stores need a locked instruction or a fence). It can make a big difference on ARM/ARM64 and other weakly ordered processors. (Apple's M-series processors have a feature to enable x86-style ordering in Rosetta 2 to make x86-64 emulation more efficient, but that can't be enabled for normal code AFAIK.) Memory ordering is easiest to explain if you think about atomics as synchronizing TWO pieces of data: (1) the atomic variable itself, and (2) the other memory written before or read after the atomic access.
The atomic variable itself is always synchronized, and you'll never get a partially written value. Once a value is stored to an atomic variable, you'll get that value when you load that atomic variable in the same or another thread. This is the same for all memory orderings. Note that this does not on its own guarantee ordering of anything else with respect to that atomic, not even other atomics.

The SeqCst ordering synchronizes everything. This is the only ordering SDL supports today. When you access an atomic with SeqCst, it acts as a total memory barrier. This is cheap on x86, as explained above, but can be expensive on other architectures like ARM.

At the other extreme is the Relaxed ordering (I'll capitalize the orderings to distinguish them from the regular English words). Relaxed ordering synchronizes ONLY the atomic variable itself. Other memory is not synchronized at all. On ARM64, this makes a big difference: a Relaxed atomic load or store is exactly as efficient as accessing normal memory (on 32-bit ARM I think it's slightly less efficient, but still better than a full sync). With Relaxed ordering, atomic loads and stores can be reordered and even omitted if one is determined to be redundant, but the atomic variable itself is consistent in all threads. (In particular, using only Relaxed ordering, if you set atomic variable A before atomic variable B, and another thread reads B and sees that it has been set, reading A afterwards is not guaranteed to return the value that was written to it before B in the code, because the accesses might have been reordered. A will have either its old value or its new value, though, and once it is set the new value will sync.)

There are two other memory orderings of note, Acquire and Release, which work together.
Atomic load operations with Acquire ordering synchronize with atomic store operations with Release ordering on the same atomic variable, such that anything written to memory before the atomic store operation (2a) is available after the atomic load operation (2b), but only when Acquire and Release are used like that on the same atomic variable. If you use Acquire on one atomic and Release on another, it doesn't mean anything with regard to synchronization. (There's another ordering called Consume, but it's been deprecated, so I won't talk about that.)

Acquire/Release is usually what you want when you want to sync data using atomics. It's faster than SeqCst on non-x86 because it's just one memory read barrier and one memory write barrier instead of two full memory barriers.

Relaxed ordering is not useful for data synchronization (2), but it's still useful for synchronizing the atomic itself (1). In SDL, we could use this e.g. for accessing the main thread id in SetMainReady/IsMainThread, or for reading and initializing the first timestamp in GetTicks/NS. Those only need synchronization of the atomic itself, so the memory barriers don't do anything useful, and skipping them makes it as fast as a regular variable on ARM64 in particular and still much faster than SeqCst on other architectures.

After thinking on it a bit, I think it'd be nice to expose the memory ordering as an argument for these functions instead of restricting it to Relaxed only like the PR does currently. Later on, the implementation of SDL itself could also benefit from this on ARM and other platforms by using Acquire/Release instead of SeqCst.
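A minimal sketch of the Acquire/Release pairing described above. It assumes C11 `<stdatomic.h>` and pthreads rather than SDL's API, and the names `payload`, `ready` and `run_message_passing` are hypothetical. The producer writes ordinary data (2a) and then publishes it with a Release store; the consumer spins on an Acquire load and only afterwards reads the data (2b).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;       /* ordinary data, the "other memory" (2) */
static atomic_int ready;  /* the atomic flag itself (1), zero-initialized */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;  /* (2a): written before the Release store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Returns the payload seen by the consumer, or -1 if the thread
 * could not be created. */
int run_message_passing(void)
{
    pthread_t thread;
    int seen;

    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    if (pthread_create(&thread, NULL, producer, NULL) != 0) {
        return -1;
    }

    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin until the flag has been published */
    }
    seen = payload;      /* (2b): read after the Acquire load */
    assert(seen == 42);  /* guaranteed by the Release/Acquire pair */

    pthread_join(thread, NULL);
    return seen;
}
```

With Relaxed on both sides the loop would still terminate, because the flag itself is always synchronized, but reading `payload` afterwards would be a data race.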
I made a test program to demonstrate. You'll have to run this on ARM or another non-x86 architecture to get meaningful results*. Compile with one of USE_SEQ_CST, USE_ACQ_REL or USE_RELAXED defined. On macOS with an M2 Pro, best out of 1000 runs:

(* Actually, SEQ_CST is significantly slower than the other two on my Linux x86-64 laptop, but I'm not sure why.)

```c
#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>
#include <stdio.h>

#define ITERATIONS 1000000

/* Pick the load/store orderings at compile time. */
#ifdef USE_SEQ_CST
#define LOAD_ORDERING __ATOMIC_SEQ_CST
#define STORE_ORDERING __ATOMIC_SEQ_CST
#elif defined(USE_ACQ_REL)
#define LOAD_ORDERING __ATOMIC_ACQUIRE
#define STORE_ORDERING __ATOMIC_RELEASE
#elif defined(USE_RELAXED)
#define LOAD_ORDERING __ATOMIC_RELAXED
#define STORE_ORDERING __ATOMIC_RELAXED
#else
#error "define one of USE_SEQ_CST, USE_ACQ_REL or USE_RELAXED"
#endif

/* Shared between both threads; only accessed through the __atomic builtins. */
static int atomic;

/* Writer thread: stores 0 .. ITERATIONS-1 with the selected ordering. */
static int thread_fn(void* data) {
    (void)data;
    for (int i = 0; i < ITERATIONS; ++i) {
        __atomic_store_n(&atomic, i, STORE_ORDERING);
        __asm__ volatile(""); // prevent optimizing out redundant relaxed stores
    }
    return 0;
}

int main(int argc, char* argv[]) {
    (void)argc;
    (void)argv;
    if (!SDL_Init(0)) {
        fprintf(stderr, "SDL_Init failed: %s\n", SDL_GetError());
        return 1;
    }
    Uint64 t0 = SDL_GetTicksNS();
    SDL_Thread* thread = SDL_CreateThread(thread_fn, "store", NULL);
    if (!thread) {
        fprintf(stderr, "SDL_CreateThread failed: %s\n", SDL_GetError());
        SDL_Quit();
        return 1;
    }
    // Reader: spin until the writer's final value becomes visible.
    while (__atomic_load_n(&atomic, LOAD_ORDERING) != ITERATIONS - 1) {}
    Uint64 ns = SDL_GetTicksNS() - t0; // includes thread start-up time
    printf("%llu ns\n", (unsigned long long)ns);
    SDL_DetachThread(thread);
    SDL_Quit();
    return 0;
}
```
Okay, I'm sold, this sounds useful.
If we're being honest, most if not all of our own internal uses only need Relaxed atomics, too, I suspect.
Thanks! Do you want me to add an argument for the ordering so Acquire/Release can also be supported without more API symbols, or is this good as is? (I accidentally disabled atomic_load support for PS2, so I'll push a fix for that in a bit. Update: also rebased to current main.)
I think I wouldn't complicate it with the extra parameter (and if we want acquire/release later, we should add new symbols at that point too).
Acquire and Release memory orders are the minimum required for spinlocks (locking is CAS-Acquire and unlocking is Store-Release), and several commonly used lock-free data structures can also get away without SeqCst but require more than Relaxed. While Relaxed is indeed useful for things like one-time initialization accesses, it's pretty limited in utility elsewhere without also having the other memory orders. My recommendation would be to add them all.
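The spinlock mentioned above (CAS-Acquire to lock, Store-Release to unlock) can be sketched as follows. This is an illustration only, assuming C11 `<stdatomic.h>` and pthreads; `demo_spinlock`, `demo_lock`, `demo_unlock` and `run_spinlock_demo` are hypothetical names and have nothing to do with SDL's own spinlock implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} demo_spinlock;

static void demo_lock(demo_spinlock *lock)
{
    bool expected = false;
    /* Acquire on success: everything the previous owner wrote before
     * unlocking is visible once we own the lock. Relaxed on failure. */
    while (!atomic_compare_exchange_weak_explicit(&lock->locked, &expected, true,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = false;  /* a failed CAS stores the observed value here */
    }
}

static void demo_unlock(demo_spinlock *lock)
{
    assert(atomic_load_explicit(&lock->locked, memory_order_relaxed));
    /* Release: publishes our writes to the next thread that locks. */
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}

enum { DEMO_THREADS = 4, DEMO_INCREMENTS = 20000 };

static demo_spinlock counter_lock;  /* zero-initialized: unlocked */
static long counter;                /* plain variable protected by the lock */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < DEMO_INCREMENTS; ++i) {
        demo_lock(&counter_lock);
        ++counter;
        demo_unlock(&counter_lock);
    }
    return NULL;
}

/* Returns the final counter value after all started workers have finished. */
long run_spinlock_demo(void)
{
    pthread_t threads[DEMO_THREADS];
    int started = 0;

    counter = 0;
    while (started < DEMO_THREADS &&
           pthread_create(&threads[started], NULL, worker, NULL) == 0) {
        ++started;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], NULL);
    }
    return counter;
}
```

If both operations used Relaxed instead, the lock flag itself would still be consistent, but the increments of `counter` would no longer be ordered with respect to it.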
I can add those if Sam and Ryan want that. Compare and swap would be a bit awkward without ordering arguments since it takes two, one for success and one for failure. E.g. you can use Release on success and Relaxed on failure so you don't pay for a sync if it's not needed. Tbf it'd be a bit awkward anyway, since the orderings should be compile-time constants or they may fall back to SeqCst.
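To illustrate the two orderings a compare-and-swap takes, here is a small sketch of a lock-free push that uses Release on success and Relaxed on failure. It assumes C11 `<stdatomic.h>`; `demo_node`, `demo_head`, `demo_push` and `demo_sum` are hypothetical names, and popping is deliberately left out because it needs extra care (the ABA problem).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct demo_node {
    int value;
    struct demo_node *next;
} demo_node;

static _Atomic(demo_node *) demo_head;  /* static storage: starts as NULL */

/* Push a node onto the front of the list. */
void demo_push(demo_node *node)
{
    demo_node *old_head;

    assert(node != NULL);
    old_head = atomic_load_explicit(&demo_head, memory_order_relaxed);
    do {
        node->next = old_head;  /* must be visible before the node is published */
    } while (!atomic_compare_exchange_weak_explicit(
                 &demo_head, &old_head, node,
                 memory_order_release,    /* success: publish the node's contents */
                 memory_order_relaxed));  /* failure: nothing published, just retry */
}

/* Walk the list; the Acquire load pairs with the Release in demo_push. */
int demo_sum(void)
{
    int sum = 0;
    demo_node *n = atomic_load_explicit(&demo_head, memory_order_acquire);

    for (; n != NULL; n = n->next) {
        sum += n->value;
    }
    return sum;
}
```

Both orderings are written as literal constants here, which is what lets the compiler pick the cheapest instruction sequence for each case.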
My opinion is we hold off on that and just do the Relaxed version, but I'll defer to Sam on this.
I'm going to bump this out to the 3.6 milestone where we can think about it in a relaxed manner. ;)
This PR adds relaxed versions of all the atomic functions to SDL. This is useful for #13806 and also for apps in general when you want to access an atomic without any other synchronization, which can be faster (e.g. on ARM). Relaxed functions are currently implemented for GCC, Clang and MSVC (ARM only), and fall back to the regular synced version if relaxed atomics aren't available on the current platform.
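As an illustration of accessing an atomic "without any other synchronization", here is a sketch of a lazily initialized start timestamp, similar in spirit to the GetTicks case mentioned in the discussion. It assumes C11 `<stdatomic.h>` and is not SDL's actual implementation; `demo_start_ns` and `demo_elapsed_ns` are hypothetical names, and the caller passes in the current time so the example stays deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t demo_start_ns;  /* 0 means "not initialized yet" */

/* Returns the time elapsed since the first call. Only the atomic itself
 * needs to be consistent between threads, so Relaxed is enough. */
uint64_t demo_elapsed_ns(uint64_t now_ns)
{
    uint64_t start;

    assert(now_ns != 0);  /* 0 is reserved as the "unset" marker */

    start = atomic_load_explicit(&demo_start_ns, memory_order_relaxed);
    if (start == 0) {
        uint64_t expected = 0;
        if (atomic_compare_exchange_strong_explicit(&demo_start_ns, &expected, now_ns,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed)) {
            start = now_ns;    /* we initialized it */
        } else {
            start = expected;  /* another thread won the race */
        }
    }
    /* A thread that lost the race may hold a slightly older timestamp. */
    return (now_ns > start) ? (now_ns - start) : 0;
}
```

No other memory depends on the timestamp having been written, so there is nothing for an Acquire or Release barrier to order.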
This hardcodes Relaxed memory ordering, similar to how the current functions hardcode SeqCst. As an alternative to this, we could add a memory ordering enum and functions that take that as an argument. That'd let users use Acquire and Release ordering too, which is often desirable.
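A rough sketch of what the enum-based alternative could look like. Everything here is hypothetical and is not part of SDL or of this PR: `demo_memory_order`, `demo_atomic_get` and `demo_atomic_set` are made-up names, and C11 `<stdatomic.h>` stands in for the per-compiler implementations. Orderings that don't apply to an operation (Release on a load, Acquire on a store) fall back to SeqCst in this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical ordering enum; not an SDL type. */
typedef enum {
    DEMO_ORDER_RELAXED,
    DEMO_ORDER_ACQUIRE,
    DEMO_ORDER_RELEASE,
    DEMO_ORDER_SEQ_CST
} demo_memory_order;

static inline int demo_atomic_get(atomic_int *a, demo_memory_order order)
{
    assert(a != NULL);
    /* When `order` is a compile-time constant and the call is inlined,
     * the compiler can fold this switch down to a single load. */
    switch (order) {
    case DEMO_ORDER_RELAXED:
        return atomic_load_explicit(a, memory_order_relaxed);
    case DEMO_ORDER_ACQUIRE:
        return atomic_load_explicit(a, memory_order_acquire);
    default:  /* SeqCst, or an ordering that is not valid for loads */
        return atomic_load_explicit(a, memory_order_seq_cst);
    }
}

static inline void demo_atomic_set(atomic_int *a, int value, demo_memory_order order)
{
    assert(a != NULL);
    switch (order) {
    case DEMO_ORDER_RELAXED:
        atomic_store_explicit(a, value, memory_order_relaxed);
        break;
    case DEMO_ORDER_RELEASE:
        atomic_store_explicit(a, value, memory_order_release);
        break;
    default:  /* SeqCst, or an ordering that is not valid for stores */
        atomic_store_explicit(a, value, memory_order_seq_cst);
        break;
    }
}
```

The trade-off raised in the discussion applies: if the ordering is only known at run time, an implementation like this has to branch or conservatively use the strongest ordering.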