> The important thing to remember is that each of these cannot be split into separate instructions.

Nitpick, but they absolutely can be split into several instructions, and this is the most common way it's implemented on RISC-like processors; also, single instructions aren't necessarily atomic.

The actual guarantee is that the entire operation (load, store, RMW, whatever) occurs in one “go” and no other thread can perform an operation on that variable during this atomic operation (it can’t write into the low byte of your variable as you read it).
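That no-tearing guarantee can be demonstrated with a small sketch (a hypothetical `observe_no_tearing` helper, not from the thread, assuming a mainstream 64-bit platform): a writer flips a 64-bit atomic between two full-width bit patterns while the reader checks that it only ever sees one pattern or the other, never a byte-level mix.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// A writer repeatedly stores two distinct full-width patterns; a concurrent
// reader must only ever observe one pattern or the other -- never a torn
// value mixing bytes of both.
inline bool observe_no_tearing() {
    std::atomic<std::uint64_t> value{0xAAAAAAAAAAAAAAAAull};
    std::atomic<bool> stop{false};
    bool torn = false;

    std::thread writer([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            value.store(0xAAAAAAAAAAAAAAAAull, std::memory_order_relaxed);
            value.store(0x5555555555555555ull, std::memory_order_relaxed);
        }
    });

    for (int i = 0; i < 100000; ++i) {
        std::uint64_t v = value.load(std::memory_order_relaxed);
        if (v != 0xAAAAAAAAAAAAAAAAull && v != 0x5555555555555555ull) {
            torn = true;  // would indicate a torn (partial) write
        }
    }
    stop.store(true, std::memory_order_relaxed);
    writer.join();
    return !torn;
}
```

With a plain (non-atomic) `uint64_t` the same test would be a data race, i.e. undefined behavior, and on some targets could actually observe torn values.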

It's probably best approximated by imagining that every atomic operation is just the normal operation wrapped in a mutex, but implemented in a much more efficient manner. Of course, with large enough types, atomic variables may well actually be implemented via a mutex.
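As a rough sketch of that mental model (a hypothetical `MutexInt`, purely illustrative): this behaves like `std::atomic<int>::fetch_add`, only slower and, as the follow-up comments note, with weaker progress guarantees.

```cpp
#include <mutex>

// The "mutex mental model" of an atomic read-modify-write: the whole
// operation happens under a lock, so no other thread can touch `value`
// partway through.
struct MutexInt {
    int value = 0;
    std::mutex m;

    int fetch_add(int delta) {
        std::lock_guard<std::mutex> lock(m);
        int old = value;   // read...
        value += delta;    // ...modify-write, all inside one critical section
        return old;
    }
};
```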

> It's probably best approximated by imagining that every atomic operation is just the normal operation wrapped in a mutex

Sort-of, but that's not quite right: if you imagine it as "taking a mutex on the memory", there's a possibility of starvation. Imagine two threads repeatedly "locking" the memory location to update it. With a mutex, it's possible that one of them gets starved, never getting to update the location, stalling indefinitely.

At least x86 (and I'm sure ARM and RISC-V as well) makes a much stronger progress guarantee than a mutex would: the operation is effectively wait-free. All threads are guaranteed to make progress in some finite amount of time; no one will be starved. At least, that's my understanding from reading much smarter people talking about the cache synchronization protocols of modern ISAs.

Given that, I think a better mental model is roughly something like the article describes: the operation might be slower under high contention, but not "blocking" slow, it is guaranteed to finish in a finite amount of time and atomically ("in one combined operation").
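The lock-free/wait-free distinction can be sketched in C++ (hypothetical helper names, not from the thread): a CAS retry loop is lock-free (some thread's CAS always succeeds, so the system as a whole progresses) but not wait-free (this particular thread can in principle retry forever if others keep winning), whereas `fetch_add` compiles to a single `LOCK XADD` on x86, which the hardware completes in bounded time.

```cpp
#include <atomic>

// Lock-free but not wait-free: an individual thread may loop indefinitely
// under adversarial contention, though the system always makes progress.
int cas_increment(std::atomic<int>& counter) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // On failure, `expected` is reloaded with the current value; retry.
    }
    return expected;  // value before our increment
}

// Wait-free in practice on x86: a single hardware-arbitrated instruction.
int xadd_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```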

x86 and modern ARM64 have single instructions for this, but original ARM and classic RISC approaches (load-linked/store-conditional) are literally a hardware-assisted polling loop. Unsure what guarantees they make; I shall look.
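That polling loop shows through in C++ via `compare_exchange_weak`: on LL/SC hardware it maps to one load-linked/store-conditional attempt and is therefore allowed to fail spuriously (the SC fails even though the value matched), which is why it is normally wrapped in a loop. A small sketch with a hypothetical `try_set_flag` helper:

```cpp
#include <atomic>

// Attempt to claim a flag (0 -> 1). compare_exchange_weak may fail
// spuriously on LL/SC architectures, so loop until we either succeed or
// observe that someone else genuinely took the flag.
bool try_set_flag(std::atomic<int>& flag) {
    int expected = 0;
    while (!flag.compare_exchange_weak(expected, 1)) {
        if (expected != 0) return false;  // genuinely taken by another thread
        // expected is still 0: the failure was spurious, just retry
    }
    return true;
}
```

On a single LL/SC attempt per iteration the compiler can emit a tight `ldxr`/`stxr`-style loop; `compare_exchange_strong` would instead hide the retry loop inside the operation itself.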

Definitely a good clarification though yeah, important

I got curious about this, because really the most important guarantee is the one in the C++ memory model, not in any of the ISAs: conforming compilers are required to fulfill it, and it's generally more restrictive than any of the platform guarantees. It's a little hard to parse, but if I'm reading section 6.9.2.3 of the C++23 standard (from here [1]) correctly, operations on atomics are only lock-free, not wait-free, and even that might be a high bar on certain platforms:

    Executions of atomic functions that are either defined to be lock-free 
    (33.5.10) or indicated as lock-free (33.5.5) are lock-free executions.

    When one or more lock-free executions run concurrently, at least one should 
    complete.

        [Note 3 : It is difficult for some implementations to provide absolute 
        guarantees to this effect, since repeated and particularly inopportune 
        interference from other threads could prevent forward progress, e.g., by 
        repeatedly stealing a cache line for unrelated purposes between load-locked 
        and store-conditional instructions. For implementations that follow this 
        recommendation and ensure that such effects cannot indefinitely delay progress 
        under expected operating conditions, such anomalies can therefore safely be 
        ignored by programmers. Outside this document, this property is sometimes 
        termed lock-free. — end note]

I'm guessing that note is for platforms like the ones you mention, where the underlying ISA makes this (more or less) impossible. I would assume that in modern versions of these ISAs, though, essentially everything in std::atomic is wait-free in practice.
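You can also ask the implementation which case you are in: `std::atomic<T>::is_always_lock_free` (C++17) reports at compile time whether atomics of that type are lock-free or fall back to an internal lock. A sketch, with `Big` as a hypothetical 32-byte payload and assuming a mainstream 64-bit target:

```cpp
#include <atomic>
#include <cstdint>

// Small integer types are typically lock-free; a large trivially-copyable
// struct usually is not, and gets an internal lock instead.
struct Big { std::uint64_t a, b, c, d; };

constexpr bool small_always_lock_free =
    std::atomic<std::uint64_t>::is_always_lock_free;  // true on x86-64/ARM64
constexpr bool big_always_lock_free =
    std::atomic<Big>::is_always_lock_free;            // false: 32 bytes
```

There is also the runtime query `x.is_lock_free()` for cases where alignment of a particular object matters.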

[1] https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/n4950.p...

My understanding is that the C++ memory model essentially pulled its concurrency semantics from its own imagination, and as a result the only architecture that actually maps to it cleanly is RISC-V, which explicitly decided to support it.

Really finicky ones, and the initial ones made none. I think for RISC-V it's something like: an ll/sc sequence of at most 16 instructions, with no other memory accesses inside it, is assured eventual progress.

That's to enable very minimal hardware implementations that can only track one cache line at a time.

Early Alpha CPUs (no idea about later ones) essentially had a single special register that asserted a mutex-style lock on a word-sized (64-bit) location for any read/write operation on the system bus.