x86 and modern ARM64 have dedicated instructions for this, but the original ARM and other RISC approaches (load-linked/store-conditional) are literally a hardware-assisted polling loop. I'm unsure what guarantees they make; I shall look.
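To make the difference concrete, here's a rough C++ sketch (the names `bump` and `bump_with_retry` are made up for illustration). The first function is a single atomic read-modify-write; on x86-64 that typically becomes one `lock xadd`, while on an LL/SC target the compiler has to emit something shaped like the second function. What you actually get depends on compiler and target flags, so check the disassembly rather than trusting this.

```cpp
#include <atomic>

// Direct form: one atomic read-modify-write. Returns the previous value.
long bump(std::atomic<long>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hand-written retry loop (hypothetical helper), roughly the shape an
// LL/SC target ends up with: read, compute, try to publish, retry on failure.
long bump_with_retry(std::atomic<long>& counter) {
    long seen = counter.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `seen` with the current value.
    while (!counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_relaxed)) {
        // Another thread got in first, or the attempt failed spuriously.
    }
    return seen;  // the value we replaced
}
```

Both return the old value and leave the counter one higher; the only difference is whether the retry lives in one instruction or in a visible loop.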

Definitely a good clarification though yeah, important

I got curious about this, because really the most important guarantee is the one in the C++ memory model, not in any of the ISAs: conforming compilers are required to fulfill it, and it's generally more restrictive than any of the platform guarantees. It's a little bit hard to parse, but if I'm reading section 6.9.2.3 of the C++23 standard (from here [1]) correctly, operations on atomics are only lock-free, not wait-free, and even that might be a high bar on certain platforms:

    Executions of atomic functions that are either defined to be lock-free 
    (33.5.10) or indicated as lock-free (33.5.5) are lock-free executions.

    When one or more lock-free executions run concurrently, at least one should 
    complete.

        [Note 3 : It is difficult for some implementations to provide absolute 
        guarantees to this effect, since repeated and particularly inopportune 
        interference from other threads could prevent forward progress, e.g., by 
        repeatedly stealing a cache line for unrelated purposes between load-locked 
        and store-conditional instructions. For implementations that follow this 
        recommendation and ensure that such effects cannot indefinitely delay progress 
        under expected operating conditions, such anomalies can therefore safely be 
        ignored by programmers. Outside this document, this property is sometimes 
        termed lock-free. — end note]

I'm guessing that note is for platforms like the ones you mention, where the underlying ISA makes this (more or less) impossible. I would assume, though, that on the modern versions of these ISAs essentially everything in std::atomic is wait-free in practice.
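For what it's worth, the standard library only lets you ask about lock-freedom; as far as I know there is no equivalent query for wait-freedom. A minimal sketch, assuming C++17 or later (`kIntAlwaysLockFree` and `is_lock_free_here` are hypothetical names):

```cpp
#include <atomic>

// Compile-time answer: true only if every std::atomic<int> object is
// lock-free on this target (C++17).
constexpr bool kIntAlwaysLockFree = std::atomic<int>::is_always_lock_free;

// Run-time answer for one particular object.
bool is_lock_free_here(const std::atomic<long long>& value) {
    return value.is_lock_free();
}
```

If `is_always_lock_free` is true for a type, `is_lock_free()` has to be true for every object of it; the reverse doesn't hold. Neither tells you whether an operation finishes in a bounded number of steps.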

[1] https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/n4950.p...

My understanding is that the authors of the C++ memory model essentially pulled the concurrency bit from their own imagination, and as a result the only architecture that actually maps to it is RISC-V, which explicitly decided to support it.

Really finicky ones, and the initial ones made none. I think for RISC-V it's something like this: an ll/sc sequence is only assured forward progress if it covers at most 16 instructions and contains no other memory accesses.

That's to enable very minimal hardware implementations that can only track one line at a time.