> Apple CPU's are even optimized for Apple software GC calls like Retain/Release objects.

I assume this is referring to the tweet from the launch of the M1 showing off that retaining and releasing an NSObject is like 3x faster. That's more of a general case of the ARM ISA being a better fit for modern software than x86, not some specific optimization for Apple's software.

x86 was designed long before desktops had multi-core processors and out-of-order execution, so for backwards compatibility reasons the architecture severely restricts how the processor is allowed to reorder memory operations. ARM was designed later, and requires software to explicitly request synchronization of memory operations where it's needed, which is much more performant and a closer match for the expectations of modern software, particularly post-C/C++11 (which have a weak memory model at the language level).

Reference counting operations are simple atomic increments and decrements, and when your software uses these operations heavily (like Apple's does), it can benefit significantly from running on hardware with a weak memory model.

> I assume this is referring to the tweet from the launch of the M1 showing off that retaining and releasing an NSObject is like 3x faster. That's more of a general case of the ARM ISA being a better fit for modern software than x86, not some specific optimization for Apple's software.

It's not really even the ISA, mainly the implementation. Atomics on Apple cores are 3x faster than Intel (18 cycles back to back latency vs 6). AMD's atomics have 6 cycle latency.