1. Apple’s optimizations are one point in their favor. XNU is good, and Apple’s memory management is excellent.

2. X86 micro-ops vs ARM decode are not equivalent. X86’s variable length instructions make the whole process far more complicated than it is on something like ARM. This is a penalty due to legacy design.

3. The OP was talking about M1. AFAIK, M4 is now 10-wide, and most x86 is 6-wide (Ryzen 5 does some weird stuff). X86 was 4-wide at the time of M1’s introduction.

4. M1 has over 600 reorder buffer registers… it’s significantly larger than competitors.

5. Close relative to x86 competitors.

> 4. M1 has over 600 reorder buffer registers… it’s significantly larger than competitors.

And? Are you saying neither Intel nor AMD engineers were able to determine that this was a bottleneck worth chasing? The point was, anybody could add more cache, rename, reorder or whatever buffers they wanted to... it's not Apple secret-sauce.

If all the competition knew they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's on them. They got overtaken by a competitor with a better offering.

If all the competition didn't realize they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's also on them. They got overtaken by a competitor with better offering AND more effective engineers.