1. Apple’s optimizations are one point in their favor. XNU is good, and Apple’s memory management is excellent.
2. X86 micro-ops vs ARM decode are not equivalent. X86’s variable-length instructions make the whole process far more complicated than it is on something like ARM. This is a penalty due to legacy design.
3. The OP was talking about M1. AFAIK, M4 is now 10-wide, and most x86 is 6-wide (Zen 5 does some weird stuff). X86 was 4-wide at the time of M1’s introduction.
4. M1 has over 600 reorder buffer entries… it’s significantly larger than competitors.
5. Close relative to x86 competitors.
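
To make point 2 a bit more concrete, here is a rough sketch of why variable-length encodings are harder to decode in parallel. It is a toy model, not real x86: the "first byte holds the instruction length" encoding and both function names are made up for illustration. With fixed-width instructions every start offset is known up front, so each decoder can be pointed at its own slot. With variable-length instructions, the start of instruction N depends on the lengths of all the instructions before it.

```python
# Toy illustration only: the "length in the first byte" encoding below is
# hypothetical and far simpler than real x86 length decoding.

def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Start offsets are known without looking at any instruction bytes."""
    return list(range(0, len(code), width))


def variable_length_starts(code: bytes) -> list[int]:
    """Each start offset depends on the length of every earlier instruction."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]  # hypothetical: first byte is the total length
        if length == 0:
            raise ValueError(f"zero-length instruction at offset {offset}")
        offset += length  # serial dependency: the next start needs this length
    return starts


fixed = bytes(16)  # four 4-byte instructions
variable = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 5, 1, 2, 3, 4])

print(fixed_width_starts(fixed))         # [0, 4, 8, 12]
print(variable_length_starts(variable))  # [0, 1, 4, 6]
```

The loop in the second function is the part that wide hardware has to break up somehow, which is the extra decode complexity being described.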
> 4. M1 has over 600 reorder buffer entries… it’s significantly larger than competitors.
And? Are you saying neither Intel nor AMD engineers were able to determine that this was a bottleneck worth chasing? The point was, anybody could add more cache, rename, reorder or whatever buffers they wanted to... it's not Apple secret-sauce.
If all the competition knew they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's on them. They got overtaken by a competitor with a better offering.
If all the competition didn't realize they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's also on them. They got overtaken by a competitor with a better offering AND more effective engineers.