This seems mostly misinformed.
1) Apple Silicon outperforms all laptop CPUs in the same power envelope in single-threaded (1T) performance on industry-standard tests: it's not predominantly due to "optimizing their software stack". SPECint, SPECfp, Geekbench, Cinebench, etc. all show major improvements.
2) x86 also heavily relies on micro-ops to greatly improve performance. This is not a "penalty" in any sense.
3) x86 is now six-wide, eight-wide, or nine-wide (with asterisks) for decode width on all major Intel & AMD cores. The myth of x86 being stuck on four-wide has been long disproven.
4) Large buffers, L1/L2/L3 caches, etc. are not exclusive to any CPU microarchitecture. Anyone can increase them; the question is how much your core actually benefits from the larger structures.
5) Ryzen AI Max 300 (Strix Halo) gets nowhere near Apple on 1T perf / W and still loses on 1T perf. Strix Halo uses slower CPUs versus the beastly 9950X below:
Fanless iPad M4 P-core, SPEC2017 int / fp / geomean: 10.61 / 15.58 / 12.85
AMD 9950X (Zen 5), SPEC2017 int / fp / geomean: 10.14 / 15.18 / 12.41
Intel 285K (Lion Cove), SPEC2017 int / fp / geomean: 9.81 / 12.44 / 11.05
Source: https://youtu.be/2jEdpCMD5E8?t=185, https://youtu.be/ymoiWv9BF7Q?t=670
The 9950X & 285K eat 20W+ per core for that 1T perf; the M4 uses ~7W. Apple has a node advantage, but no node on Earth gives you 50% less power.
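A quick back-of-the-envelope with those numbers (treating "20W+" as a flat 20 W per core, so the ratios are rough):

```python
# Geomean and perf/W sketch from the SPEC2017 scores quoted above.
# The per-core wattages are the approximate figures from this thread,
# not measurements -- treat the pts/W column as ballpark only.
from math import sqrt

chips = {
    "Apple M4 P-core":        {"int": 10.61, "fp": 15.58, "watts": 7.0},
    "AMD 9950X (Zen 5)":      {"int": 10.14, "fp": 15.18, "watts": 20.0},
    "Intel 285K (Lion Cove)": {"int": 9.81,  "fp": 12.44, "watts": 20.0},
}

for name, c in chips.items():
    geomean = sqrt(c["int"] * c["fp"])  # matches the geomeans quoted above
    print(f"{name}: geomean {geomean:.2f}, {geomean / c['watts']:.2f} pts/W")
# -> M4 lands around 1.8 pts/W vs ~0.6 for the desktop cores: roughly 3x.
```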
There is no contest.
1. Apple’s optimizations are one point in their favor. XNU is good, and Apple’s memory management is excellent.
2. x86 micro-ops vs ARM decode are not equivalent. x86's variable-length instructions make the whole decode process far more complicated than it is on something like ARM (see the toy sketch after this list). This is a penalty due to legacy design.
3. The OP was talking about M1. AFAIK, M4 is now 10-wide, and most x86 is 6-wide (Zen 5 does some weird stuff). x86 was 4-wide at the time of M1's introduction.
4. M1 has over 600 reorder buffer entries… it's significantly larger than competitors'.
5. It's close, relative to x86 competitors.
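On point 2, here's a toy Python sketch of why variable-length decode parallelizes worse than fixed-length (the opcode-to-length table is made up, not real x86 encoding):

```python
# Fixed-width ISAs (e.g. AArch64) know every instruction boundary up front;
# variable-length ISAs must find boundaries serially, because where
# instruction N+1 starts depends on the length of instruction N.
LENGTHS = {0x90: 1, 0x0F: 2, 0xB8: 5, 0xE8: 5}  # hypothetical opcode->length

def boundaries_fixed(buf, width=4):
    # Every decoder slot can independently grab bytes [width*i : width*(i+1)].
    return list(range(0, len(buf), width))

def boundaries_variable(buf):
    # Serial chain: real x86 cores break it with pre-decode bits, length
    # speculation, and uop caches -- all of which cost area and power.
    pos, starts = 0, []
    while pos < len(buf):
        starts.append(pos)
        pos += LENGTHS.get(buf[pos], 1)
    return starts

code = bytes([0x90, 0xB8, 0, 0, 0, 0, 0xE8, 0, 0, 0, 0, 0x90])
print(boundaries_fixed(code))     # [0, 4, 8] -- independent, parallel-friendly
print(boundaries_variable(code))  # [0, 1, 6, 11] -- each step waits on the last
```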
> 4. M1 has over 600 reorder buffer entries… it's significantly larger than competitors'.
And? Are you saying neither Intel nor AMD engineers were able to determine that this was a bottleneck worth chasing? The point was, anybody could add more cache, rename, reorder, or whatever buffers they wanted to... it's not Apple secret sauce.
If all the competition knew they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's on them. They got overtaken by a competitor with a better offering.
If all the competition didn't realize they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's also on them. They got overtaken by a competitor with a better offering AND more effective engineers.
>> x86 is now six-wide, eight-wide, or nine-wide (with asterisks) for decode width on all major Intel & AMD cores. The myth of x86 being stuck on four-wide has been long disproven.
From the AMD side it was 4-wide until Zen 5. And it's still 4-wide now, but there is a separate 4-wide decoder for each thread. The micro-op cache can deliver a lot of pre-decoded instructions, so the issue width is (I dunno) wider, but the decode width is still 4.
2. uops are a cope that costs. That uop cache and its controller use tons of power. ARM designs with 32-bit support had a uop cache, but they cut it when going to 64-bit-only designs (compare the Arm Cortex-A715 vs the A710), which dramatically reduced frontend size and power consumption.
3. The claim was never "stuck on 4-wide", but that going wider would incur significant penalties, which is the case. AMD uses two 4-wide decoders and pays a big penalty in complexity trying to keep them coherent and occupied. Intel went 6-wide for Golden Cove, which is infamous for being the largest and most power-hungry x86 design in a couple of decades. This seems to prove the 4-wide people right.
4. This is only partially true. The ISA impacts which designs make sense, which then impacts cache size. A uop cache can affect L1 I-cache size. Page size and cache line size also affect L1 cache sizes (see the worked example below). Target clock speeds and cache latency also affect which cache sizes are viable.
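On the page-size point: assuming the common VIPT L1 design (index bits must fit inside the page offset to avoid aliasing), the maximum L1 size is page size times associativity, which is part of why Apple's 16 KB pages allow much bigger L1s than x86's 4 KB pages. The associativities below are illustrative:

```python
# VIPT constraint sketch: sets * line_size <= page_size, so
# max cache size = page_size * ways.
def max_vipt_l1_kb(page_kb, ways):
    return page_kb * ways

print(max_vipt_l1_kb(4, 8))   # x86, 4 KB pages, 8-way    -> 32 KB (typical L1D)
print(max_vipt_l1_kb(16, 8))  # Apple, 16 KB pages, 8-way -> 128 KB (M1-class L1D)
```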
> 2) x86 also heavily relies on micro-ops to greatly improve performance. This is not a "penalty" in any sense.
It's an energy penalty, even if wall clock time improves.
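Made-up numbers, but they show how that can happen, since energy is power times time:

```python
# Energy = power * time: a frontend structure can cut wall clock time
# and still raise total energy if it raises power draw by more.
# These figures are purely illustrative, not measurements.
base_power_w, base_time_s = 5.0, 10.0  # hypothetical core without a uop cache
uop_power_w,  uop_time_s  = 7.0,  8.0  # 20% faster, but a hungrier frontend

print(base_power_w * base_time_s)  # 50.0 J
print(uop_power_w * uop_time_s)    # 56.0 J -> faster, yet 12% more energy
```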
A whole lot of bluster in this thread, but finally someone who's actually doing their research chimes in. Thank you for giving me a place to start in understanding why this is such a deep mystery!