Transistors aren’t free. x86 cores are consistently bigger than their ARM competitors on the same node, while also delivering worse performance per watt.
The translation layers cost time and money to build that could instead be spent making the rest of the chip faster. They take up extra die area and burn power.
ARM’s total income in 2024 was half of AMD’s R&D budget, yet the core ARM finished that year delivers x86 desktop levels of performance in a phone.
The idea that there’s no cost to x86 just doesn’t seem to hold up under even mild scrutiny.
Yes, that’s why I said almost free. But that said, as I understand it, the x86 decoders aren’t that much of the die area of a modern design. Most of it is L1 cache, GPUs, neural engines, etc., which makes simple die area comparisons of modern processors a bit useless for this particular question. You’re really comparing all that other stuff. To be clear, I’m squarely on the RISC side of any “debate,” but it was interesting to watch how the CISC designs evolved in the early 2000s to maintain their relevance.
The “no cost” claim is an evidence-free zone.
ARM shed 75% of its decoder size when it removed 32-bit ISA support from its A-series cores, and ARM decode is nowhere near as bad as x86.
A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder. That’s 22% of total power. Even if you say that’s the high-water mark and it’s usually half of that, it’s still a massive consideration.
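For what it's worth, the fraction itself checks out (this is just arithmetic on the figures quoted above):

```python
# Decoder power as a fraction of total core power,
# using the 4.8w / 22.1w figures quoted above.
decoder_w = 4.8
total_w = 22.1
fraction = decoder_w / total_w
print(f"{fraction:.1%}")  # ~21.7%, i.e. roughly 22%
```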
If decoder power and size weren’t an issue, they’d have moved past 4-wide decode years ago. Instead, Golden Cove only recently went 6-wide and power consumption went through the roof. Meanwhile, AMD went with a ludicrously complex dual 4-wide decoder setup that limits throughput under some circumstances and creates all kinds of headaches and hazards that must be dealt with.
Nobody would do this unless the cost of scaling x86 decode were immense. Otherwise, they’d just slap on another decoder the way ARM or RISC-V designs do.
> A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder.
You are mixing studies: the Haskell paper only looked at total core power (and how it was affected by the choice of algorithm). It was this [1] study that provided the 4.8w-out-of-22.1w number.
And while it's an interesting number, it says nothing about the overhead of decoding x86. They chose to label that component "instruction decoder" but really they were measuring the whole process of fetching instructions from anywhere other than the μop cache.
That 4.8w number includes the branch predictor, the L1 instruction cache, potentially TLB misses and instruction fetch from L2/L3. And depending on the details of their regression analysis, it might also include things after instruction decoding, like register renaming and move elimination. Given that they show the L1 data cache as using 3.8w, it's entirely possible that the L1 instruction cache actually makes up most of the 4.8w number.
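To illustrate why the component labels matter: counter-based power models of this kind typically do a least-squares fit from per-component event counts to measured power, so whatever events get grouped into the "decoder" column get attributed to the decoder. This is a made-up toy sketch (the event counts and coefficients are invented, not from the paper):

```python
# Toy sketch of performance-counter power attribution. All numbers
# are invented for illustration; this is not the paper's actual data.
import numpy as np

# Rows: measurement intervals. Columns: event counts for three
# component groups (e.g. "decode" bucket, L1D accesses, branch lookups).
events = np.array([
    [1.0e9, 2.0e9, 0.5e9],
    [2.0e9, 1.0e9, 0.8e9],
    [1.5e9, 1.5e9, 0.6e9],
    [0.5e9, 2.5e9, 0.4e9],
])
measured_power = np.array([7.0, 8.7, 7.65, 6.35])  # watts, invented

# Least-squares fit: energy-per-event coefficient for each column.
coef, *_ = np.linalg.lstsq(events, measured_power, rcond=None)

# "Decoder power" is then coefficient * event rate, so anything lumped
# into the decode column (L1I fetch, branch prediction, ...) ends up
# attributed to the "decoder" -- which is exactly the objection above.
attributed = coef * events[0]
print(attributed)  # per-component watts for the first interval
```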
So we really can't use this paper to say anything about the overhead of decoding x86, because that's not what it's measuring.
[1] https://www.usenix.org/system/files/conference/cooldc16/cool...
I’m not sure what point you’re trying to get me to concede. I’ve already stated that I’m on the RISC side of the debate. If your point is that it’s difficult to keep doing this with x86, I won’t argue with you. But that said, the x86 teams have kept up remarkably well so far. How far can they keep going? I don’t know. They’re already a lot further than everyone predicted they’d be a couple decades ago. Nearly free transistors, even if not fully free, are quite useful, it turns out.
We'll see. Free transistors are over. Cost per transistor has been stagnant or slightly increasing since 28nm.
https://www.semiconductor-digest.com/moores-law-indeed-stopp...
Sure, they’re getting less free and they were never completely free in any case. But that’s a straw man that nobody was trying to argue. So (again), what’s your point?
This is not really true. For example, an Apple M1 has Firestorm cores that are around 3.75mm^2, while a Zen 3 core from around the same era is just 3mm^2 (roughly).
I can make a silly 20-wide ARM CPU and a silly 1-wide x86 CPU, and the x86 will be smaller by a lot.
Where did you get your numbers?
M1 core is 2.28mm2 and Zen3 core is 4.05mm2 if you count all the core-specific stuff (and 3.09mm2 even if you exclude the power regulation and the L2 cache that only this core can use). That is 27-45% larger for generally worse performance (and all-around worse performance if per-core power is constrained to something realistic). I'd also note that Oryon and C1-Ultra seem to be much more area efficient than more recent Apple cores.
We're at a point where Apple's E-cores are getting close to Zen or Cove cores in IPC while using just 0.6w at 2.6GHz.
If you count EVERYTHING outside of the core (power, matrix co-processor, last-level cache, shared logic, etc.) and average it out, we get 15.555mm2 / 4 = 3.89mm2 per core for M1.
If we do the same for Zen3 (excluding test logic and Infinity Fabric), we have 67.85mm2 / 8 = 8.48mm2 per core.
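Double-checking that division on the area figures above (note 67.85 / 8 rounds to 8.48):

```python
# Per-core die-area arithmetic from the figures above:
# total block area divided by core count.
m1_area, m1_cores = 15.555, 4     # mm^2, M1 P-cluster incl. shared logic
zen3_area, zen3_cores = 67.85, 8  # mm^2, Zen3 CCD excl. test/fabric

m1_per_core = m1_area / m1_cores        # ~3.89 mm^2
zen3_per_core = zen3_area / zen3_cores  # ~8.48 mm^2
print(f"M1: {m1_per_core:.2f} mm^2, Zen3: {zen3_per_core:.2f} mm^2, "
      f"ratio: {zen3_per_core / m1_per_core:.2f}x")
```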
M1 has 12MB of last-level cache (coherent), or 3MB per core, while Zen3 has 4MB of coherent L2 and 32MB of victim cache (used to buffer IO/RAM reads/writes and to hold evicted lines in hopes that they can be reused eventually). You can analyze this however you like, but M1's cache design is more efficient and achieves a higher hit rate despite being smaller. Chips and Cheese has an interesting writeup of this as applied to Golden Cove.
https://semianalysis.com/2022/06/10/apple-m2-die-shot-and-ar...
https://x.com/Locuza_/status/1538696558920736769
https://semianalysis.com/2022/06/17/amd-to-infinity-and-beyo...
https://chipsandcheese.com/p/going-armchair-quarterback-on-g...
What point are you trying to argue with everyone? You seem to be quibbling over everything without stating an actual POV.
They called me out as being wrong then cited incorrect data to support their claim. I responded with the real numbers and sources to back them up. What else should be done?