The “no cost” claim is an evidence-free zone.

ARM shed 75% of its decoder size when it dropped 32-bit ISA support from its A-series cores, and ARM decode is nowhere near as complex as x86 decode.

A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder. That’s 22% of total power. Even if you say that’s the high-water mark and it’s usually half of that, it’s still a massive consideration.

If decoder power and size weren’t an issue, they’d have moved past 4-wide decode years ago. Instead, Golden Cove only recently went 6-wide, and its power consumption went through the roof. Meanwhile, AMD went with a ludicrously complex dual 4-wide decoder setup that limits throughput under some circumstances and creates all kinds of headaches and hazards that have to be dealt with.
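For intuition, here’s a toy model (purely illustrative, not AMD’s or anyone’s actual design) of why a clustered decoder can limit throughput: assume a second cluster can only pick up the instruction stream at a branch boundary, so long straight-line stretches leave it idle.

    # Toy model of a clustered decoder (illustrative only, not a real design):
    # each cluster decodes up to WIDTH instructions per cycle, but a cluster
    # can only start on a new straight-line run at a branch boundary.
    WIDTH = 4        # instructions per cluster per cycle
    CLUSTERS = 2

    def decode_throughput(block_lengths):
        """block_lengths: lengths of straight-line runs between taken branches.
        Returns average instructions decoded per cycle in this toy model."""
        total_insts = sum(block_lengths)
        cycles = 0
        for i in range(0, len(block_lengths), CLUSTERS):
            group = block_lengths[i:i + CLUSTERS]   # runs decoded in parallel
            cycles += max(-(-length // WIDTH) for length in group)  # ceil division
        return total_insts / cycles

    print(decode_throughput([400]))     # one long run  -> 4.0 (one cluster busy)
    print(decode_throughput([8] * 50))  # branchy code  -> 8.0 (both clusters busy)

Branchy code approaches the combined width; a long unbroken run is stuck at a single cluster’s width. That’s the kind of corner case a single monolithic wide decoder doesn’t have.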

Nobody would do this unless the cost of scaling x86 decode were immense. Otherwise, they’d just slap on another decoder the way ARM and RISC-V designs do.

> A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder.

You are mixing studies; the Haskell paper only looked at total core power (and how it was affected by different algorithms). It was this [1] study that provided the 4.8w out of 22.1w number.

And while it's an interesting number, it says nothing about the overhead of decoding x86. They chose to label that component "instruction decoder" but really they were measuring the whole process of fetching instructions from anywhere other than the μop cache.

That 4.8w number includes the branch predictor, the L1 instruction cache, potentially TLB misses and instruction fetch from L2/L3. And depending on the details of their regression analysis, it might also include things after instruction decoding, like register renaming and move elimination. Given that they show the L1 data cache as using 3.8w, it's entirely possible that the L1 instruction cache actually makes up most of the 4.8w number.
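For intuition, here's a rough sketch of how that kind of counter-based power regression works (made-up counter names and synthetic numbers, not the paper's actual model or methodology):

    # Toy counter-based power attribution (illustrative only): regress measured
    # core power against per-cycle event rates and read the fitted coefficients
    # as "watts per component". All names and numbers below are invented.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500  # measurement intervals

    # Hypothetical per-cycle event rates from performance counters.
    legacy_fetch = rng.uniform(0.0, 1.0, n)  # uop-cache misses -> L1I/decode path
    l1d_access   = rng.uniform(0.0, 2.0, n)
    X = np.column_stack([legacy_fetch, l1d_access, np.ones(n)])  # + static term

    # Synthetic "measured" power: pretend 5w scales with the legacy fetch path,
    # 4w with L1D traffic, 8w is static, plus measurement noise.
    power = X @ np.array([5.0, 4.0, 8.0]) + rng.normal(0.0, 0.3, n)

    coef, *_ = np.linalg.lstsq(X, power, rcond=None)
    for name, watts in zip(["legacy fetch/decode", "L1D", "static"], coef):
        print(f"{name:>20}: {watts:5.2f}w")

The coefficient labelled "legacy fetch/decode" absorbs everything that correlates with leaving the μop cache (L1I accesses, branch predictor activity, ITLB walks, fetch from L2/L3), so the label tells you very little about the decode logic itself.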

So we really can't use this paper to say anything about the overhead of decoding x86, because that's not what it's measuring.

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...

I’m not sure what point you’re trying to get me to concede. I’ve already stated that I’m on the RISC side of the debate. If your point is that it’s difficult to keep doing this with x86, I won’t argue with you. But that said, the x86 teams have kept up remarkably well so far. How far can they keep going? I don’t know. They’re already a lot further than everyone predicted they’d be a couple decades ago. Nearly free transistors, even if not fully free, are quite useful, it turns out.

We'll see. Free transistors are over. Cost per transistor has been stagnant or slightly increasing since 28nm.

https://www.semiconductor-digest.com/moores-law-indeed-stopp...
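To make that concrete, a back-of-the-envelope illustration (purely made-up round numbers, not figures from the linked article):

    cost per transistor = wafer cost / transistors per wafer

    hypothetical node A:  $5,000 wafer,  1e12 transistors  -> $5e-9 per transistor
    hypothetical node B:  2x the density, but 2x the wafer cost:
                          $10,000 wafer, 2e12 transistors  -> $5e-9 per transistor

Density keeps improving, but if wafer cost rises in step, the price per transistor stops falling. That's what "free transistors are over" means in practice.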

Sure, they’re getting less free, and they were never completely free in any case. But that’s a straw man; nobody was arguing that they were. So (again), what’s your point?