Yep. And then we learned that transistors were almost free, could be purchased in lot quantities of 1 billion, and could be used to create a translation layer between a CISC instruction stream and a RISC core.

> and a RISC core.

not actually a RISC core. µops are simpler than the externally visible instruction set but they are not RISC.

(There were early x86 implementations that really were an x86 decoder bolted onto a preexisting RISC design. They didn't really perform well.)

Turns out it's useful to have memory operands -- even read-modify-write operands -- in your µops. Turns out it's useful to have instructions that are wider than 32 bits. Turns out it's useful to have big literals, even if the immediate field has to be shared by a whole group of 3-4 µops (so only one of them can have a big literal). Flags are also not necessarily the problem RISC people said they were (and several RISCs actually do have flags despite the Computer Architecture 101 dogmas). Having just a reg+ofs addressing mode turns out to be a bad idea.
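
To make the memory-operand and big-literal points concrete, here is a hedged sketch (the assembly in the comments is approximate, not exact compiler output, and the function names are made up for illustration):

    /* C whose natural x86-64 lowering uses a read-modify-write memory
       operand and a 64-bit immediate -- the kinds of things the µop
       format keeps rather than forcing into load/store pieces. */
    #include <stdint.h>

    int64_t counter;

    void bump(void)
    {
        /* Typically a single RMW instruction, roughly:
               add QWORD PTR counter[rip], 1
           which the core can keep as a µop with a memory operand (or a
           small load/op/store group) instead of routing it through a
           spare architectural register as a strict load/store ISA would. */
        counter += 1;
    }

    int64_t big_literal(void)
    {
        /* A 64-bit literal needs an encoding wider than 32 bits; on
           x86-64 it is one instruction, roughly:
               mov rax, 0x123456789abcdef0
           while a classic 32-bit fixed-width RISC needs a multi-
           instruction sequence or a constant-pool load. */
        return 0x123456789abcdef0;
    }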

Indirect addressing à la PDP-11, VAX, and 68K (and many long forgotten architectures) turned out not to be a good idea, of course.

Sure, it’s not coded the same way as your external ISA does it (like you said, lots of wide information), but the core is optimized for a load/store architecture with ops that generally execute in a single pipeline stage, and the big, complex instructions in the ISA map to multiple smaller micro-ops. But yes, it is NOT literally an x86 decoder wrapped around a MIPS or ARM core.

No. NOT a load/store architecture. People tried that in the 90's for x86. Doesn't work nearly as well as keeping µops generally the same as the simpler move/ALU instructions of the external ISA, just encoded differently. That's what modern x86 uses, that's what modern z/Arch ("S/360-31-64++") uses.

x86 and z/Arch instructions for move/ALU/jmp/conditional branch are fine. They are not hard to decode and they are not hard to execute. That's the core of the µop instruction set (just encoded differently). Then they add some specialty stuff necessary to implement more complex instructions using sequences of µops -- and of course the SIMD stuff. That's it. It's a simpler version of the external ISA, NOT a RISC.

Part of the path the µops take is wide, possibly with the option of a shared field for a large immediate.

Instructions that map to short, fixed-length µop sequences are handled directly by the decoders, which spit out a wide chunk of µops. Longer or variable-length µop sequences are handled by having the decoders emit an index into a ROM of wide µop chunks. The decoders can often spit out both the first wide chunk AND the index (so there's no need to wait a cycle for the ROM to emit µops).

Multi-cycle µops are not much of a problem, as long as the cycle count is predictable, preferably statically predictable. It is common to have µops that are "multi-issue", for example if they involve memory operands.
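
A rough, purely illustrative sketch of that decode path (not any real microarchitecture -- the struct layout, field names, and ROM index are all made up), just to show the "direct chunk vs. ROM index" split described above:

    /* Hypothetical model of the decode step: simple instructions expand
       directly into a wide chunk of µops; complex ones also (or instead)
       hand back an index into a µop ROM, and the decoder can emit the
       first chunk AND the index in the same cycle. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t opcode;          /* internal µop operation */
        uint8_t dst, src1, src2; /* register fields */
        int64_t imm;             /* wide immediate; real designs often share
                                    one such field across a 3-4 µop group */
    } uop;

    typedef struct {
        size_t  count;           /* µops emitted directly this cycle */
        uop     chunk[4];        /* one "wide chunk" of µops */
        int32_t rom_index;       /* -1 = no microcode ROM sequence needed */
    } decode_result;

    decode_result decode(unsigned insn_class)
    {
        decode_result r = { .count = 0, .rom_index = -1 };

        switch (insn_class) {
        case 0:  /* simple reg-reg ALU op: one µop, no ROM */
            r.chunk[0] = (uop){ .opcode = 1 };
            r.count = 1;
            break;
        case 1:  /* read-modify-write memory op: short fixed sequence */
            r.chunk[0] = (uop){ .opcode = 2 };  /* address/load  */
            r.chunk[1] = (uop){ .opcode = 3 };  /* ALU           */
            r.chunk[2] = (uop){ .opcode = 4 };  /* store         */
            r.count = 3;
            break;
        default: /* long/variable sequence: first chunk plus ROM entry point,
                    so execution needn't wait a cycle for the ROM */
            r.chunk[0] = (uop){ .opcode = 5 };
            r.count = 1;
            r.rom_index = 0x40;                 /* made-up ROM address */
            break;
        }
        return r;
    }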

Hm. Maybe I’m using terms incorrectly or my model for what is happening is incorrect. Please educate me.

“NOT a load/store architecture.” What do you mean by this, exactly?

“It's a simpler version of the external ISA, NOT a RISC.” How are you defining “RISC,” exactly? What makes it not a RISC given that you’re also saying it’s simpler?

Transistors aren’t free. x86 cores are all bigger than their ARM competitors even when on the same node while also getting worse performance per watt.

The translation layers cost time and money to build that could be spent making the rest of the chip faster. They suck up extra die area and use power.

ARM’s total income in 2024 was half of AMD’s R&D budget, but the core they finished that year can get x86 desktop levels of performance in a phone.

The idea that there’s no cost to x86 just doesn’t seem to hold up under even mild scrutiny.

Yes, that’s why I said almost free. But that said, as I understand it, the x86 decoders aren’t that much of the die area of a modern design. Most of it is L1 cache, GPUs, neural engines, etc., which makes simple die area comparisons of modern processors a bit useless for this particular question. You’re really comparing all that other stuff. To be clear, I’m squarely on the RISC side of any “debate,” but it was interesting to watch how the CISC designs evolved in the early 2000s to maintain their relevance.

The “no cost” is an evidence free zone.

ARM shed 75% of their decoder size when they removed support for the 32-bit ISA in their A-series cores, and that ISA is nowhere near as bad to decode as x86.

A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder. That’s 22% of total power. Even if you say that’s the high-water mark and it’s usually half of that, it’s still a massive consideration.

If the decoder power and size weren’t an issue, they’d have moved past 4-wide decode years ago. Instead, Golden Cove only recently went 6-wide and power consumption went through the roof. Meanwhile, AMD went with a ludicrously complex double 4-wide decoder setup that limits throughput under some circumstances and creates all kinds of headaches and hazards that must be dealt with.

Nobody would do this unless the cost of scaling x86 decode were immense. Instead, they’d just slap on another decoder like ARM or RISC-V designs do.

> A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder.

You are mixing studies; the Haskell paper only looked at total core power (and checked how it was impacted by algorithms). It was this [1] study that provided the 4.8w out of 22.1w number.

And while it's an interesting number, it says nothing about the overhead of decoding x86. They chose to label that component "instruction decoder" but really they were measuring the whole process of fetching instructions from anywhere other than the μop cache.

That 4.8w number includes the branch predictor, the L1 instruction cache, potentially TLB misses and instruction fetch from L2/L3. And depending on the details of their regression analysis, it might also include things after instruction decoding, like register renaming and move elimination. Given that they show the L1 data cache as using 3.8w, it's entirely possible that the L1 instruction cache actually makes up most of the 4.8w number.

So we really can't use this paper to say anything about the overhead of decoding x86, because that's not what it's measuring.

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...

I’m not sure what point you’re trying to get me to concede. I’ve already stated that I’m on the RISC side of the debate. If your point is that it’s difficult to keep doing this with x86, I won’t argue with you. But that said, the x86 teams have kept up remarkably well so far. How far can they keep going? I don’t know. They’re already a lot further than everyone predicted they’d be a couple decades ago. Nearly free transistors, even if not fully free, are quite useful, it turns out.

We'll see. Free transistors are over. Cost per transistor has been stagnant or slightly increasing since 28nm.

https://www.semiconductor-digest.com/moores-law-indeed-stopp...

Sure, they’re getting less free and they were never completely free in any case. But that’s a straw man that nobody was trying to argue. So (again), what’s your point?

This is not really true. For example, an Apple M1 has Firestorm cores that are around 3.75mm^2, and a Zen3 core from around the same era is just 3mm^2 (roughly).

I can make a silly 20 wide arm cpu and a silly 1 wide x86 cpu and the x86 will be smaller by a lot.

Where did you get your numbers?

M1 core is 2.28mm^2 and Zen3 core is 4.05mm^2 if you count all the core-specific stuff (and 3.09mm^2 even if you exclude the power regulation and the L2 cache that only this core can use). That is 36-78% larger for generally worse performance (and all-around worse performance if per-core power is constrained to something realistic). I'd also note that Oryon and C1-Ultra seem to be much more area efficient than more recent Apple cores.

We're at a point where Apple's E-cores are getting close to Zen or Cove cores in IPC while using just 0.6w at 2.6GHz.

If you count EVERYTHING outside of the core (power, matrix co-processor, last-level cache, shared logic, etc.) and average it out, we get 15.555mm^2 / 4 = 3.89mm^2 per core for M1.

If we do the same for Zen3 (excluding test structures and Infinity Fabric), we have 67.85mm^2 / 8 = 8.48mm^2 per core.

M1 has 12MB of last-level cache (coherent), or 3MB per core, while Zen3 has 4MB of coherent L2 and 32MB of victim cache (used to buffer IO/RAM reads and writes and to hold lines evicted from L2 in hopes that they can be reused eventually). You can analyze this however you like, but M1's cache design is more efficient and gives a higher hit rate despite being smaller. Chips and Cheese has an interesting writeup of this as applied to Golden Cove.

https://semianalysis.com/2022/06/10/apple-m2-die-shot-and-ar...

https://x.com/Locuza_/status/1538696558920736769

https://semianalysis.com/2022/06/17/amd-to-infinity-and-beyo...

https://chipsandcheese.com/p/going-armchair-quarterback-on-g...

What point are you trying to argue with everyone? You seem to be quibbling over everything without stating an actual POV.

They called me out as being wrong then cited incorrect data to support their claim. I responded with the real numbers and sources to back them up. What else should be done?

Not for VAX, see the recent thread on it [1].

[1] https://news.ycombinator.com/item?id=45378413

I’m not sure I get your point? Are you saying it is impossible to accelerate the VAX instruction set via the same technique used on x86? If so, you’ll have to explain why. Now, whether you’d want to or not is another question.

Yes, it's not possible to do it the same way, because what made x86 successful in this application is that x86 is remarkably RISC-y in actual behaviour compared to 68k or VAX.

The main reason for that is that x86 code, aside from being register-poor (leading to lots of stack use, etc.), decomposes most "CISCy" operations into LEA (inlined in the pipeline) + memory access + the actual operation.

VAX (and to a lesser extent, m68k and others) had multiple indirections just to get the operands right, way more than what is essentially a single LEA instruction. The most complex VAX instructions could have been ignored as "this one is super slow and rarely used", but the burden of handling the indirections remains, including possibly huge memory latency costs.
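
A hedged illustration of that decomposition point (the x86-64 assembly in the comments is approximate, and the VAX operand is just an example of a memory-deferred mode, not output from any real compiler):

    #include <stdint.h>

    int32_t fetch_add(int32_t *base, int64_t i)
    {
        /* x86-64: the whole address computation folds into one addressing
           mode (or a single LEA), then a plain load, then the ALU op:
               mov eax, DWORD PTR [rdi + rsi*4 + 8]
               add eax, 1
           -- at most one simple address calculation per operand, which maps
           cleanly onto load + ALU µops.

           A VAX-style memory-deferred operand such as @8(Rn) instead makes
           the hardware read a pointer from memory and then dereference it,
           so a single instruction can imply several dependent memory
           accesses just to form its operands -- which is what makes that
           style much harder to crack into fast µops. */
        return base[i + 2] + 1;
    }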

Also, the VAX instruction encoding is a class of horror above that of x86.

A few classes above.

It's the only ISA where I've seen a single instruction span two memory pages despite being page-aligned.

Two cache lines, sure, on earlier models, but not two VM pages!!

Maximum instruction length is 56 bytes. Early models had 8-byte cache lines, later ones 64 bytes. VM pages are 512 bytes.

I might be mistaken then, but I recall reading something about the most extreme decode on VAX running to ~522 bytes.

What I am more certain of were complaints about possibly ending up with maaaany TLB lookups (and page table walks) for certain "business"-optimized instructions.