The various bits that get tacked on for prefetch and branch prediction are fairly large too, given how much caching they involve, and I think that is often what people are actually counting when they measure decode power usage. That’s going to be the case in any arch, except something like a DSP without any kind of dynamic dispatch.
I think it's safe to say that a modern x86 branch predictor with its BTBs is significantly larger than the decode block.
Sure, but branch prediction is (as far as we know) a necessary evil. Decode complexity simply isn't.
Right, but decode complexity doesn't matter because of the giant BTB and such. At least that's what I understand.
For the cores working hardest to achieve the absolute lowest CPI running user code, this is true. But these days computers have computers in them to manage the system, and these kinds of statements aren’t necessarily true for those “inner cores” that aren’t user accessible.
“RTKit: Apple's proprietary real-time operating system. Most of the accelerators (AGX, ANE, AOP, DCP, AVE, PMP) run RTKit on an internal processor. The string "RTKSTACKRTKSTACK" is characteristic of a firmware containing RTKit.”
https://asahilinux.org/docs/project/glossary/#r
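If you want to check a firmware blob yourself, a rough sketch of looking for that marker is below. The function name and the idea of reading the whole blob into memory are my own assumptions for illustration; the only thing taken from the glossary is the marker string, and finding it is a hint rather than proof.

```python
# Marker string quoted in the Asahi Linux glossary entry for RTKit.
RTKIT_MARKER = b"RTKSTACKRTKSTACK"


def find_rtkit_marker(blob: bytes) -> int:
    """Return the byte offset of the first marker occurrence, or -1 if absent.

    Hypothetical helper: it only does a plain substring search and knows
    nothing about the actual firmware container format.
    """
    return blob.find(RTKIT_MARKER)


def looks_like_rtkit(path: str) -> bool:
    """Read a firmware file (path is supplied by the caller) and test for the marker."""
    with open(path, "rb") as f:
        return find_rtkit_marker(f.read()) != -1
```

Something like `looks_like_rtkit("some_firmware.bin")` would then give a quick yes/no, where the filename is just a placeholder.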
And those cores do not run x86.
I was pretty surprised to find out that the weird non-architectural cores in a Core or Xeon really do run x86 code.