Well, there is a major architectural reason why the entire M-series appears to be "so fast": unified memory. It eliminates the buffer-to-buffer data copying that is probably over half of what a chip with a non-unified memory architecture is doing at any given time. You just reference the data where it is, and you're done.
I really like the principles behind AMD's chiplet design. Of course they had different design goals behind it (easier diversification of their product portfolio), but the fact remains that you can slap a not-so-terrible GPU right next to a CPU core.
There's probably a lot still missing: Apple integrated the memory on the same die, and built Metal for software to directly take advantage of that design. That's the competitive advantage of vertical integration.
> Apple integrated the memory on the same die
It's on the same package, but not on the same die.
Apple made a big deal about this, but other iGPUs have done this for years.
It's not just the GPU memory, it's also I/O memory. That speeds up a lot: you just update the pointer to where the data is, with no copying out of I/O memory.
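To make the "update the pointer, don't copy" idea concrete, here is a rough user-space analogy in Python. It is only a sketch: `io_buffer`, `copied` and `shared` are made-up names, and a `bytearray` is standing in for I/O memory. It says nothing about how the hardware or Metal actually implements this.

```python
# Analogy only: a bytearray stands in for memory a device just wrote into.
io_buffer = bytearray(b"frame-0001" * 4)

# Non-unified style: copy the bytes into a second buffer before using them.
copied = bytes(io_buffer)        # allocates new memory and copies every byte

# Unified style: hand over a reference to the same memory, no copy made.
shared = memoryview(io_buffer)

# A later in-place write is visible through the shared view but not in the
# copy, which shows the view aliases the original bytes.
io_buffer[0:5] = b"FRAME"
copy_sees_update = copied[0:5] == b"FRAME"
view_sees_update = bytes(shared[0:5]) == b"FRAME"
```

The copy costs time and memory proportional to the buffer size, while the view costs the same no matter how large the buffer is. That is roughly the saving being described, just at a much smaller scale.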
Is that what game consoles have done for years?