Hacker News

The problem isn’t the fetching itself; it’s serving the data to many cores at the same time from a single source.

I'm not familiar with GPU architecture. Is there not a shared L2/L3 data cache that this data would be served from?

The MMU has a finite number of ports that drive the data to the consumers. An extreme case: all 32 cores want the same piece of data at the same time.
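
To put rough numbers on that extreme case, here is a toy model, not a description of any real GPU. It assumes each port can serve one core per cycle and that identical requests are not merged or broadcast (real hardware may well do that). The function name `cycles_to_serve` and the port counts are made up for illustration.

```python
import math


def cycles_to_serve(requesters: int, ports: int) -> int:
    """Cycles needed to serve every requester in a toy model.

    Assumption: each port delivers data to exactly one requester per
    cycle, with no broadcast or merging of identical requests.
    """
    if ports <= 0:
        raise ValueError("ports must be positive")
    return math.ceil(requesters / ports)


if __name__ == "__main__":
    # 32 cores all asking for the same data, with hypothetical port counts.
    for ports in (1, 4, 32):
        print(f"{ports} port(s): {cycles_to_serve(32, ports)} cycle(s)")
```

Under those assumptions, with a single port the 32 requests are fully serialized, and the wait only shrinks as fast as you add ports.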