I'm not familiar with GPU architecture. Isn't there a shared L2/L3 data cache that this data would be served from?

The MMU has a finite number of ports that drive the data to the consumers. An extreme case: all 32 cores want the same piece of data at the same time.
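
To put rough numbers on that extreme case, here is a toy model. It is only a sketch, not how any particular GPU works: the function name `cycles_to_serve` and the port counts are made up for illustration, and it assumes each port serves exactly one requester per cycle with no broadcast.

```python
import math


def cycles_to_serve(requesters: int, ports: int) -> int:
    """Cycles needed to satisfy all requesters.

    Assumption (hypothetical model): each port delivers data to exactly
    one requester per cycle, and there is no broadcast mechanism.
    """
    if ports <= 0:
        raise ValueError("ports must be positive")
    # Requests beyond the port count have to wait for a later cycle.
    return math.ceil(requesters / ports)


# 32 cores asking for the same data at once, with varying port counts.
for ports in (1, 4, 32):
    print(f"{ports} port(s): {cycles_to_serve(32, ports)} cycle(s)")
```

Under these assumptions the requests only complete in a single cycle when there are as many ports as requesters; with fewer ports they get serialized, shared cache or not.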