If multiple cores try to fetch the same memory address, the MMU feeds only one core; the others have to wait. Depending on the type of RAM, this can cost a lot of cycles.
GPU MMUs can handle multiple cache lines in parallel, but not 10k cores at the same time. HBM cannot sustain 3.5 TB/s if the transfers have to be served sequentially.
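To see the serialization directly, here is a minimal CUDA sketch (the kernel names and counter buffer are my own, purely for illustration): atomic updates to a single address have to be applied one at a time, while the same number of updates spread across distinct addresses can proceed in parallel.

```cuda
#include <cuda_runtime.h>

// Every thread hammers the same address: the updates serialize.
__global__ void contended(unsigned int *counter) {
    atomicAdd(&counter[0], 1u);
}

// Each thread gets its own slot: no single contended address.
__global__ void spread(unsigned int *counters) {
    atomicAdd(&counters[blockIdx.x * blockDim.x + threadIdx.x], 1u);
}

int main() {
    const int blocks = 1024, threads = 256;
    unsigned int *buf;
    cudaMalloc(&buf, blocks * threads * sizeof(unsigned int));
    cudaMemset(buf, 0, blocks * threads * sizeof(unsigned int));

    contended<<<blocks, threads>>>(buf);  // all updates target buf[0]
    spread<<<blocks, threads>>>(buf);     // updates fan out across buf
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```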
Why is that? It seems like multiple cores requesting the same address would be easier for the MMU to serve, not harder.
Not necessarily the exact same address (you can fix that in a program anyway with a broadcast tree), but the same memory bank. Imagine 1000 trains leaving one small town at the same time, instead of 1000 trains leaving 1000 different towns simultaneously. At some point there are not enough transportation resources to move data out of a particular area at the desired parallelism.
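A minimal CUDA sketch of the first level of such a broadcast tree (the kernel name and buffers are hypothetical): instead of every thread loading the same global address, one thread per block fetches it once, and the rest read the block-local copy from shared memory.

```cuda
__global__ void broadcast_read(const float *hot, float *out, int n) {
    __shared__ float cached;  // one block-local copy of the hot value

    // First level of the broadcast tree: a single thread per block
    // touches the contended global address...
    if (threadIdx.x == 0) {
        cached = hot[0];
    }
    __syncthreads();

    // ...and every other thread reads the block-local copy instead,
    // so the global address sees one request per block, not per thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = cached * 2.0f;  // arbitrary use of the value
    }
}
```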
It’s not the fetching that is the problem, but serving the data to many cores at the same time from a single source.
I'm not familiar with GPU architecture; is there not a shared L2/L3 data cache from which this data would be served?
The MMU has a finite number of ports that drive data to the consumers. An extreme case: all 32 cores want the same piece of data at the same time.
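A minimal sketch of the finite-port problem, assuming NVIDIA's 32-bank shared-memory layout (one 32-bit word per bank per cycle; the kernel is hypothetical). One nuance: if all lanes of a warp read the exact same word, the hardware broadcasts it for free; the serialization happens when different words in the same bank are requested, which matches the "same bank, not same address" point above.

```cuda
__global__ void bank_demo(float *out) {
    __shared__ float buf[32 * 32];

    // Fill the buffer so the reads below have data to fetch.
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x) {
        buf[i] = (float)i;
    }
    __syncthreads();

    // Conflict-free: consecutive lanes hit consecutive banks,
    // so all 32 lanes of a warp are served in one cycle.
    float a = buf[threadIdx.x];

    // 32-way conflict: a stride of 32 words maps every lane to
    // bank 0, so the 32 reads are serialized one after another.
    float b = buf[(threadIdx.x % 32) * 32];

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}
```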
This is not my domain, but I assume the MMU acts like a switch and something like multicast is not available here. I've tried to implement something like that on an FPGA and it was extremely expensive.
I believe it's that the bus can only serve one chip at a time, so it actually has to be faster, since one chip's data will sometimes have to wait for another chip's transfer to finish first.