Isn't this a software problem being solved in hardware? Ideally you would avoid going to memory in the first place by fusing the operations, which should be much faster than speeding up memory ops. E.g. you should never do an explicit im2col before a convolution; it should be fused. However, it's hard to argue with a 0.019 mm² area increase.
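To make the distinction concrete, here is a minimal pure-Python sketch of the two approaches. It assumes a single channel, stride 1, no padding, and the cross-correlation convention used by ML frameworks; the function names are made up for illustration. The explicit path materializes the patch matrix in memory before multiplying, while the fused path computes the same patch indices on the fly.

```python
def im2col(image, k):
    """Materialize every k x k patch of a 2-D list as one row of a matrix."""
    h, w = len(image), len(image[0])
    return [
        [image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
        for y in range(h - k + 1)
        for x in range(w - k + 1)
    ]


def conv_explicit(image, kernel):
    """Explicit path: build the patch matrix in memory, then multiply."""
    k = len(kernel)
    flat = [v for row in kernel for v in row]
    patches = im2col(image, k)  # extra buffer, up to about k*k times the input
    return [sum(p * f for p, f in zip(patch, flat)) for patch in patches]


def conv_fused(image, kernel):
    """Fused path: same arithmetic, but no patch matrix is ever stored."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            acc = 0
            for dy in range(k):
                for dx in range(k):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            out.append(acc)
    return out


image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
assert conv_explicit(image, kernel) == conv_fused(image, kernel)
```

Both functions do the same multiply-adds; the only difference is whether the intermediate patch matrix makes a round trip through memory.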
In one view, the fact that it's a software problem is actually a weakness of (GPU) hardware design.
In the olden, serial computing days, our algorithms were standard, and CPU designers did all sorts of behind-the-scenes tricks to improve performance without burdening software developers. It wasn't a perfect abstraction, but they tried. Algorithms led the way; hardware had to follow.
CUDA threw all of that away and exposed lots of ugly details of GPU hardware design that developers _had to_ take into account. This is why, for a long time, CUDA's primary customers (the HPC community and national labs) refused to adopt CUDA.
It's interesting how much our view on this has shifted now that CUDA has become a legitimate, widely adopted computing paradigm.
You can still live in your abstract, imperfect universe; there's nothing stopping you.
I don't believe you really can in the GPU world. With a CPU, if you ignore something important like the cache hierarchy, the performance penalty is likely to be a double-digit percentage, something people can and often do ignore. With a GPU, there are many, many things (memory coalescing, warps, SRAM) that can have a triple-digit percentage impact, hell, maybe even more than that.
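As a rough illustration of why coalescing alone can matter that much, here is a toy model in Python that counts how many memory segments one warp touches for a given access stride. The 32-thread warp matches NVIDIA GPUs; the 128-byte segment size and the function name are assumptions of this sketch, and real hardware (sectors, caches) is more nuanced.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128  # assumed transaction granularity for this sketch
ELEM_BYTES = 4       # float32


def transactions_per_warp(stride):
    """Count distinct segments touched when thread t reads element t * stride."""
    segments = {
        (t * stride * ELEM_BYTES) // SEGMENT_BYTES for t in range(WARP_SIZE)
    }
    return len(segments)


coalesced = transactions_per_warp(1)  # neighbouring threads read neighbouring floats
strided = transactions_per_warp(32)   # e.g. walking down a column of a row-major matrix
assert strided == 32 * coalesced
```

In this model the strided pattern moves 32 times as much data for the same useful payload, which is the kind of gap that is hard to ignore.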
"Fusing im2col with matrix multiplication" is a confused way of saying that the convolution operation should be implemented directly in hardware.
There are two arguments in favor of im2col.
1. "I don't want to implement a dedicated software kernel just for convolutions" aka laziness
2. "I don't want to implement dedicated hardware just for convolution"
The former is a sham; the latter is motivated by silicon area constraints. Implementing convolutions requires exactly the same number of FMAs, so you would end up doubling your chip size and automatically be cursed with 50% utilization from the start unless you do both matrix multiplication and convolutions simultaneously.
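The FMA claim can be checked with simple counting. The sketch below assumes stride 1 and no padding, and the function names are hypothetical; it compares the multiply-add count of a direct convolution with that of the GEMM you get after im2col.

```python
def fmas_direct_conv(h, w, k, c_in, c_out):
    """One FMA per kernel tap, per input channel, per output value."""
    out_h, out_w = h - k + 1, w - k + 1
    return out_h * out_w * c_out * (k * k * c_in)


def fmas_im2col_gemm(h, w, k, c_in, c_out):
    """GEMM of an (M x K) patch matrix with a (K x N) weight matrix."""
    m = (h - k + 1) * (w - k + 1)  # one row per output pixel
    kk = k * k * c_in              # one column per kernel tap and input channel
    n = c_out                      # one output column per filter
    return m * kk * n


assert fmas_direct_conv(224, 224, 3, 64, 64) == fmas_im2col_gemm(224, 224, 3, 64, 64)
```

The arithmetic is identical either way, so a dedicated convolution datapath would duplicate FMA capacity that the matrix multiplication unit already provides.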
Answers like this one are subtly wrong: https://stackoverflow.com/a/47422548
"Element wise convolution performs badly because of the irregular memory accesses involved in it." sounds like a reasonable argument at first glance, but all you're doing with im2col is shifting the "irregular memory accesses" into a separate kernel. It doesn't fundamentally get rid of them.
The problem with the answer is that the irregularity is purely a result of one's perspective. Assuming you implement im2col in hardware, there is nothing difficult about the irregularity. What is considered irregular here is perfectly predictable from the perspective of the hardware.
All you do is load x pixels from y rows simultaneously, which is extremely data parallel and SIMD friendly. Once the data is in local registers, you can access it any way you want (each register is effectively its own bank), which allows you to easily produce the im2col output stream and feed it straight to your matrix multiplication unit. You could have implemented the convolution directly, but then again you'd only get 50% utilization due to inflexibility.
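A software analogy of that datapath, as a minimal sketch: rows are read strictly in order, the last k rows are held in a small window (standing in for the local registers), and im2col patches are produced as a stream that feeds the multiply-accumulate directly. Single channel, stride 1, no padding; the names are made up for illustration and this is not a description of any specific chip.

```python
from collections import deque


def stream_patches(rows, k):
    """Yield k*k patches in raster order while reading each input row once."""
    window = deque(maxlen=k)  # stands in for the k rows held in local registers
    for row in rows:          # rows arrive in plain sequential order
        window.append(row)
        if len(window) < k:
            continue          # not enough rows buffered yet
        for x in range(len(row) - k + 1):
            # Any access pattern is cheap once the data is in the window.
            yield [window[dy][x + dx] for dy in range(k) for dx in range(k)]


def conv_streamed(rows, kernel):
    """Consume the patch stream directly; no patch matrix is ever stored."""
    k = len(kernel)
    flat = [v for r in kernel for v in r]
    return [
        sum(p * f for p, f in zip(patch, flat))
        for patch in stream_patches(rows, k)
    ]


image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
result = conv_streamed(iter(image), kernel)
```

Passing `iter(image)` is deliberate: the generator never needs random access to the input, only the next row.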
They compare im2col performance with a GPU, while you don't need explicit im2col on a GPU.