"Fusing im2col with matrix multiplication" is a confused way of saying that the convolution operation should be implemented directly in hardware.

There are two arguments in favor of im2col.

1. "I don't want to implement a dedicated software kernel just for convolutions" aka laziness

2. "I don't want to implement dedicated hardware just for convolution"

The former is a sham; the latter is motivated by silicon area constraints. A direct convolution requires exactly the same number of FMAs as the equivalent GEMM, so adding a dedicated convolution unit next to your matrix multiplication unit would double the FMA area of the chip, and unless you run matrix multiplications and convolutions simultaneously you're automatically cursed with 50% utilization from the start.
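The FMA-count claim is easy to check by hand. A quick sketch (shapes are hypothetical, batch of 1): every output pixel of a direct convolution accumulates `C_in * K * K` products, which is exactly one column's worth of work in the im2col GEMM.

```python
# Hypothetical layer: C_in input channels, C_out filters, KxK kernel,
# HxW output spatial size (batch of 1, stride 1).
C_in, C_out, K, H, W = 64, 128, 3, 56, 56

# Direct convolution: each of the C_out*H*W output pixels accumulates
# C_in*K*K products.
fmas_direct = C_out * H * W * (C_in * K * K)

# im2col + GEMM: a (C_out, C_in*K*K) filter matrix times a
# (C_in*K*K, H*W) patch matrix.
fmas_gemm = C_out * (C_in * K * K) * (H * W)

assert fmas_direct == fmas_gemm  # same arithmetic, just reordered
```

The two expressions are the same product with the factors regrouped, which is the whole point: im2col changes the data layout, not the amount of arithmetic.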

When you read answers like this: https://stackoverflow.com/a/47422548, they are subtly wrong.

"Element wise convolution performs badly because of the irregular memory accesses involved in it." At first glance this sounds like a reasonable argument, but all im2col does is shift the "irregular memory accesses" into a separate kernel. It doesn't fundamentally get rid of them.
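To make that concrete, here's a minimal im2col sketch (single channel, stride 1, no padding; the function name and shapes are mine). Note that the strided window gathers inside the loop are exactly the accesses a direct convolution would perform; the GEMM that follows is regular only because this kernel absorbed the irregularity first.

```python
import numpy as np

def im2col(x, k):
    # Gather every kxk patch of x into one column of the output matrix.
    # These strided reads are the same "irregular" accesses a direct
    # convolution does -- they were moved here, not eliminated.
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow), dtype=x.dtype)
    for i in range(k):
        for j in range(k):
            cols[i * k + j] = x[i:i + oh, j:j + ow].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)
w = np.ones((3, 3))
# Convolution as GEMM: (1, k*k) @ (k*k, oh*ow), reshaped back to 2D.
y = (w.ravel() @ im2col(x, 3)).reshape(2, 2)
```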

The problem with the answer is that the irregularity is purely a matter of perspective. If you implement im2col in hardware, there is in fact nothing difficult about it: the access pattern that looks irregular to software is perfectly predictable from the hardware's point of view.

All you do is load x pixels from y rows simultaneously, which is extremely data-parallel and SIMD-friendly. Once the data is in local registers, you can access it any way you want (each register is effectively its own bank), which lets you produce the im2col output stream and feed it straight into your matrix multiplication unit. You could have implemented the convolution directly instead, but then again you'd be back to 50% utilization due to the inflexibility.
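The streaming view above can be sketched in software (this is an assumed model of such a unit, not a real design): each outer step does one wide, contiguous load of a full row, and all the "irregular" rearranging happens on locally held values, where arbitrary access is free.

```python
def im2col_stream(x, k):
    # Model: hold k full rows locally ("registers"), loaded with wide
    # contiguous reads; emit kxk patches in im2col order, one GEMM
    # column at a time. Only the local shuffle is "irregular", and local
    # register access is free.
    h, w = len(x), len(x[0])
    rows = x[:k]  # initial wide parallel loads of k rows
    for top in range(h - k + 1):
        for left in range(w - k + 1):
            # Arbitrary local shuffle: one patch = one output column.
            yield [rows[i][left + j] for i in range(k) for j in range(k)]
        if top + k < h:
            rows = rows[1:] + [x[top + k]]  # slide down: load one new row
```

For a 3x3 input and k=2 this yields the four 2x2 patches in raster order, ready to stream into a matrix multiplication unit.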

Worse, the answer compares im2col performance on a GPU, where you don't need an explicit im2col pass in the first place.