In one view, the fact that it's a software problem is actually a weakness of (GPU) hardware design.

In the olden, serial computing days, our algorithms were standard, and CPU designers did all sorts of behind-the-scenes tricks to improve performance without burdening software developers. It wasn't perfect abstraction, but they tried. Algorithms led the way; hardware had to follow.

CUDA threw that all away and exposed lots of ugly details of GPU hardware design that developers _had to_ take into account. This is why, for a long time, the groups that are now CUDA's primary customers (the HPC community & national labs) refused to adopt it.

It's interesting how much our view on this has shifted now that CUDA has become a legitimate, widely adopted computing paradigm.

You can still live in your abstract, imperfect universe; there's nothing stopping you.

I don't believe you really can in the GPU world. With a CPU, if you ignore something important like the cache hierarchy, the performance penalty is likely in the double-digit percentage range. That's something people can, and often do, ignore. With a GPU, there are many, many things (memory coalescing, warps, SRAM/shared memory) that can have a triple-digit percentage impact, hell, maybe even more than that.
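As a minimal sketch of the coalescing point: the two kernels below differ only in their global-memory access pattern. On typical NVIDIA hardware, the strided version forces each warp's load to be split across many memory transactions instead of a few wide ones, and it's easy to see it run several times slower than the coalesced one. Kernel names, the array size, and the stride are all illustrative, not from any particular codebase.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads in a warp read consecutive addresses,
// so the hardware can serve the whole warp with a few wide transactions.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` floats apart,
// scattering each warp's load across many separate transactions.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);  // scattered index, illustrative
        out[i] = in[j];
    }
}

int main() {
    const int n = 1 << 24;  // ~16M floats, illustrative size
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided<<<grid, block>>>(in, out, n, 32);  // stride of one warp width
    cudaDeviceSynchronize();

    // Timing these with cudaEvent_t (or nsys/ncu) typically shows the strided
    // kernel several times slower: a triple-digit % penalty from access
    // pattern alone, with no change to the arithmetic.
    cudaFree(in);
    cudaFree(out);
    printf("done\n");
    return 0;
}
```

And that's just one of the hardware details CUDA makes you own; warp divergence and shared-memory usage can each move the needle by similar amounts.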