I guess performance would be very different if things had initially been assumed to run on a CPU.

I think it could be improved a lot by niche optimization passes in the codegen backend, kinda like autovectorization and similar optimizations in the CPU backends.