I've had the pleasure of working with some truly fast pieces of code written by experts. It's always both. You need a good sense of what's generally fast and what's not in order to design a system that doesn't contain intractable bottlenecks. And once you have a good design, you can profile and optimize the bottlenecks that remain.
For example, if you want to do fast math, you really need to design your pipeline around cache efficiency from the beginning; it's very hard to retrofit. Reducing memory allocations to make parallel algorithms faster, on the other hand, is something you can usually do after profiling.
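
To make that contrast concrete, here's a rough C++ sketch. The particle types and function names are made up for illustration and aren't from any real codebase. The first half shows why data layout is a design-time decision: switching from array-of-structs to struct-of-arrays changes the type every caller sees. The second half shows the kind of allocation fix that tends to stay local, so you can usually apply it once a profiler points at it.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: each particle's fields sit next to each other, so a
// pass that only reads x still drags y, z and mass through the cache.
struct ParticleAoS {
    float x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float total = 0.0f;
    for (const ParticleAoS& p : ps) total += p.x;
    return total;
}

// Struct-of-arrays: each field is its own contiguous array, so a pass over
// x reads only x. Every function that takes particles has to agree on this
// layout, which is why it is hard to change late.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;

    void push(float px, float py, float pz, float m) {
        x.push_back(px);
        y.push_back(py);
        z.push_back(pz);
        mass.push_back(m);
    }
};

float sum_x_soa(const ParticlesSoA& ps) {
    float total = 0.0f;
    for (float v : ps.x) total += v;
    return total;
}

// Allocating version: builds a fresh vector on every call.
std::vector<float> scaled_alloc(const std::vector<float>& in, float k) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * k;
    return out;
}

// Reusing version: the caller owns `out` and passes it in again on each
// call. resize() only allocates when the existing capacity is too small.
void scaled_into(const std::vector<float>& in, float k, std::vector<float>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * k;
}
```

Going from scaled_alloc to scaled_into touches one function and its call sites. Going from ParticleAoS to ParticlesSoA touches everything that handles particles. Whether either change pays off on a given workload is still something to measure, not assume.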