According to this[0] study of the Ubuntu 16.04 package repos, 89% of all x86 code consisted of just 12 instructions (in order: mov, add, call, lea, je, test, jmp, nop, cmp, jne, xor, and).
The extra issue here is that SIMD (the main optimization) simply sucks to use. Auto-vectorization has been mostly a pipe dream for decades now, as the sufficiently-smart compiler simply hasn't materialized yet. Maybe that's for the same reason the EPIC/Itanium compilers failed: deterministically deciding execution order at compile time isn't possible in the abstract, and getting heuristics that aren't deceived by even tiny changes to the code is massively hard.
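To make the "tiny changes" point concrete, here is a minimal sketch in C++. The function names are made up for illustration. The first loop is a plain integer reduction, the kind of thing GCC and Clang can usually auto-vectorize at higher optimization levels. The second adds a data-dependent early exit, a small source change that has often been enough to leave the loop scalar. Whether either one actually gets vectorized depends on compiler, version, and flags, so treat this as an assumption to check rather than a guarantee.

```cpp
#include <cstddef>

// Plain reduction: no branches in the loop body, trip count known on entry.
// Compilers can usually turn this into SIMD adds plus a final horizontal sum.
int sum_all(const int* x, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}

// Same reduction with a data-dependent early exit. The trip count is no
// longer known up front, which many auto-vectorizers have handled poorly.
int sum_until_negative(const int* x, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] < 0) {
            break;  // stop at the first negative element
        }
        total += x[i];
    }
    return total;
}
```

To see what your own compiler does with these, GCC's -fopt-info-vec and Clang's -Rpass=loop-vectorize report which loops were vectorized.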
Doing SIMD means delving into x86 assembly and all its nastiness/weirdness/complexity. It's no wonder that devs won't touch it unless absolutely necessary (which is why the speedups are coming from a small handful of super-optimized math libraries). ARM vector code is also rather Byzantine for a normal dev to learn and use.
We need a simpler assembly option that normal programmers can easily learn and use. Maybe it's way less efficient than the current options, but slightly slower SIMD is still generally going to beat no SIMD at all.
Agner Fog's libraries make it pretty trivial for C++ programmers at least. https://www.agner.org/optimize/
The Highway library is exactly that kind of simpler option for using SIMD. It's less efficient than hand-written assembly, but you can easily write good-enough SIMD for multiple different architectures.
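For a feel of what "good enough" portable SIMD looks like without touching assembly, here is a rough sketch. To be clear, this is not Highway's API: it uses the GCC/Clang vector_size extension instead, so it assumes one of those compilers, and the f32x4 type and saxpy function are names made up for this example. The compiler lowers the same source to whatever vector instructions the target has, or to scalar code if it has none.

```cpp
#include <cstddef>
#include <cstring>

// GCC/Clang extension (assumed available): four floats handled as one value.
typedef float f32x4 __attribute__((vector_size(16)));

// Computes y[i] = a * x[i] + y[i] for i in [0, n).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    const f32x4 va = {a, a, a, a};  // broadcast the scalar to all four lanes
    std::size_t i = 0;

    // Main loop: four elements per iteration.
    for (; i + 4 <= n; i += 4) {
        f32x4 vx, vy;
        std::memcpy(&vx, x + i, sizeof vx);  // load without alignment assumptions
        std::memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;                   // element-wise multiply and add
        std::memcpy(y + i, &vy, sizeof vy);  // store the four results
    }

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The fixed 4-lane width is the main thing this gives up compared with a library like Highway, which can pick the vector width per target. That is roughly the trade being described: less than peak speed, but nothing architecture-specific in the source.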
The sufficiently smart vectoriser has been here for decades. CUDA is one: it uses all the vector units just fine, though it may struggle to use the scalar units.