Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?

The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.

For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while they are packed into a single vector. Think of a WHERE clause on an arbitrary SQL schema evaluated fully in parallel in a handful of clock cycles. SIMD turns out to be brilliant for this, but it looks nothing like auto-vectorization.

None of the SIMD libraries like Google Highway cover this case.

I don't quite get how something like Highway doesn't cover this while raw intrinsics do.

Can you explain the usecase more concretely?

Almost literally what I stated. Consider a row in a Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single-digit clock cycles, even across many unrelated types. This is much faster than using secondary indexes in many cases.

It isn’t hypothetical; I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns in a schema of hundreds of columns at essentially full memory bandwidth.

Google Highway gets mentioned in the article.

There is Google’s Highway, which provides an abstraction layer. It is used by NumPy.