It's also because around 20 years ago there was a "reset" when we switched from x86 to x86_64. When AMD introduced x86_64, it made a bunch of the previously optional extensions (SSE up to SSE2, etc.) a mandatory part of x86_64. Gentoo systems on x86 could already be optimized to use those instructions, but from then on (2004-ish) every x86_64 system was automatically taking full advantage of all of them*.

Since then we've slowly been accumulating optional extensions again: newer SSE versions, AVX, encryption and virtualization extensions, and probably some newfangled AI stuff I'm not on top of. So, very slowly, it may have started to make sense again for an approach like Gentoo's to exist**.

* the usual caveats apply: only if the compiler can figure out that using the instruction is actually beneficial, etc.

** but the same caveats as back then apply. A lot of software can't really take advantage of these new instructions, because newer instructions have become increasingly use-case-specific; and applications that benefit greatly from them tend to ship alternative code paths to exploit them anyway. Also, a lot of hardware acceleration has moved to GPUs, which have a feature-discovery process independent of the CPU instruction set anyway.

The llama.cpp package on Debian and Ubuntu is also rather clever in that it's built for x86-64-v1, x86-64-v2, x86-64-v3, and x86-64-v4. It benefits quite dramatically from using the newest instructions, but the library doesn't have dynamic instruction selection itself. Instead, ld.so decides which version of libggml.so to load depending on your hardware capabilities.
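(The mechanism behind this is glibc's hwcaps directory search: at startup, ld.so probes the CPU and prefers the most specific microarchitecture subdirectory it supports. A rough sketch of the layout, with hypothetical install paths:

```
/usr/lib/x86_64-linux-gnu/libggml.so                          # x86-64-v1 baseline
/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libggml.so   # SSE4.2-class CPUs
/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libggml.so   # AVX2-class CPUs
/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v4/libggml.so   # AVX-512-class CPUs
```

No dispatch code is needed in the library itself; the dynamic linker does the selection once, at load time.)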

> llama.cpp package on Debian and Ubuntu is also rather clever … ld.so decides which version of libggml.so to load depending on your hardware capabilities

Why is this "clever"? This is pretty much how "fat" binaries are supposed to work, no? At least, such packaging is the norm for Android.

> AVX, encryption and virtualization

I would guess that these are domain-specific enough that they can also mostly be enabled by the relevant libraries employing function multiversioning.
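For instance, with GCC/clang function multiversioning via the `target_clones` attribute (a minimal sketch; the function is made up for illustration):

```c
#include <stddef.h>

/* The compiler emits one clone per listed target plus an ifunc
 * resolver that picks the best clone once, at load time. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* loop is auto-vectorized per clone */
    return sum;
}
```

The rest of the program stays compiled for the baseline; only the hot function gets per-CPU variants.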

You would guess wrong.

Isn’t the whole thrust of this thread that most normal algorithms see little to no speedup from things like AVX, and that therefore multiversioning the things that do makes more sense than compiling the whole OS for a newer set of CPU features?
