> show that most packages show a slight (around 1%) performance improvement
This takes me back to arguing with Gentoo users 20 years ago who insisted that compiling everything from source for their machine made everything faster.
The consensus at the time was basically "theoretically, it's possible, but in practice, gcc isn't really doing much with the extra instructions anyway".
Then there's stuff like glibc which has custom assembly versions of things like memcpy/etc, and selects from them at startup. I'm not really sure if that was common 20 years ago but it is now.
It's cool that after 20 years we can finally start using the newer instructions in binary packages, but it definitely seems to not matter all that much, still.
It's also because around 20 years ago there was a "reset" when we switched from x86 to x86_64. When AMD introduced x86_64, it made a bunch of the previously optional extensions (SSE up to a certain version etc) a mandatory part of x86_64. Gentoo systems could already be optimized before on x86 using those instructions, but now (2004ish) every system using x86_64 was automatically always taking full advantage of all of these instructions*.
Since then we've slowly started accumulating optional extensions again; newer SSE versions, AVX, encryption and virtualization extensions, probably some more newfangled AI stuff I'm not on top of. So very slowly it might have started again to make sense for an approach like Gentoo to exist**.
* usual caveats apply; if the compiler can figure out that using the instruction is useful etc.
** but the same caveats as back then apply. A lot of software can't really take advantage of these new instructions, because newer instructions have been getting increasingly more use-case-specific; and applications that can greatly benefit from them will already have alternative code paths to take advantage of them anyway. Also a lot of the stuff happening in hardware acceleration has moved to GPUs, which have a feature discovery process independent of the CPU instruction set anyway.
The llama.cpp package on Debian and Ubuntu is also rather clever in that it's built for x86-64-v1, x86-64-v2, x86-64-v3, and x86-64-v4. It benefits quite dramatically from using the newest instructions, but the library doesn't have dynamic instruction selection itself. Instead, ld.so decides which version of libggml.so to load depending on your hardware capabilities.
> llama.cpp package on Debian and Ubuntu is also rather clever … ld.so decides which version of libggml.so to load depending on your hardware capabilities
Why is this "clever"? This is pretty much how "fat" binaries are supposed to work, no? At least, such packaging is the norm for Android.
> AVX, encryption and virtualization
I would guess that these are domain-specific enough that they can also mostly be enabled by the relevant libraries employing function multiversioning.
You would guess wrong.
Isn’t the whole thrust of this thread that most normal algorithms see little to no speedup from things like AVX, and therefore multiversioning the things that do benefit makes more sense than compiling the whole OS for a newer set of CPU features?
FWIW the cool thing about gentoo was the "use-flags", to enable/disable compile-time features in various packages. Build some apps with GTK or with just the command-line version, with libao or pulse-audio, etc. Nowadays some distro packages have "optional dependencies" and variants like foobar-cli and foobar-gui, but not nearly as comprehensive as Gentoo of course. Learning about some minor custom CFLAGS was just part of the fun (and yeah some "funroll-loops" site was making fun of "gentoo ricers" way back then already).
I used Gentoo a lot, jeez, between 20 and 15 years ago, and the install guide guiding me through partitioning disks, formatting disks, unpacking tarballs, editing config files, and running grub-install etc, was so incredibly valuable to me that I have trouble expressing it.
I still use Gentoo for that reason, and I wish some of those principles around handling of optional dependencies were more popular in other Linux distros and package ecosystems.
There's lots of software applications out there whose official Docker images or pip wheels or whatever bundle everything under the sun to account for all the optional integrations the application has, and it's difficult to figure out which packages can be easily removed if we're not using the feature and which ones are load-bearing.
I started with Debian on CDs, but used Gentoo for years after that. Eventually I admitted that just Ubuntu suited my needs and used up less time keeping it up to date. I do sometimes still pull in a package that brings a million dependencies for stuff I don't want and miss USE flags, though.
I'd agree that the manual Gentoo install process, and those tinkering years in general, gave me experience and familiarity that's come in handy plenty of times when dealing with other distros, troubleshooting, working on servers, and so on.
Someone has set up an archive of that site; I visit it once in a while for a few nostalgic chuckles
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
Nixpkgs exposes a lot of options like that. You can override both options and dependencies and supply your own cflags if you really want.
According to this[0] study of the Ubuntu 16.04 package repos, 89% of all x86 code consisted of just 12 instructions (mov, add, call, lea, je, test, jmp, nop, cmp, jne, xor, and and -- in that order).
The extra issue here is that SIMD (the main optimization) simply sucks to use. Auto-vectorization has been mostly a pipe dream for decades now as the sufficiently-smart compiler simply hasn't materialized yet (and maybe for the same reason the EPIC/Itanium compiler failed -- deterministically deciding execution order at compile time isn't possible in the abstract and getting heuristics that aren't deceived by even tiny changes to the code is massively hard).
Doing SIMD means delving into x86 assembly and all its nastiness/weirdness/complexity. It's no wonder that devs won't touch it unless absolutely necessary (which is why the speedups are coming from a small handful of super-optimized math libraries). ARM vector code is also rather byzantine for a normal dev to learn and use.
We need a simpler option that normal programmers can easily learn and use. Maybe it's way less efficient than the current options, but some slightly slower SIMD is still going to generally beat no SIMD at all.
[0] https://oscarlab.github.io/papers/instrpop-systor19.pdf
Agner Fog's libraries make it pretty trivial for C++ programmers at least. https://www.agner.org/optimize/
The Highway library is exactly this kind of simpler option for using SIMD. Less efficient than hand-written assembly, but you can easily write good-enough SIMD for multiple different architectures.
The sufficiently smart vectoriser has been here for decades. Cuda is one. Uses all the vector units just fine, may struggle to use the scalar units.
This should create a lot more incentive for compiler devs to make use of the newer instructions. When everyone runs binaries compiled without support for optional instruction sets, why bother putting much effort into developing for them? It’ll be interesting to see whether more of a delta shows up going forward.
And application developers to optimize with them in mind?
I somehow have the memory that there was an extremely narrow time window where the speedup was tangible and quantifiable for Gentoo, as they were the first distro to ship some very early gcc optimisation. However it's open source software so every other distro soon caught up and became just as fast as Gentoo.
Would it make a difference if you compile the whole system vs. just the programs you want optimized?
As in, are there any common libraries or parts of the system that typically slow things down, or was this more suited to a time when hardware was more limited, so that improving everything would have made things feel faster in general?