This is quite good news, but it’s worth remembering that it’s a rare piece of software in the modern scientific/numerical world that can be compiled against the versions in distro package managers, since those versions can lag upstream by many months after a release.

If you’re doing that sort of work, you also shouldn’t use pre-compiled PyPI packages, for a similar reason - you leave a ton of performance on the table by not targeting the micro-architecture you’re running on.

My RSS reader trains a model every week or so and takes 15 minutes total with plain numpy, scikit-learn and all that. Intel MKL can do the same job in about half the time compared to the default BLAS. So you are looking at a noticeable performance boost, but a zero-bullshit install with uv is worth a lot. If I were interested in improving the model then yeah, I might need to train 200 of them interactively and I’d really feel the difference. Thing is, the model is pretty good as it is, and to make something better I’d have to think long and hard about what ‘better’ means.
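If anyone wants to check what their numpy is actually linked against before bothering with a custom BLAS, something like this works - note the ldd line pokes at a numpy-internal module, so the exact path can differ between numpy versions:

    # print the BLAS/LAPACK configuration numpy was built with
    python -c "import numpy; numpy.show_config()"
    # see which shared libraries actually get loaded at runtime (Linux)
    ldd "$(python -c 'import numpy; print(numpy.linalg._umath_linalg.__file__)')" | grep -iE 'blas|lapack|mkl'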

Out of interest, what reader is this? Sounds interesting

Most of the scientific numerical code I ever used had been in use for decades and would compile on a Unix variant released in 1992, let alone against distribution versions of dependencies that were only a year or two behind upstream.

Very true, but a lot of stuff builds on a few core optimized libraries like BLAS/LAPACK, and picking up a build of those targeted at a modern microarchitecture can give you 10x or more compared to a non-targeted build.

That said, most of those packages will just detect the hardware capabilities at runtime and dispatch an appropriate codepath anyway. You mainly save some code footprint by restricting the number of codepaths it needs to compile.
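For example, OpenBLAS picks its kernels from what the CPU reports. A quick way to see what your machine advertises, and (if you ever need to) override the dispatch - OPENBLAS_CORETYPE is an OpenBLAS-specific knob whose valid names come from its docs, and train.py is just a stand-in for whatever you run:

    # which SIMD extensions the CPU advertises (Linux)
    grep -o -w -E 'sse4_2|avx|avx2|avx512f|fma' /proc/cpuinfo | sort -u
    # force OpenBLAS to use a particular kernel family
    OPENBLAS_CORETYPE=Haswell python train.py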

I mean that’s just lucky and totally depends on your field and what’s normal there - just as an example, we used the LLNL SUNDIALS package for implicit time integration. On Ubuntu 24.04 the packaged version is 6.4.1, while the latest upstream release is v7.5.0. We found their major version releases tended to require code changes.

There’s also the difference between being able to run and being able to run optimised. At least as of 5 years ago, the Ubuntu/Debian builds of FFTW didn’t include the parallelised OpenMP library.
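For what it's worth, building FFTW yourself with the OpenMP/threads pieces and the SIMD codelets turned on is only a couple of lines - these flag names are from the FFTW 3.3.x configure script, so check ./configure --help for your version:

    ./configure --enable-openmp --enable-threads --enable-avx2 CFLAGS="-O3 -march=native"
    make -j"$(nproc)" && make install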

In a past life I did HPC support, and I recommend the Spack package manager a lot to people working in this area, because you can get optimised builds with whatever compiler toolchain and options you need quite easily that way.
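As a rough sketch of what that looks like (the +openmp variant and the target name are just examples - the real variant names depend on the package recipe, and spack spec shows you what you'd get before committing):

    # preview the concretised build, then install it
    spack spec fftw +openmp %gcc@13 target=zen3
    spack install fftw +openmp %gcc@13 target=zen3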

Yup, if you're using OpenCV for instance, compiling it yourself instead of using pre-built binaries can result in 10x or more speed-ups once you take into account AVX/threading/math/BLAS libraries etc...
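Roughly the kind of build I mean - the CMake option names here are from memory of recent OpenCV releases, so run cmake -LA against your checkout to see what your version actually exposes:

    cmake -DCMAKE_BUILD_TYPE=Release \
          -DCPU_BASELINE=AVX2 -DCPU_DISPATCH=AVX512_SKX \
          -DWITH_TBB=ON -DWITH_IPP=ON -DENABLE_FAST_MATH=ON \
          ../opencv
    make -j"$(nproc)"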

Yup. The irony is that the packages which are difficult to build are the ones that most benefit from custom builds.

Thanks for sharing this. I'd love to learn more about micro-architectures and instruction sets - would you have any recommendations for books or sources that would be a good starting place?

My experience is mostly practical really - the trick is to learn how to compile stuff yourself.

If you do a typical "cmake . && make install" then you will often miss compiler optimisations. There's no standard across different packages, so you often have to dig into the internals of the build system, look at the options provided, and experiment.
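As a starting point, just asking CMake for a Release build and passing your own flags already helps a lot - these are generic CMake variables, nothing package-specific:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_C_FLAGS="-O3 -march=native" -DCMAKE_CXX_FLAGS="-O3 -march=native"
    cmake --build build -j && cmake --install build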

Typically if you compile a C/C++/Fortran .cpp/.c/.fXX file by hand, you have to supply arguments to instruct the use of specific instruction sets. -march=native typically means "compile this binary to use the maximum set of SIMD instructions that my current machine supports", but you can get quite granular with things like "-msse4.2 -mavx -mavx2" (or a generic level such as -march=x86-64-v3) for either compatibility reasons or to try out subsets.
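If you're curious what -march=native actually turns on for your machine, gcc will tell you (clang has a similar facility, but the output looks different):

    # list the target flags that -march=native implies on this CPU
    gcc -march=native -Q --help=target | grep -E 'march|mtune|mavx|msse|mfma'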

I wonder who downvoted this. The juice you are going to get from building your core applications and libraries to suit your workload is going to be far larger than the small improvements available from microarchitectural targeting. For example, on Ubuntu I have some ETL pipelines that need libxml2. Linking it statically into the application cuts the ETL runtime by 30%. Essentially none of the practices of Debian/Ubuntu Linux are what you'd choose for efficiency. Their practices are designed around some pretty old and arguably obsolete ideas about ease of maintenance.
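For the record, the static-link version of that is roughly the following - etl.c is a stand-in for the real application, and the trailing -lz -llzma -lm are just what my libxml2 build happened to need, so let pkg-config --static tell you the real list for yours:

    # build libxml2 as a static archive, then link it straight into the app
    ./configure --enable-static --disable-shared --prefix="$HOME/opt/libxml2"
    make -j"$(nproc)" && make install
    gcc -O3 -march=native etl.c -I"$HOME/opt/libxml2/include/libxml2" \
        "$HOME/opt/libxml2/lib/libxml2.a" -lz -llzma -lm -o etl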