It takes a substantial amount of time when emitting lots of numbers in JSON, which is a very common workload.

And this algorithm has low constant overhead, and it does not take dramatically more icache than the simple versions. There is no reason not to use it if your compile target supports AVX-512.
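If the compile target can't be assumed to have AVX-512, one common option is to check at runtime and fall back to a simple version. Here is a rough sketch of that dispatch in Rust, not the algorithm from the article. The names `has_avx512f`, `format_scalar` and `format_u32` are made up for illustration, and the AVX-512 branch is only a placeholder that reuses the scalar routine.

```rust
// Hypothetical helper: reports whether the running CPU has AVX-512F.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx512f() -> bool {
    std::is_x86_feature_detected!("avx512f")
}

// On non-x86 targets there is no AVX-512, so always use the fallback.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx512f() -> bool {
    false
}

// Simple digit-by-digit conversion, written right to left into `buf`.
// A u32 has at most 10 decimal digits, so a 10-byte buffer is enough.
fn format_scalar(mut n: u32, buf: &mut [u8; 10]) -> &str {
    let mut i = buf.len();
    loop {
        i -= 1;
        buf[i] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    // Only ASCII digits were written, so this is always valid UTF-8.
    std::str::from_utf8(&buf[i..]).unwrap()
}

// Dispatch point. ASSUMPTION: a real AVX-512 routine would be called in the
// first branch; this sketch has none, so both branches use the scalar path.
fn format_u32(n: u32, buf: &mut [u8; 10]) -> &str {
    if has_avx512f() {
        // Placeholder for a vectorized implementation.
        format_scalar(n, buf)
    } else {
        format_scalar(n, buf)
    }
}
```

In practice you would probably do the detection once and cache a function pointer, so the check doesn't sit in the hot loop.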

Isn't AVX-512 one of the more poorly supported instruction set extensions?

It's on every AMD CPU from Zen 4 onwards, every remotely recent Intel server CPU, and now again on Intel starting with Nova Lake this year.

In the future, it will be everywhere.