What about duplicating essentially the entire executable a few times, and jumping to the right version at the very beginning of execution?
You'd end up with a bigger binary, but the logistics are simpler than shipping multiple binaries, and you should get the same speed as separate binaries with fully inlined code.
Since they don't seem to be doing that, my question is: what's the caveat I'm missing? (Or are the bigger binaries enough of a caveat by themselves?)
There's no need to do any of that; a table of function pointers to DSP functions works fine.
It can be useful to duplicate the entire code for 8-bit vs 10-bit pixels, because the bit depth affects nearly everything.
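
For anyone who hasn't seen the function-pointer-table approach, here is a minimal sketch in C. All the names (DSPContext, dsp_init, CPU_FLAG_FAST, sad_c, sad_fast) are made up for illustration and not taken from any particular project, and sad_fast is just an unrolled loop standing in for a hand-written SIMD routine. The point is only that the choice is made once at startup, and the hot loops then go through the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU feature flag; a real project would detect this at startup. */
#define CPU_FLAG_FAST 1u

/* Table of DSP entry points, filled in once during initialization. */
typedef struct DSPContext {
    unsigned (*sad)(const uint8_t *a, const uint8_t *b, size_t n);
} DSPContext;

static inline unsigned absdiff(uint8_t x, uint8_t y)
{
    return x > y ? (unsigned)(x - y) : (unsigned)(y - x);
}

/* Portable reference: sum of absolute differences. */
static unsigned sad_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += absdiff(a[i], b[i]);
    return sum;
}

/* Stand-in for an optimized version: 4x unrolled, same result as sad_c. */
static unsigned sad_fast(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += absdiff(a[i],     b[i]);
        s1 += absdiff(a[i + 1], b[i + 1]);
        s2 += absdiff(a[i + 2], b[i + 2]);
        s3 += absdiff(a[i + 3], b[i + 3]);
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += absdiff(a[i], b[i]);
    return s0 + s1 + s2 + s3;
}

/* Start from the portable versions, then override what the CPU supports. */
static void dsp_init(DSPContext *ctx, unsigned cpu_flags)
{
    assert(ctx != NULL);
    ctx->sad = sad_c;
    if (cpu_flags & CPU_FLAG_FAST)
        ctx->sad = sad_fast;
}
```

Callers just do ctx.sad(a, b, n). The cost is one indirect call per DSP function, which is usually small next to the work done inside it.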
Ideally you only need to duplicate until you hit the first non-inlined function call; beyond that point there's nothing to gain, and further duplication is just a waste of binary size.
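
A rough sketch of what "duplicate until the first non-inlined call" can look like for the 8-bit vs 10-bit case, using a macro as a poor man's template. The names here (DEFINE_ADD_BIAS, add_bias_8, add_bias_10, clip_pixel) are invented for the example, and the noinline attribute assumes GCC or Clang. The per-depth loop is stamped out twice, but the helper it calls exists only once, because a second copy of it would buy nothing.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Shared helper, deliberately not inlined: it is identical for both
 * bit depths, so one copy in the binary is enough. */
static NOINLINE int clip_pixel(int v, int max)
{
    return v < 0 ? 0 : (v > max ? max : v);
}

/* Stamps out one copy of the loop per pixel type / bit depth. */
#define DEFINE_ADD_BIAS(NAME, PIXEL, BITS)                              \
    static void NAME(PIXEL *px, size_t n, int bias)                     \
    {                                                                   \
        for (size_t i = 0; i < n; i++)                                  \
            px[i] = (PIXEL)clip_pixel(px[i] + bias, (1 << (BITS)) - 1); \
    }

DEFINE_ADD_BIAS(add_bias_8,  uint8_t,  8)   /* 8-bit pixels, max 255   */
DEFINE_ADD_BIAS(add_bias_10, uint16_t, 10)  /* 10-bit pixels, max 1023 */
```

In real code the duplicated part would be the depth-specific inner loops, and anything behind an out-of-line call would be shared like clip_pixel is here.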