Function multiversioning would require indirect jumps/indirect calls, wouldn't it? Separate executables can do static jumps/calls.
On Linux it uses IFUNC, which is resolved at load time during dynamic relocation, so at runtime it costs the same as any other (relocatable) function call. The calls are "static" in the sense that the target is fixed once relocation is done rather than being a calculated address, so they're pretty easy for a superscalar CPU to follow.
So it does have some limitations like not being inlined, same as any other external function.
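For anyone who hasn't seen one, here is a minimal sketch of a hand-written IFUNC, assuming GCC or Clang on x86-64 Linux with glibc (anything else falls back to a plain function). The names sum_bytes, sum_bytes_generic, sum_bytes_avx2 and resolve_sum_bytes are made up for illustration; only the ifunc and target attributes and the __builtin_cpu_* builtins are real compiler features.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable baseline implementation (hypothetical example routine). */
static uint32_t sum_bytes_generic(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
/* Same loop, but the compiler is allowed to use AVX2 when vectorizing it. */
__attribute__((target("avx2")))
static uint32_t sum_bytes_avx2(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Resolver: the dynamic loader calls this when it resolves sum_bytes and
 * records the returned address, so later calls need no per-call dispatch. */
static uint32_t (*resolve_sum_bytes(void))(const uint8_t *, size_t)
{
    __builtin_cpu_init(); /* needed before __builtin_cpu_supports in a resolver */
    return __builtin_cpu_supports("avx2") ? sum_bytes_avx2 : sum_bytes_generic;
}

/* The exported symbol; its address is whatever the resolver returned. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
    __attribute__((ifunc("resolve_sum_bytes")));
#else
/* Assumed fallback for platforms without IFUNC support. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    return sum_bytes_generic(p, n);
}
#endif
```

Callers just write sum_bytes(buf, len); the call site looks like a call to any other external function, which is also why it can't be inlined.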
Since TEXTREL is basically gone these days (for good reasons!), IFUNC is the same as any other call that is relocatable to a target not in the same DSO. Which is either a GOT or PLT, either of which ends up being an indirect call (or branch if the compiler feels like it and the PLT isn’t involved). Which is what the person you’re replying to said :)
A relocatable call within the same DSO can be a PC-relative relocation, which is not a relocation at all when you load the DSO and ends up as a plain PC-relative branch or call.
Sure, but they're already paying that cost for every non-static function anyway. Any DSO, or executable that allows function interposition, already pays.
Ideally you should just multiversion the topmost exported symbol; everything below that should either be inlined directly or, since the architecture variant is known statically by the compiler, have per-variant copies generated and called directly. I know at least GCC can do this kind of variant generation for things like constant propagation across static function boundaries, so I /assume/ it can do the same for other optimization variants like this, but admittedly I haven't checked.
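A rough sketch of "multiversion only the top-level symbol", assuming GCC (or a recent Clang) on x86-64 Linux with glibc. The target_clones attribute is a real compiler feature; add_bias_image and its helpers are hypothetical names. The helpers are static inline, so the compiler should be able to absorb them into each clone of the exported function, though as noted above that is an expectation rather than something verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
/* Emit an AVX2 clone and a baseline clone, dispatched via IFUNC. */
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
/* Assumed fallback: no multiversioning, single baseline build. */
#define MULTIVERSION
#endif

/* Internal helper: never exported, so it can be inlined into each clone. */
static inline uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Internal helper: adds a bias to one row of pixels, saturating at 0..255. */
static inline void add_bias_row(uint8_t *row, size_t w, int bias)
{
    for (size_t x = 0; x < w; x++)
        row[x] = clamp_u8(row[x] + bias);
}

/* The only multiversioned symbol: one indirect call per image,
 * not one per row or per pixel. */
MULTIVERSION
void add_bias_image(uint8_t *img, size_t w, size_t h, size_t stride, int bias)
{
    assert(img != NULL && stride >= w);
    for (size_t y = 0; y < h; y++)
        add_bias_row(img + y * stride, w, bias);
}
```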
What about duplicating the entire executable essentially a few times, and jumping to the right version at the very beginning of execution?
You have bigger binaries, but the logistics are simplified compared to shipping multiple binaries and you should get the same speed as multiple binaries with fully inlined code.
Since they don't seem to be doing that, my question is: what's the caveat I'm missing? (Or are the bigger binaries enough of a caveat by themselves?)
There's no need to do any of that, a table of function pointers to DSP functions works fine.
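To make the function-pointer-table idea concrete, here is a minimal sketch in plain C. DSPContext, dsp_init, CPU_FLAG_FAST and the two sad functions are all hypothetical names, and the "fast" version is only an unrolled stand-in for what would be a SIMD routine in a real codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU feature flag; real code would derive this from CPUID etc. */
#define CPU_FLAG_FAST 1u

/* Table of DSP entry points, filled in once at startup. */
typedef struct DSPContext {
    uint32_t (*sad)(const uint8_t *a, const uint8_t *b, size_t n);
} DSPContext;

static inline uint32_t absdiff(uint8_t x, uint8_t y)
{
    return x > y ? (uint32_t)(x - y) : (uint32_t)(y - x);
}

/* Reference C version: sum of absolute differences. */
static uint32_t sad_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += absdiff(a[i], b[i]);
    return s;
}

/* Stand-in for a SIMD version: four pixels per iteration plus a scalar tail. */
static uint32_t sad_unrolled4(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += absdiff(a[i], b[i]) + absdiff(a[i + 1], b[i + 1]) +
             absdiff(a[i + 2], b[i + 2]) + absdiff(a[i + 3], b[i + 3]);
    for (; i < n; i++)
        s += absdiff(a[i], b[i]);
    return s;
}

/* Start from the C versions, then override whatever the CPU supports. */
void dsp_init(DSPContext *c, unsigned cpu_flags)
{
    assert(c != NULL);
    c->sad = sad_c;
    if (cpu_flags & CPU_FLAG_FAST)
        c->sad = sad_unrolled4;
}
```

The hot loops then call c->sad(...), which is one indirect call per block of work; everything inside each implementation is free to be inlined and vectorized.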
It can be useful to duplicate the entire code for 8-bit vs 10-bit pixels because that does affect nearly everything.
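One common way to get that duplication in C is to stamp out the same code once per pixel type with a macro (or by including the same source file twice). The sketch below is an illustration under that assumption, not any particular codec's code; DEFINE_AVG_ROW, avg_row_fn and select_avg_row are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generates avg_row_<bits>() for a given pixel storage type. The body is
 * identical; only the element type (and so the generated code) differs. */
#define DEFINE_AVG_ROW(bits, pixel)                                         \
    static void avg_row_##bits(void *dstv, const void *av, const void *bv,  \
                               size_t n)                                    \
    {                                                                       \
        pixel *dst = dstv;                                                  \
        const pixel *a = av, *b = bv;                                       \
        for (size_t i = 0; i < n; i++)                                      \
            dst[i] = (pixel)((a[i] + b[i] + 1) >> 1); /* rounded average */ \
    }

DEFINE_AVG_ROW(8, uint8_t)   /* 8-bit pixels stored in bytes */
DEFINE_AVG_ROW(10, uint16_t) /* 10-bit pixels stored in 16-bit words */

typedef void (*avg_row_fn)(void *dst, const void *a, const void *b, size_t n);

/* Pick the variant once per stream, not once per pixel. */
avg_row_fn select_avg_row(int bit_depth)
{
    assert(bit_depth == 8 || bit_depth == 10);
    return bit_depth == 8 ? avg_row_8 : avg_row_10;
}
```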
Ideally you only need to duplicate until you hit the first not-inlined function call; at that point there’s nothing gained and it’s just a waste of binary size.