This reminds me of the famous high-performance FizzBuzz answer on Code Golf [1]. If we could apply optimizations like that to inference, maybe we could speed it up 10x or more.
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
I love scrolling and reading through this, thinking: yeah, of course Python is slower than Java; oh wow, Rust is pretty much on par, I wonder what the Java devs did. Then you hit asm and your jaw drops.
Check out cpp at 208.3 GiB/s, 3x faster than asm.
Yeah, because (and here's the trick) they are clever and do less work.
Optimizing things usually means "think of a way to do the same thing with less effort".
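A toy sketch of "same thing, less work", in Python. To be clear, this is not the technique from the linked answers, just an illustration, and the names (fizzbuzz_naive, fizzbuzz_pattern, PATTERN) are made up for the example. The idea is that the Fizz/Buzz pattern repeats every 15 numbers, so you can work it out once and stop doing divisibility tests in the loop.

```python
from itertools import cycle


def fizzbuzz_naive(n):
    """Baseline: up to three modulo tests for every number."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


# The pattern for 1..15, worked out once by hand.
# None marks a slot where the number itself is printed.
PATTERN = [
    None, None, "Fizz", None, "Buzz",
    "Fizz", None, None, "Fizz", "Buzz",
    None, "Fizz", None, None, "FizzBuzz",
]


def fizzbuzz_pattern(n):
    """Same output, but no divisibility tests inside the loop."""
    # cycle() repeats the 15-slot pattern forever; zip() stops at n.
    return [word or str(i) for i, word in zip(range(1, n + 1), cycle(PATTERN))]


if __name__ == "__main__":
    # Both versions must agree before any speed comparison means anything.
    assert fizzbuzz_pattern(1000) == fizzbuzz_naive(1000)
    print("\n".join(fizzbuzz_pattern(15)))
```

Whether this is actually faster in Python depends on the interpreter, so measure before believing it. The point is only the shape of the change: the output stays identical while work moves out of the hot loop.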
Hire the laziest programmer :)