> It is also a bit weird that they are not incorporating speculative decoding
Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?
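In the compute-bound regime (high batch size), yes. But at low batch sizes, where decoding is memory-bandwidth-bound, verifying several drafted tokens costs about the same as generating one, so it could actually improve throughput.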
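To make the batch-size dependence concrete, here is a back-of-the-envelope sketch, not a benchmark. It assumes each drafted token is accepted independently with probability `alpha`, that the draft model proposes `k` tokens per step, and that a target forward pass costs a flat amount until the GPU saturates at some `saturation_batch`, then scales linearly with the number of tokens processed. All names and default values (`alpha`, `k`, `saturation_batch`, `draft_cost`) are made up for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def target_pass_cost(tokens_per_seq: int, batch: int, saturation_batch: int) -> float:
    """Toy roofline (an assumption): cost of one target forward pass in units
    of a memory-bound pass. Flat until saturation, then linear in tokens."""
    return max(1.0, tokens_per_seq * batch / saturation_batch)


def speedup(alpha: float, k: int, batch: int,
            saturation_batch: int = 64, draft_cost: float = 0.05) -> float:
    """Throughput of speculative decoding relative to plain decoding.
    draft_cost is the assumed cost of one draft pass relative to one
    plain target pass at the same batch size."""
    baseline = target_pass_cost(1, batch, saturation_batch)  # 1 token per pass
    spec = (k * draft_cost * baseline
            + target_pass_cost(k + 1, batch, saturation_batch))
    return expected_tokens_per_step(alpha, k) * baseline / spec


if __name__ == "__main__":
    for batch in (1, 8, 64, 256):
        print(batch, round(speedup(alpha=0.8, k=4, batch=batch), 2))
```

With these made-up numbers the ratio comes out above 1 at batch size 1 and below 1 at batch size 256, which matches the intuition: once the target model is compute-bound, the rejected draft tokens are wasted FLOPs that could have served other requests.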