Hacker News

> It is also a bit weird that they are not incorporating speculative decoding

Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?

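In the compute-bound regime (high batch size), yes. But at low batch sizes decoding is memory-bandwidth bound, so verifying a few drafted tokens in one pass is nearly free and speculative decoding can improve throughput too.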
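
A rough back-of-the-envelope sketch of that trade-off, using a simple roofline model. Every number here is an assumption for illustration (a 7B fp16 model, 1 TB/s memory bandwidth, 200 TFLOP/s peak, 80% per-token acceptance, a draft model costing 5% of a target pass), and the function names are made up; it is not a measurement of any real system.

```python
def step_time(tokens, weight_bytes=14e9, bandwidth=1e12,
              flops_per_token=14e9, peak_flops=2e14):
    """Roofline estimate (seconds) for one forward pass over `tokens` tokens.

    All hardware/model numbers are illustrative assumptions.
    """
    memory_time = weight_bytes / bandwidth               # weights are read once per pass
    compute_time = tokens * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def expected_tokens(k, accept):
    """Expected tokens emitted per verify step with k drafted tokens.

    Assumes each drafted token is accepted independently with
    probability `accept` (must be < 1); k = 0 gives exactly 1.
    """
    return (1 - accept ** (k + 1)) / (1 - accept)


def throughput(batch, k=0, accept=0.8, draft_ratio=0.05):
    """Tokens per second across the whole batch; k = 0 is plain decoding."""
    verify = step_time(batch * (k + 1))                  # target model checks k+1 positions per sequence
    draft = k * draft_ratio * step_time(batch)           # assumed cost of k draft-model passes
    return batch * expected_tokens(k, accept) / (verify + draft)


def speedup(batch, k=4):
    """Throughput with speculation relative to plain decoding."""
    return throughput(batch, k) / throughput(batch, 0)


if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        print(f"batch={batch:4d}  speedup={speedup(batch):.2f}x")
```

With these made-up parameters the speedup comes out above 1 at batch size 1 and below 1 at batch size 512, which is the crossover described above: once the verify pass itself is compute bound, the extra drafted positions cost real time and the rejected ones are wasted work.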