Centralized inference is more economically efficient⁰, and should be cheaper for most users once competition squeezes the air out of token prices. Local inference remains very valid for anyone who wants to maintain their privacy, ofc.

0: Because the only way to get any cache locality out of an LLM is to batch invocations: the weights get streamed from memory once per forward pass and reused across every request in the batch. A centralized server handling thousands of invocations at the same time therefore needs only a tiny fraction of the total memory bandwidth that running each of those invocations locally on separate machines would.
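
To make that concrete, here's a rough back-of-envelope sketch in Python. The model size, precision, and batch sizes are made-up illustrative numbers, not measurements, and it ignores KV-cache traffic (which does scale with batch size):

    # Decoding one token requires streaming (roughly) all model weights
    # through memory once per forward pass. Batching lets that single
    # weight read serve many requests at once.
    # All numbers below are illustrative assumptions.

    weight_bytes = 70e9 * 2  # hypothetical 70B-param model at 2 bytes/param (fp16)

    for batch_size in (1, 64, 1024):
        # weight traffic attributable to each generated token
        bytes_per_token = weight_bytes / batch_size
        print(f"batch {batch_size:>4}: ~{bytes_per_token / 1e9:.2f} GB of weight reads per token")

    # batch    1: ~140 GB per token  (every local machine pays this alone)
    # batch 1024: ~0.14 GB per token (a centralized server amortizes it)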