> No one is going to run models that are comparable to frontier locally without spending enormous sums for use at scale
You can always run these models cheaper locally if you're willing to compromise on total throughput and speed of inference. For most end-user or small-scale business needs, you don't really need a lot of either.
It would be awful if running models locally became the primary way of using LLMs. On dedicated servers sharing GPUs across requests, energy usage and environmental impact are way lower overall than if everyone and their mother suddenly needed beefy GPUs. It’s the equivalent of everyone commuting alone in their own car instead of a train picking up hundreds at once.
You can batch requests when running locally too, as long as the model's KV-cache requirements are low enough; you're essentially targeting the same resource efficiencies the big providers rely on. This is useful because batching gives you extra decode throughput almost "for free", even on very limited hardware.
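As a minimal sketch of what that looks like with Hugging Face transformers (the model name, prompts, and generation settings here are just illustrative placeholders, not a recommendation):

```python
# Batched local decoding: one forward pass per decode step serves all prompts at once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small local causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Summarize: the quick brown fox jumps over the lazy dog.",
    "Translate to French: good morning.",
    "List three uses for a paperclip.",
]

# Left-pad so every sequence ends at the same position before generation starts.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# The extra prompts mostly reuse compute that a single request would leave idle.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```

The larger the batch, the more the fixed cost of each decode step gets amortized, up to the point where the combined KV-cache no longer fits in memory.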
That’s still orders of magnitude less efficient, and it’s also not how most people use AI, or probably ever will.
It's even more awful if the compute capital is owned by only a handful of players.
Maybe people would target their use more appropriately, then.
Just like people would drive their cars as little as possible out of concern for the environment...?