Hacker News

About running models locally and why data centers win (for now): they can stream the model weights to many neural engines at the same time, so each of these only needs enough RAM to hold the KV cache. So each engine is cheaper to operate, plus they are time-shared, resulting in massive wins for data centers.

So one can see businesses owning their own such cluster, next to their database infra, in the near future.