Hacker News

My argument is predicated on the assumption that mainstream hardware manufacturers will copy the way Apple and Framework have made system memory usable for inference.

In that world, a) we are already at or close to having enough memory in local devices to run inference locally, and b) that memory isn't inference-specific and can be used for other things. Most devices ship with enough memory for some level of inference, and some ship with plenty (e.g., a gaming desktop probably has 32GB+ of RAM).
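To put rough numbers on "enough memory", here is a back-of-the-envelope sketch. It counts only the model weights (parameter count times bits per weight) and ignores KV cache and runtime overhead. The `headroom` fraction is my own assumption for what the OS and other apps need, not a measured figure.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_ram(params_billion: float, bits_per_weight: float,
                ram_gib: float, headroom: float = 0.5) -> bool:
    """True if the weights fit in the share of RAM not reserved as headroom.

    headroom is an assumed fraction of RAM kept free for the OS, other
    applications, and the KV cache; tune it for your own machine.
    """
    budget_gib = ram_gib * (1 - headroom)
    return weight_gib(params_billion, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    # An 8B-parameter model quantized to 4 bits on a 16 GiB machine.
    print(round(weight_gib(8, 4), 2), fits_in_ram(8, 4, 16))
    # A 70B-parameter model at 4 bits on a 32 GiB machine.
    print(round(weight_gib(70, 4), 2), fits_in_ram(70, 4, 32))
```

Under those assumptions a small quantized model fits comfortably alongside everything else, while a 70B-class model does not fit on a 32GB box even at 4 bits.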

You aren't going to run Kimi on it, but I think the reality is that a lot of consumer inference doesn't need a model of that class. A lot of it is going to be soft queries that are easily answered by a search API, so the LLM really just needs to be able to skim and summarize the results. Going a step further, we may even see some kind of hybrid approach, where a local OpenRouter kind of thing decides whether a task is soft enough to do locally with models that fit in RAM or needs to be farmed out to a PaaS provider.
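A minimal sketch of what that routing decision could look like. Everything here is hypothetical: the `Task` type, the `SOFT_KINDS` set, and the 8192-token local context limit are placeholders I made up for illustration, and a real router would presumably classify tasks with something smarter than a label lookup.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    kind: str  # hypothetical label, e.g. "summarize", "search_answer", "proof"


# Assumed set of "soft" task kinds that a small local model handles well.
SOFT_KINDS = {"summarize", "search_answer", "rewrite"}


def route(task: Task, local_context_tokens: int = 8192) -> str:
    """Return "local" for soft tasks that fit the local model, else "remote"."""
    # Rough rule of thumb: about 4 characters per token for English text.
    approx_tokens = len(task.prompt) // 4
    if task.kind in SOFT_KINDS and approx_tokens <= local_context_tokens:
        return "local"
    return "remote"


if __name__ == "__main__":
    print(route(Task("Summarize these search results: ...", "summarize")))
    print(route(Task("Prove this lemma: ...", "proof")))
```

The default is to send anything the router isn't sure about to the remote provider, so a misclassification costs money rather than answer quality.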