It's not certain this is the future: the obvious trade off is lack of flexibility: not only when a new model comes out, but also varying demand in the data centers - one day people want more LLM queries, another day more diffusion queries. Aaand, this blocks the holly grail of self improving models, beyond in-context learning. A realistic use case? More efficient vision based drone targeting in Ukraine/Taiwan/ whatevers next. That's the place where energy efficiency, processing speed, and also weight is most critical. Not sure how heavy ASICS are though, bit they should be proportional to the model size. I heard many complaints about onboard AI 'not being there yet', and this may change it. Not listing middle east as there is no serious jamming problem there.
In a not-too-distant future (5 years?) small LLMs will be good enough to be used as generic models for most tasks. And if you have a dedicated ASIC small enough to fit in an iPhone, you have a truly local AI device with the bonus point that you get something really new to sell in every new generation (i.e. acces to an even more powerful model)
The Taalas approach is much more expensive than the NPU that phones already have.
Yes but not in five years. The chips will be dirt cheap by then. We‘ll get “intelligent” washing machines that will discuss the amount of detergent and eventually berate us. Toasters with voice input. And really annoying elevators. Also bugs that keep an extremely low RF profile (only phoning home when the target is talking business).
No, Taalas requires more silicon which will always cost more than storing weights in DRAM.
it doesn’t need to go in the phone if it only takes a few milliseconds to respond and is cheap
Perceptible latency is somewhere between 10 and 100ms. Even if an LLM was hosted in every aws region in the world, latency would likely be annoying if you were expecting near-realtime responses (for example, if you were using an llm as autocomplete while typing). If, say, apple had an LLM on a chip any app could use some SDK to access, it could feasibly unlock a whole bunch of usecases that would be impractical with a network call.
Also, offline access is still a necessity for many usecases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
https://www.cloudping.co/
It does if you care about who can access to your tokens
The real benefit, to a very particular type of mind, is that the alignment will be baked in ( presumably a lot robust than today ) and wrongthink will be eliminated once and for all. It will also help flagging anyone, who would need anything as dangerous as custom, uncensored models. Win/win.
To your point, its neat tech, but the limitations are obvious since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
It doesn't have be to true for all models to be useful. Thinking about small models running on phones or edge devices deployed in the field that would be a perfect use case for a "printed model".