In a not-too-distant future (5 years?) small LLMs will be good enough to be used as generic models for most tasks. And if you have a dedicated ASIC small enough to fit in an iPhone, you have a truly local AI device with the bonus point that you get something really new to sell in every new generation (i.e. acces to an even more powerful model)
The Taalas approach is much more expensive than the NPU that phones already have.
Yes but not in five years. The chips will be dirt cheap by then. We‘ll get “intelligent” washing machines that will discuss the amount of detergent and eventually berate us. Toasters with voice input. And really annoying elevators. Also bugs that keep an extremely low RF profile (only phoning home when the target is talking business).
No, Taalas requires more silicon which will always cost more than storing weights in DRAM.
it doesn’t need to go in the phone if it only takes a few milliseconds to respond and is cheap
Perceptible latency is somewhere between 10 and 100ms. Even if an LLM was hosted in every aws region in the world, latency would likely be annoying if you were expecting near-realtime responses (for example, if you were using an llm as autocomplete while typing). If, say, apple had an LLM on a chip any app could use some SDK to access, it could feasibly unlock a whole bunch of usecases that would be impractical with a network call.
Also, offline access is still a necessity for many usecases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
https://www.cloudping.co/
It does if you care about who can access to your tokens