I'm surprised people are surprised. Of course this is possible, and of course this is the future. It has been demonstrated already: why do you think we even have GPUs at all?! Because we made this exact same transition, from running in software to largely running in hardware, for all 2D and 3D computer graphics. And these LLMs are practically the same math. It's all obvious and inevitable if you pay attention to how we got the hardware we have.
I believe this is a CPU/GPU vs ASIC comparison, rather than CPU vs GPU. They have always(ish) coexisted, being optimized for different things: ASICs have cost/speed/power advantages, but the design is more difficult than writing a computer program, and you can't reprogram them.
Generally, you use an ASIC to perform a specific task. In this case, I think the takeaway is that the LLM functionality here is performance-sensitive and has enough utility as-is to justify an ASIC.
It reminds me of the switch from GPUs to ASICs in bitcoin mining. I've been expecting this to happen.
But the BTC mining algorithm has not and will not change. That's the only reason ASICs at least make a bit of sense for crypto.
The assumption that AI means static weights is already challenged by the frequent model updates we see today - and it may become a relic entirely once we find a new architecture.
We can expect the model landscape to consolidate some day. Progress will become slower, innovations will become smaller. Not tomorrow, not next year, but the time will come.
And then it'll increasingly make sense to build such a chip into laptops, smartphones, wearables. Not for high-end tasks, but to drive the everyday bread-and-butter tasks.
The world continues to evolve in a way that requires flexibility, not more constraints. I just fail to see a future where we want fewer general-purpose computers and more hard-wired ones. Would be interesting to be proven wrong though!
Sounds to me like there's potential to use these for established models to provide a cost/scale advantage, while frontier models run on the existing setup.
IME llama et al. require LoRA or fine-tuning to be usable. That's their real value vs closed-source massive models, and their small size makes this possible, appealing, and doable on a recurring basis as things evolve. Again, rendering ASICs useless.
Read the blog post. It mentions that their chip has a small SRAM which can store a LoRA adapter.
Neither the blog nor Taalas' original post specifies what speed to expect when using the SRAM in conjunction with the baked-in weights. To be taken seriously, that really needs to be explained in detail, not mentioned in passing.
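For context on why a small SRAM can suffice: a LoRA adapter only stores two small low-rank matrices per adapted layer, while the frozen base weights (here, the ones baked into the silicon) stay untouched. A minimal numpy sketch, with all shapes hypothetical:

```python
import numpy as np

d, r = 4096, 16  # hypothetical hidden size and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight (the "baked into gates" part)
A = rng.standard_normal((r, d)) * 0.01   # small adapter matrices -- these two are all
B = rng.standard_normal((d, r)) * 0.01   # the on-chip SRAM would need to hold

x = rng.standard_normal(d)

# Adapted forward pass: base output plus a low-rank correction.
y = W @ x + B @ (A @ x)

# Adapter storage is 2*d*r parameters vs d*d for the full matrix:
print(2 * d * r / (d * d))  # → 0.0078125, under 1% of the full weights
```

At these (made-up) sizes the adapter is under 1% of the layer's parameters, which is why storing it off to the side is plausible; the open question the comment raises is what the extra SRAM reads cost per token.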
Heh, I said this exact thing in another thread the other day. Nice to see I wasn't the only one thinking it.
The middle ground here would be an FPGA, but I believe you would need a very expensive one to implement an LLM on it.
FPGAs would be less efficient than GPUs.
FPGAs don't scale; if they did, all GPUs would've been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn't make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here is TPUs, which take the most efficient parts of a "GPU" for these workloads but still rely on memory access at every step of the computation.
I thought it was because the number of logic elements in a GPU is orders of magnitude higher than in an FPGA, rather than just processing speed. And GPU processing is inherently parallel, so the GPU beats the FPGA just on transistor count.
"This has been demonstrated already…"
I think burning the weights into the gates is kinda new.
("Weights to gates." "Weighted gates"? "Gated weights"?)
Is this not effectively the same thing as a Bitcoin ASIC?
Geights? Wates?
gweights
Not really new; this is the '80s-'90s neuron MOS transistor.
It's also not that different from how TPUs work, where they have special registers in their PEs for weights.
It's not certain this is the future. The obvious trade-off is lack of flexibility: not only when a new model comes out, but also with varying demand in the data centers - one day people want more LLM queries, the next day more diffusion queries. And it blocks the holy grail of self-improving models, beyond in-context learning.

A realistic use case? More efficient vision-based drone targeting in Ukraine, Taiwan, or whatever's next. That's where energy efficiency, processing speed, and also weight are most critical. Not sure how heavy these ASICs are, but they should be proportional to the model size. I've heard many complaints about onboard AI 'not being there yet', and this may change that. Not listing the Middle East, as there is no serious jamming problem there.
In a not-too-distant future (5 years?), small LLMs will be good enough to be used as generic models for most tasks. And if you have a dedicated ASIC small enough to fit in an iPhone, you have a truly local AI device, with the bonus that you get something really new to sell in every generation (i.e. access to an even more powerful model).
The Taalas approach is much more expensive than the NPU that phones already have.
Yes, but not in five years. The chips will be dirt cheap by then. We'll get "intelligent" washing machines that will discuss the amount of detergent and eventually berate us. Toasters with voice input. And really annoying elevators. Also bugs that keep an extremely low RF profile (only phoning home when the target is talking business).
No, Taalas requires more silicon, which will always cost more than storing weights in DRAM.
It doesn't need to go in the phone if it only takes a few milliseconds to respond and is cheap.
Perceptible latency is somewhere between 10 and 100 ms. Even if an LLM were hosted in every AWS region in the world, latency would likely be annoying if you were expecting near-realtime responses (for example, if you were using an LLM as autocomplete while typing). If, say, Apple had an LLM on a chip that any app could access through an SDK, it could feasibly unlock a whole bunch of use cases that would be impractical with a network call.
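A back-of-envelope sketch of that budget, with every number hypothetical (the round trip and per-token decode time are assumptions, not measurements):

```python
# Rough latency budget for LLM autocomplete. All numbers are hypothetical.
BUDGET_MS = 100.0     # upper end of perceptible latency, per the comment above
RTT_MS = 40.0         # assumed network round trip to the nearest cloud region
MS_PER_TOKEN = 15.0   # assumed per-token decode time

def tokens_within_budget(rtt_ms: float) -> int:
    """How many tokens fit in the budget after paying the round trip."""
    return max(0, int((BUDGET_MS - rtt_ms) // MS_PER_TOKEN))

print(tokens_within_budget(RTT_MS))  # remote: 4 tokens
print(tokens_within_budget(0.0))     # on-device: 6 tokens
```

Under these made-up numbers the network call eats 40% of the budget before a single token is generated, which is the core of the argument for on-device inference.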
Also, offline access is still a necessity for many use cases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
https://www.cloudping.co/
It does if you care about who can access your tokens.
The real benefit, to a very particular type of mind, is that the alignment will be baked in (presumably a lot more robust than today) and wrongthink will be eliminated once and for all. It will also help flag anyone who would need anything as dangerous as custom, uncensored models. Win/win.
To your point, it's neat tech, but the limitations are obvious, since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
It doesn't have to be true for all models to be useful. Thinking about small models running on phones or edge devices deployed in the field, that would be a perfect use case for a "printed model".
I'd be kind of shocked if Nvidia isn't playing with this.
I don't expect it's super commercially viable today, but for sure things need to trend toward radically more efficient AI solutions.
These are chips that become e-waste the second a better model comes out, and Nvidia is already limited by TSMC capacity.
This is a ridiculous mindset. Llama 3.1 8B can do lots of things today and it'll still be able to do those things tomorrow.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it will still be able to do that when Llama 4 or 5 or 6 comes out.
If you pay $1,500 for a Mistral ASIC that is beaten by a $15 Qwen ASIC that comes out six months later, you'd be feeling pretty dang ridiculous.
I'm equally capable of making up numbers to support my perspective but I don't see the point.
The point is that the GP's mindset is not very ridiculous if you value things by a price/utility ratio. Software and hardware advancements will lead to buyer's remorse faster than people get an ROI from local inference.
SW and HW advancements will push this into "good enough for the vast majority" territory, making GP's point moot. You don't care that your LLM ASIC isn't the latest one, because it works for the use you purchased it for. The highly dynamic nature of LLMs themselves will make part of the advantage of upgradable software not that interesting anymore. [1]
[1] although security might be a big enough reason for upgrades to still be required
They'll be perfect for an appliance like the Rick and Morty butter robot.
These aren't made for general chatbot use.
Only in VC backed funding land.
In the real world, there are talking refrigerators that don't need to know how to recite Shakespeare.
On the upside, Shakespeare isn't going to change soon.
So you're saying we should burn Shakespeare onto a chip? /s
Doesn't Google have custom TPUs that are kind of a halfway point between Taalas' approach and a generic GPU? I wonder if that kind of hardware will reach consumers. It probably will, though as I understand them NPUs aren't quite it.
Are people surprised?
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There’s a bunch of fun infra to build to make this process cheaper/faster and I imagine MoE will bring some challenges.
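The ROI question can be framed as a simple break-even calculation. All numbers below are made up for illustration; the point is only the shape of the trade-off:

```python
# Break-even sketch for taping out a model-specific ASIC. Every figure is hypothetical.
NRE = 20_000_000.0            # assumed one-time design + tape-out cost, USD
GPU_COST_PER_UNIT = 30_000.0  # assumed cost to serve the same load on GPUs, per unit
ASIC_COST_PER_UNIT = 5_000.0  # assumed per-unit ASIC cost at volume

# Units you must deploy before the NRE is paid back by per-unit savings.
units_to_break_even = NRE / (GPU_COST_PER_UNIT - ASIC_COST_PER_UNIT)
print(units_to_break_even)  # → 800.0
```

The catch the comment points at: the model has to stay commercially relevant long enough to deploy that many units, which is exactly where frequent model updates bite.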
> Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D Computer Graphics.
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".
Job-specific ASICs are "old as time."