This is very cool to see - seems like soooo much efficiency waiting to be unlocked at the chip level.
What's everyone think of Taalas?
They're actually burning the LLM model into the silicon, with some onboard memory for fine-tuning. They claim huge cost / latency wins.
Super fast demo live at: https://chatjimmy.ai/
https://www.reddit.com/r/singularity/comments/1r9frzk/taalas...
Their demo is almost unbelievably fast, but as I understand it, the limitation of Taalas's strategy is the KV cache. It grows with context length, so it either needs to be stored in SRAM (small) or streamed in (slow). Even for a tiny model like the Llama 8B they have in their demo, the KV cache will be ~64 KB per token at 8-bit quantization, so at a 1,000-token sequence length you are already at ~64 MB of SRAM for a single user. This is probably why their demo only lets you generate 1,000 tokens: they can't go beyond that without slowing down inference.
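For anyone who wants to check the per-token figure, here is a rough sketch of the arithmetic. The layer count, KV-head count and head dimension are my assumptions (the published Llama 3.1 8B shape); the thread only says "Llama 8B", and the function name is just illustrative.

```python
# Assumed Llama 3.1 8B attention shape (not stated in the thread):
# 32 layers, 8 KV heads (grouped-query attention), head dimension 128.
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(tokens: int, bytes_per_value: int = 1) -> int:
    """KV-cache size in bytes for a single sequence of `tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token


if __name__ == "__main__":
    print(kv_cache_bytes(1) / 1024)    # KiB per token at 8-bit
    print(kv_cache_bytes(1000) / 1e6)  # MB for a 1,000-token sequence
```

Under those assumptions this gives 64 KiB per token and about 65.5 MB at 1,000 tokens, which matches the rough numbers above. Doubling `bytes_per_value` to 2 (16-bit) doubles both.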
So I'm curious what their strategy is. It seems to me that the options are:
1. Target smaller use cases that can live with a tiny context window
2. Use huge amounts of SRAM (at which point they look like Groq or Cerebras)
3. Make it up with extreme KV-cache compression/quantization
4. Run linear-attention/sliding window attention models
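To get a feel for how much options 3 and 4 could buy, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model shape is the published Llama 3.1 8B one, the 8,192-token context and 1,024-token window are arbitrary, and it ignores the overhead of quantization scales. Llama 8B itself does not use sliding-window attention, so the `window` case describes a hypothetical different model.

```python
from typing import Optional


def kv_bytes(tokens: int, bits_per_value: int = 8, window: Optional[int] = None,
             layers: int = 32, kv_heads: int = 8, head_dim: int = 128) -> int:
    """Approximate KV-cache bytes for one sequence (illustrative only)."""
    # With sliding-window attention only the last `window` tokens are kept.
    cached_tokens = tokens if window is None else min(tokens, window)
    values = 2 * layers * kv_heads * head_dim * cached_tokens  # keys + values
    return values * bits_per_value // 8


if __name__ == "__main__":
    context = 8192  # hypothetical context length
    for label, size in [
        ("8-bit, full attention", kv_bytes(context)),
        ("4-bit, full attention", kv_bytes(context, bits_per_value=4)),
        ("8-bit, 1,024-token window", kv_bytes(context, window=1024)),
    ]:
        print(f"{label}: {size / 2**20:.0f} MiB")
```

Quantization only shrinks the cache by a constant factor, while a sliding window (or linear attention) caps it regardless of context length, which is why option 4 looks like the better fit for a fixed SRAM budget.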
Other commenters have mentioned robotics as a potential application, which sounds interesting.
> seems like soooo much efficiency waiting to be unlocked at the chip level
Well if you are exclusively using GPUs that are general purpose, of course you leave so much efficiency on the table. That’s why Google started making TPUs more than a decade ago. I remember that kerfuffle when Google fired Timnit Gebru after Gebru’s paper used GPUs to calculate the environmental impact of LLMs while ignoring the efficiency of TPUs; this basically made Jeff Dean very angry because of that wide efficiency gap.
These NVIDIA GPUs aren't general purpose in the way that you think. They can't even run games. Nvidia Blackwell is probably slightly more efficient than TPUs for training. Do you really expect a $4 trillion company whose revenue has been majority AI for some years now not to have built its flagship product fully around AI? The GPU name stuck around, but they are pretty terrible at graphics.
The real efficiency win in these chips is that they are made for inference only. You can throw away the vast majority of a chip if you only need a few ops, a single precision (like INT8 or FP8) and don't need ultra fast interconnects.
That ... wasn't the kerfuffle
She wrote the stochastic parrots paper.
Google’s internal review blocked it from publication. Stated reasons were about paper quality. You can speculate whether that was the real reason.
Gebru issued an ultimatum email and said she would resign if some list of conditions weren’t met.
Google said “thanks, we accept your resignation”.
She claims it is retaliation, but it seems more like an own-goal if you ask me. She basically handed Google the solution to their problem.
Practical lesson: don’t tell your employer you might quit before you’re ok with leaving.
It kind of was. I really hate gaslighting, but GP is not inaccurate. Google claimed it did not meet their bar for publication because it ignored recent research on how to reduce the environmental and bias-related risks of LLMs. On the other hand, a large org is unlikely to subsidize high-profile research that makes it look bad. And Gebru was critical of Google’s internal culture and diversity efforts…
I haven't read any of these papers, but given the environmental impact of LLMs in 2026, it seems like Timnit Gebru has been thoroughly vindicated...
It'd be cool to see more of this type of thing, but I have to imagine the ability for it to be updated to a brand-new model as new models come out is limited. If that is the case, it's going to be an extremely hard sell.
> extremely hard sell.
It really depends on the price point at which they can get a board. If they can do a ~32B model for $1k and the size of an external HDD, I'd buy one now, even knowing that it won't be upgradeable / the model remains fixed. The speeds they've shown are a quality of their own, and there's plenty you can do with such a model and faster-than-instant responses.
Maybe in 10 years when the tech matures, but IMO now seems a bit too early to have a tech like this. It is like intelligence without evolution or progress.. yes it can be used in some niche markets, but difficult to be generic.
There are plenty of applications that would be useful right now. Specialized models for tool use, like fine-tuning for command line tools that are already well established and don’t change often. I’m sure there are many areas where the training data is essentially crystallized and unlikely to change. Think of them more like delegated agents or coprocessors that another model could route to, so instead of routing to a quantized or lesser model it could use a full-fidelity model that is faster, almost instantaneous.
If performance per watt is 100x better than GPUs (as GP link claims) then I don't think it's a hard sell at all. That's actually a cost reduction that matters.
You don't need SOTA models for all tasks, and being able to do more routine tasks at something like 10% of the cost and 70x speed unlocks LLM use for things that are just unthinkable now (bulk classification tasks, real time speech interaction, etc)
A hard sell right now. The rate of change will slow down.
Yes, but with current architectures world knowledge is baked into the weights. We might stop figuring out how to make models better, but the world keeps changing, science is going to keep making progress at understanding the world, etc. This creates a significant minimum rate of change and I'm pretty skeptical that it's worth baking weights into silicon as a result.
I think it would just be an opportunity to sell another chip a few years down the line. If the utility curve flattens out on the performance of models I can see a future where you are buying an up to date chip every few years to upgrade to the latest and greatest, while providing up to date context as part of the user input. Like if I have a programming task and I supply a copy of up-to-date documentation alongside my input, I would think that I could still get good output out of a dated model.
That's why we have reasoning/CoT LLMs that can use tools to get updated information.
This already isn't the case for the popular models. The knowledge baked into the weights tells the model how to talk and reason, but for world knowledge they do a web search right off the bat most of the time.
I mean it just depends on the price of the chip. You might just replace the chip like you would any other component. Like a video game cartridge or something.
What makes you think that? The rate of change seems to have been increasing, and there are so many chip and model bets in different directions at the moment.
I think the model they chose is out of date and hard to sell, but there are plenty of use cases where today's dumb small models are fine. A Qwen 3.5/3.6 or Gemma 3 model on silicon at those speeds would be genuinely world changing even if it's only 1-3B params. Such a model at those speeds will remain extremely useful even over a 5-6 year timespan, I think.
If you consider the places you could deploy it -- with no network access, and at those high speeds... very useful .. for adding vague "common sense" fuzzy thinking to all kinds of applications that right now piss consumers off with poor UX. Esp if the model can do voice-to-text and text-to-speech well (some of the smaller models can)
I wouldn't be surprised if "fast, cheap, dumb" ends up being the market for LLMs.
The state-of-the-art models aren't at "can fully replace knowledge worker" levels yet and I doubt they'll get there any time soon, so charging $2000 / month for access isn't going to happen. Right now everyone and their dog is being handed subsidized credits to play with AI, but the actual outcome is rarely good enough to be worth the money they'd need to charge for it. It might very well take another order of magnitude or two to get LLMs to be truly good (if it is even possible at all), and considering how much money is already being pumped into it I just don't see that happening.
On the other hand, the dumb models are more than adequate for simple noncritical tasks, like directing a user to the appropriate FAQ entry, or playing phone decision tree. There's a lot of money in making chatbot assistants actually useful, or in augmenting website search. Turning it into a glorified "language-to-API-call" translator doesn't take a lot of smarts, but as long as it's cheap you can make a killing in volume.
> On the other hand, the dumb models are more than adequate for simple noncritical tasks, like directing a user to the appropriate FAQ entry
This is a lane I’ve been experimenting in - seeing what I can get out of models that work in 16GB VRAM for simple tasks (screen scraping, decision tree navigation, natural language queries). It’s interesting for sure (certainly reveals non-deterministic limits) and promising for low-criticality tasks that leave an opportunity for review, but I also feel like I need better sources/community for understanding and reflection. Preferably those that aren’t hype channels. Any pointers?
> I think the model they chose is out of date and hard to sell
I understood it as a proof-of-concept, not a for-mass-production single blueprint - i.e.: "if you need your NN in a CIM form on ASIC, we can do it".
Their next proof-of-concept was said to be meant to be about size: "we showed you we can do it with 8b, now we are working to show you we can do 24b or 32b". Then, "and we plan to go bigger and faster".
> Our second model, still based on Taalas’ first-generation silicon platform (HC1), will be a mid-sized reasoning LLM. It is expected in our labs this spring and will be integrated into our inference service shortly thereafter. // Following this, a frontier LLM will be fabricated using our second-generation silicon platform (HC2). HC2 offers considerably higher density and even faster execution. Deployment is planned for winter (19 Feb 2026)
In a chatbot, 17k tok/s is a neat but nearly useless showcase. In a coding agent it is a meaningful improvement. In robotics, it could be an absolute revolution.
8B models aren't useful in general, but for specific use cases they can provide an enormous amount of intelligence - nVidia's Tesla/Waymo competitor is a 7B LLM with a 2B diffusion model, and running that at those speeds could be an order of magnitude cheaper than existing solutions.
17K tok/s is approaching realtime motor cortex needs for a robot with ~12 actuators (bipedal humanoid) and an IMU. I don't know how many parameters a motor cortex would need but 8B feels like it is within 2 orders of magnitude.
this is an LLM, not a motor cortex. it will output commands as text (json, ...), so comparing size is not very meaningful, especially considering that biological neurons are highly complex and each likely requires thousands of simple artificial neurons (weight+bias) to emulate
There's nothing about Taalas that is specific to an LLM
Bumping the speed of these things would be more than meaningful. It would be a massive game changer.
I assert like 80% of this “multi agent parallel workflow” business is simply a workaround to models being soooooo slow. Like as the dude driving these things… you kick it off and twiddle your thumbs waiting minutes to hours sometimes for all the inference and token generation to finish. So you dispatch multiple workstreams in parallel to be more efficient.
I assert that if the model was even 10x faster we’d be using these things radically differently. You’d be doing things that are currently time prohibitive. At 100x, holy shit will software dev get crazy. You’d be kicking off hundreds of parallel workers attacking a problem from every angle and stuff. Who even knows!!!
And the thing is, 10x will absolutely come and probably even 100x. And it will be sold like a video game cartridge or something depending on how the actual model gets “baked” into the hardware. No remote inference at all.
Could you give me some example how in robotics it can be an absolute revolution?
My understanding is that robotics doesn't really rely much on LLMs in the first place but rather on other things.
Is the thing that you are suggesting that it would ingest all real time data and then reason through it at an incredibly fast speed and then act on it and re-iterate? I might imagine some problems with this though I am not a robotics engineer and perhaps someone who deeply understands this topic can give more information.
LLMs are very good at looking at images and reasoning about them. Much more than just object recognition/segmentation: they can explain the physics in the image, the intents, plan actions, ...
That's because of posttraining optimizing for benchmarks that test that.
They tend to collapse into nonsense and hallucinations pretty quickly if you move slightly out of the envelope of the current visual reasoning benchmaxxing.
Disclaimer: I'm a robotics noob, but I've been working on robotics for a few months now.
I'd say virtually all robots you've seen in the real world today rely on classical approaches - you build a rudimentary map, then use classical algorithms to find paths/do area coverage. The robots do not reason about or understand what they're looking at, they're more like in-game units. At most there's some bounded, lightweight image classification going on.
LLMs can understand and reason about the world natively. nVidia has a Tesla FSD/Waymo competitor which is simply their 7B reasoning LLM, but instead of outputting tokens directly, its outputs are fed to a 2B diffusion model that outputs a 1.6-second-long trajectory for the car, and this is enough for an L2 system. But to make this work, they need the model to run at 10Hz, so they use super high-end hardware to do it (Jetson Thor) and the car is still "blind" for 100ms at a time (they have a parallel classical safety system).
With on-chip LLMs you could run this loop at like 100Hz on a chip that costs a few hundred bucks, rather than 10Hz on a board that costs several thousand.
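To put rough numbers on the loop-rate argument, here is a small sketch. It assumes the 17k tok/s figure quoted elsewhere in the thread is the decode rate of a single stream and ignores prompt processing and sensor latency; the function names are made up for illustration.

```python
def blind_interval_ms(loop_hz: float) -> float:
    """Time between control updates, i.e. how long the system is 'blind'."""
    return 1000.0 / loop_hz


def tokens_per_tick(tokens_per_second: float, loop_hz: float) -> float:
    """How many tokens the model can emit within one control period."""
    return tokens_per_second / loop_hz


if __name__ == "__main__":
    TOK_PER_S = 17_000  # figure quoted in the thread, assumed single-stream
    for hz in (10, 100):
        budget = tokens_per_tick(TOK_PER_S, hz)
        print(f"{hz} Hz: {blind_interval_ms(hz):.0f} ms per tick, "
              f"{budget:.0f} tokens per tick")
```

At 100Hz the blind interval drops from 100ms to 10ms, and the model still has about 170 tokens per tick to work with. Spread across the ~12 actuators mentioned upthread that is roughly 14 tokens each, which is tight for text output but not obviously impossible.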
Low latency is nice. But it would be more interesting if they could demonstrate the efficiency of energy consumption.
Tokens/seconds and watt-hours seem related?
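They are related, but only through power draw: energy per token is watts divided by tokens per second, so you need both numbers. A quick sketch of the conversion is below. The 200 W value is a placeholder I made up, not a Taalas spec; only the 17k tok/s figure comes from the thread.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (1 W = 1 J/s)."""
    return power_watts / tokens_per_second


def watt_hours_per_million_tokens(power_watts: float,
                                  tokens_per_second: float) -> float:
    """Energy in Wh needed to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    return power_watts * seconds / 3600  # 3600 J per Wh


if __name__ == "__main__":
    POWER_W = 200.0       # placeholder board power, NOT a published figure
    TOK_PER_S = 17_000.0  # throughput quoted in the thread
    print(joules_per_token(POWER_W, TOK_PER_S))
    print(watt_hours_per_million_tokens(POWER_W, TOK_PER_S))
```

So a fast chip is only an efficient chip if its power draw doesn't scale up with the speed, which is why a published watts figure would be more convincing than tok/s alone.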
It seems technically interesting, but they seem very sparse on details. I don't know if I like the idea of a single unchanging model forever on a chip. How much more expensive would the silicon be if they used rewritable ROM for the weights? Such an arrangement would permit fine-tunes of the model it was designed for, which might minimize concerns about the model becoming outdated.
The Taalas cards don't store the weights in memory at all; each weight multiplier is translated into a circuit.
I think hardware like this is the future for LLM-providers once we reach a point where the models aren't advancing much any more. You could argue we're close now.
The hyperscalers like AWS will make great use of these to serve up models that will be relevant for several years. But right now, we're still seeing significant bumps in model quality every couple of months - especially with open-weight models like Deepseek/Kimi/GLM.
Until that point, though, I don't see how this is ever going to be cost effective vs general purpose hardware.
I also think we'll see miniature versions of this baked into mobile hardware for super fast and efficient on-device LLMs.
I see only these two possibilities:
1. If LLMs keep improving, burning models onto silicon becomes obsolete too fast and is not worth doing. Outcome: We keep getting better LLMs.
2. If LLM improvements slow down, they will be burned onto silicon. Outcome: We get faster, cheaper and energy-efficient LLMs.
Either way sounds great to me. It will certainly be a mix so we can even get both.