This is genuinely confusing to my senses. The future is going to be so strange/neat/me unemployed.
> strange/neat/me unemployed
I'm not sure if that's what you were going for, but I read it as if it were written by The Board in the game Control, and found myself with the appropriate level of existential dread.
We love/help/replace you
And I haven't played that game, so I read it in Ralph Wiggum's voice... which also feels appropriate.
I'm in danger.
The future is totally illegible to me. I love these AI models, but I feel like I'm going to be jobless within 10 years.
Anomie is at an all-time high right now.
10 years? An optimist, I see.
Yeah. It keeps catching me off guard that it answered me already.
Why isn't the insane speed of this site (13K tokens/s) closer to the top of the AI conversation?
Because there's been nothing to discuss since their announcement. Their API access closed immediately due to overwhelming demand, and they haven't fabbed any models newer than Llama 3 yet.
Probably they will make bank selling to HFT for a while.
It's pretty well known by now.
I asked it for a block of C++ code and it hit 14,189 tok/s. I assume it cached someone else's session?
No - it's custom silicon https://news.ycombinator.com/item?id=48693490
Because I just tested it and it took 3-4 clarifications before it actually gave a correct response, compared with Gemini or a Google search. It's not great, but good.
I'd rather wait 3x as long.
This gave me some sense of what blisteringly fast AI actually is. What it means for the future remains an open question.
Wow.. what?! How is this so fast?! Where can I read more?
Funnily enough, pasting your comment straight into Jimmy leads to a... Funnily suboptimal answer that does not answer the question.
As someone else already pointed out, this is driven by a Canadian startup, Taalas, that basically makes chips that are LLMs, so everything is very fast but also baked into the chip. Once this kind of stuff is a commodity in like 10 years, our world will be very, very different.
Taalas HC1 AI uses Llama 3.1 8B, but takes up a massive 53B transistors and 815mm2 on TSMC N6 (nearly at the reticle limit of 858mm2). N2 is a little less than 3x as dense (110MTr/mm2 vs 313MTr/mm2).
This chip would still be roughly 272mm2 on N2, which costs an eye-watering $30k/wafer, and it would be bigger than a 9950X or Nvidia 5070.
This just isn't feasible. Some of the latest-gen LLMs seem to have 5-10T parameters or about 1000x more. I don't know that taping out just one chip makes economic sense let alone the 300-1000 chips required for a cutting-edge model. Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
There are a TON of uses for 8B-parameter models on the edge, but this is WAY too big to put on the edge of anything. Something like a 10mm2, 100M-parameter voice model might be feasible on the edge, but only for expensive devices. Most edge chips are on TSMC 28nm (up to 29MTr/mm2) or GF 22FDX (up to 40MTr/mm2), which would grow the AI chip to the point where it would absolutely dominate the BOM.
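For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted above. It assumes die area shrinks in proportion to peak logic density and grows linearly with parameter count. Both are simplifications, and the function names are made up for illustration. With the quoted densities the N2 shrink comes out near 286mm2, a bit above the 272mm2 you get from a flat 3x.

```python
# Back-of-envelope die-area scaling from the numbers in the comment above.
N6_AREA_MM2 = 815.0    # HC1 die area on TSMC N6, as quoted
N6_DENSITY = 110.0     # MTr/mm2, as quoted
N2_DENSITY = 313.0     # MTr/mm2, as quoted
HC1_PARAMS_B = 8.0     # Llama 3.1 8B, in billions of parameters


def scaled_area(area_mm2, old_density, new_density):
    """Assumption: area shrinks in proportion to peak density (optimistic)."""
    return area_mm2 * old_density / new_density


def area_for_params(params_b, ref_params_b=HC1_PARAMS_B, ref_area_mm2=N6_AREA_MM2):
    """Assumption: die area grows linearly with parameter count."""
    return ref_area_mm2 * params_b / ref_params_b


n2_area = scaled_area(N6_AREA_MM2, N6_DENSITY, N2_DENSITY)
voice_area = area_for_params(0.1)  # hypothetical 100M-parameter model on N6

print(f"N2/N6 density ratio: {N2_DENSITY / N6_DENSITY:.2f}x")
print(f"HC1 shrunk to N2: {n2_area:.0f} mm2")
print(f"100M-parameter model on N6: {voice_area:.1f} mm2")
```

Real designs rarely hit peak density, so treat these as lower bounds on area rather than predictions.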
> Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so.
The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA.
Either way, 99.999% of the knowledge is static, you just need to fine-tune the weights with that remaining 0.001% knowledge, which can be done using RAG or LoRA on a much smaller (thus cheaper) disposable chip.
The better solution would be making part of the chip cluster use something like FPGA which can be reprogrammed.
Text to speech or diagnostics equipment where the core model is relatively small and never changes seems like the ideal application. You might be able to fit something in the 25-30B range in 2nm to 14A, but it would need a way to update.
Large models are simply out of the question in my opinion. If you need 400+ different chip designs, it’ll be billions of dollars to tape out before you even make the first chip.
> The better solution would be making part of the chip cluster use something like FPGA which can be reprogrammed.
I'm not sure I follow (It's late, I am tired and I haven't had my dinner yet. That's my stupid trifecta!)
The original chip has the weights, so it's literally just a bunch of on-die (read-only) memory cells. The FPGA, while you could use it for the memory cells, would be way too expensive to use as pure memory. Typically one would hook up (read-only) storage to it, so you still need that read-only chip anyway.
The FPGA is just the compute bits, but this chip has on-die weights, not just compute.
I was proposing that they have the base weights on a primary (permanent) chip, and have a secondary (replaceable) smaller chip with weights for a specific use-case, or for fine-tuning with new knowledge/updates to the model.
The matrices can be multiplied LoRA style, applying the matrix in the secondary chip to the primary chip, resulting in up-to-date weights through which the prompt is pushed.
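To make the LoRA idea concrete, here is a toy sketch in plain Python. The matrices, sizes, and helper names are all invented for illustration, and nothing here reflects how Taalas actually wires anything. It shows that merging a low-rank update B·A into the base weights W gives the same output as leaving W untouched and adding the adapter's output on the side. That second form is what a read-only primary chip would need.

```python
# Toy LoRA: frozen base weights W (the "primary chip") plus a low-rank
# update B @ A (the "secondary chip"). All values are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]        # 3x3 frozen base weights
B = [[0.5], [0.0], [-1.0]]   # 3x1 adapter factor (rank 1)
A = [[1.0, 2.0, 0.0]]        # 1x3 adapter factor
x = [[1.0, 2.0, 3.0]]        # 1x3 input activation (row vector)

# Option 1: merge the update into the weights, then apply them.
merged = matmul(x, add(W, matmul(B, A)))

# Option 2: never modify W; compute the adapter path separately and sum.
side = add(matmul(x, W), matmul(matmul(x, B), A))

print(merged)
print(side)
```

The adapter here stores 6 numbers against 9 in W. At real model sizes with a small rank the gap is far larger, which is the argument for a small replaceable chip.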
Yeah, they're clearly just starting out and just shipped their very first proof of concept. But to me, their plans seem generally reasonable (https://taalas.com/the-path-to-ubiquitous-ai/), and like I wrote, if this kind of thing succeeds and becomes some kind of cheaply producible commodity component, I think there's huge value in that. Alas, maybe not as a frontier model replacement. But say 10 years from now you can drop a cheap Raspberry Pi-like device on your LAN and have a fast local engine for things like text sentiment analysis, text summarisation, voice recognition, and basic vision. That would be pretty exciting to me (but maybe, as you outlined, impossible in practice).
There is a reasonable kernel of an idea here, but only if you dial expectations WAY back. The 10 years speculation is just wrong though. Even in 10 years, their 8B param model isn't going to be in consumer devices.
6nm is just 7nm++ and the process will be a decade old in a few months. In the decade since, we've only had a slightly less than 3x increase in transistor density and that's including EUV, BSPD, and GAAFET (which means progress is likely going to slow down even more).
Even if we hit another 3x increase, their 815mm2 design, already shrunk to about 272mm2 on N2, will still be a bit over 90mm2. For comparison, the entire M5 Pro/Max CPU die is just 61.7mm2.
If our current progress somehow holds (not likely), even 20 years from now the 8B model would be 30mm2. You need 30 years of dead consistent progress to get it down to an includable 10mm2.
As you can see, this doesn't make sense to invest in. As for stuff like voice recognition or basic vision, these can often fit within 100M-parameter models, which would be around 10mm2 on their current 6nm design. That's doable today in custom edge computing devices.
The other possible use is cheap fallback models for AI companies. Moving to N2 and shrinking chips to 600mm2 to improve yields a bit would give about 50B parameters with 3 chips plus another FPGA-ish programmable chip for continuing training and interconnects for everything. You'd need hundreds of thousands of chips produced for that exact AI model just to get costs below $100,000 per board.
That seems like a lot of money for the AI model you are essentially giving away, but maybe it still beats the power and price of GPU server racks.
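The projection above can be reproduced in a few lines. This is only a sketch of the comment's own reasoning: it assumes a flat 3x density gain per node generation (roughly one per decade) and the same linear area-per-parameter scaling for the board estimate. The variable names are mine.

```python
# Reproduce the shrink projection and the 3-die board estimate.
HC1_AREA_MM2 = 815.0       # 8B-parameter HC1 on N6, as quoted
GAIN_PER_STEP = 3.0        # assumed density gain per ~decade


def projected_area(steps, area=HC1_AREA_MM2, gain=GAIN_PER_STEP):
    """Die area after `steps` successive density gains."""
    return area / gain ** steps


labels = ["N6 (HC1 as built)", "N2-class (now)", "+10 years", "+20 years", "+30 years"]
for step, label in enumerate(labels):
    print(f"{label:>18}: {projected_area(step):6.1f} mm2")

# Board estimate: three 600 mm2 dies on N2, using the quoted densities
# (110 vs 313 MTr/mm2) rather than the flat 3x assumption.
params_per_mm2_n6 = 8.0 / 815.0    # billions of parameters per mm2 on N6
n2_gain = 313.0 / 110.0
board_params_b = 3 * 600.0 * params_per_mm2_n6 * n2_gain
print(f"3 x 600 mm2 on N2: about {board_params_b:.0f}B parameters")
```

Under these assumptions the 8B design only drops to about 10mm2 after three more generations, and the three-die board lands at roughly 50B parameters, matching the figures in the comment.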
The flash models have fallen in size, at least between DeepSeek models. Is there a limit to how far models can shrink?
That’s why this stuff should be a government mega project ultimately.
It is not market viable but it is sure as heck revolutionary. Like an atomic bomb but including more… peaceful uses.
That’s exactly where government should take the reins, like with the ISS etc. However, the models are advancing too rapidly for now for it to make sense.
The government isn't going to make chip fabs go any faster, which is the biggest limitation here.
The second big issue is that it takes months to fab chips, meaning your hardware AI is months to maybe a year or more behind the times when it lands.
I do think it makes sense for something like a medical scanner where the model simply doesn't need constant updates, but that doesn't need government involvement to ship.
https://taalas.com/
Taalas https://taalas.com/the-path-to-ubiquitous-ai/
Previous HN discussion: https://news.ycombinator.com/item?id=47103661
Damn that is crazy.
This is the reaction every time it's posted, and deservedly so.
Not opening here... HN killed?
What
How?
Which model is behind it?
It’s pure silicon. Llama3.
hugged to death?