But it’s irrelevant. 750 tokens/s on a full frontier model is useful. 15000 poor quality tokens is much less useful no matter how much scaffolding you put around it.
You are missing the point. This is a technology demonstration on prototype hardware, and no one intends it to be seriously useful.
Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
> They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver, it’s nearly vaporware in my view.
I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.
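For readers following the layer-wise splitting argument, here is a minimal back-of-the-envelope sketch of a generic pipeline model. It is not Taalas's published design: the function names, the per-stage times and the hop latency are all hypothetical values chosen for illustration. It shows why chaining chips can keep aggregate throughput flat while per-request speed still depends on the links between chips.

```python
# Toy pipeline model: a network is split layer-wise across several chips.
# All numbers below are hypothetical and exist only to illustrate the shape
# of the trade-off, not to describe any real hardware.

def single_stream_tokens_per_s(stage_times_s, hop_latency_s):
    """One request at a time: each token must cross every stage and every
    chip-to-chip hop before the next token can start."""
    hops = len(stage_times_s) - 1
    per_token_s = sum(stage_times_s) + hop_latency_s * hops
    return 1.0 / per_token_s


def pipelined_tokens_per_s(stage_times_s):
    """Many independent requests in flight: assuming transfers overlap with
    compute, the slowest stage alone sets the aggregate rate."""
    return 1.0 / max(stage_times_s)


stage_times = [0.001, 0.001, 0.001, 0.001]  # assumed: 1 ms per stage, 4 chips
hop_latency = 0.0005                        # assumed: 0.5 ms per chip hop

single = single_stream_tokens_per_s(stage_times, hop_latency)
pipelined = pipelined_tokens_per_s(stage_times)
print(f"single stream: {single:.1f} tokens/s")
print(f"pipelined aggregate: {pipelined:.1f} tokens/s")
```

Under these assumptions the aggregate rate is unaffected by the number of chips, while the single-stream rate degrades with every added hop. Which of the two numbers a vendor means by "throughput" is the part worth checking.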
Why can't they do it? Jim Keller's company is also taking a different approach [0].
The simple fact that we think what we have now is scalable is basically what you are saying can't be done: "just chain a bunch of chips together to achieve the same performance on larger sizes". How do you think current architectures work? And what is being used today is all proprietary to one company!
[0] https://tenstorrent.com/solutions/llm-inference
Actually it's the opposite. Per mm of silicon it's massively less efficient and making enough chips and powering them is a major bottleneck right now. Worse, scaling to larger models requires more of our absolute best quality silicon manufacturing, where e.g. an H200 mostly just needs more memory.
I’ve been using 1,000 t/s on a near frontier model for a month now. It’s very useful for agentic coding.
It does require new approaches for me personally since I get a lot less time to think or read its output.
I think you missed the point and don't understand / aren't considerate of SLM utility.
But I’m not missing the point. If you can run one frontier model at 750t/s, then you can probably run many many instances of an SLM in parallel at a rate that exceeds 15k/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than llama3.1.
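A rough sketch of the arithmetic behind this disagreement, assuming made-up workload numbers: the 750 and 15,000 tokens/s figures come from the thread, while the 20 parallel streams and the 3,000-token response are hypothetical. Parallel instances can match the aggregate rate, but each individual request still runs at the per-stream speed.

```python
# Aggregate throughput vs. per-request latency.
# 750 and 15000 tokens/s are the figures quoted in the thread; the stream
# count and response length are assumptions picked for illustration.

def aggregate_tokens_per_s(streams: int, per_stream_tokens_per_s: float) -> float:
    """Total tokens per second across independent parallel streams."""
    return streams * per_stream_tokens_per_s


def seconds_to_generate(tokens: int, per_stream_tokens_per_s: float) -> float:
    """Wall-clock time for ONE request, which cannot be split across streams
    because each token depends on the previous one."""
    return tokens / per_stream_tokens_per_s


response_tokens = 3000  # assumed length of a single response

parallel_total = aggregate_tokens_per_s(20, 750.0)   # 20 streams at 750 t/s
serial_total = aggregate_tokens_per_s(1, 15000.0)    # 1 stream at 15k t/s

parallel_latency = seconds_to_generate(response_tokens, 750.0)
serial_latency = seconds_to_generate(response_tokens, 15000.0)

print(parallel_total, serial_total)      # same aggregate rate
print(parallel_latency, serial_latency)  # very different wait per request
```

Both setups deliver the same total tokens per second, so the comparison only favors one side once you decide whether the workload is many independent requests or one long sequential chain.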
Yes, you are missing the point. 1) It's a demo. [0] 2) It hasn't been updated for 4+ months.
You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Just because you can run a SotA model at "fast" speeds, again, severely misses the point.
And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options then we'll all follow you off the cliff.
[0] https://taalas.com/products/
Why are you representing this as such a binary here? For SLM we don’t need the Taalas stuff at all. Just run it locally on your own device if it’s truly a small model. And there’s plenty of larger models that can be run on-premise just fine.
I think it’s impressive that a frontier model can achieve 750t/s. That’s all. You can get similar insane token speeds from other open weight models too.
The irony here is that, according to you, my take is the binary one, when your response is: well, we can all just run it on our devices - we don't need any other options!
You seem to be cool with a very small and gated ecosystem with whatever tech billionaires want you to have access to.
I grew up in the era where compute was diverse and open. You may think this is OK, but it's not. The more options we have and the more diversified they are the better tech will move back towards.
I'm not the one with the myopic view here. Enjoy your "on-device" models over in your utopia of a walled garden.
I think you’ve got things quite backwards if you think that the desire to run models on device or use any of the variety of open weight models (big or small) on premise is somehow bowing down to tech billionaires. Quite the opposite really.
Once again, my statement is that the Taalas product is not a fair comparison because it runs an old outdated model. If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.
> Once again, my statement is that the Taalas product is not a fair comparison because it runs an old outdated model.
Either you didn't look at the page I linked or you're having comprehension problems.
> If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.
Except, you can't. There's no commodity hardware out there today that can run even an "old outdated model" at this speed and power utilization. Again, maybe read first and try to understand my original point?
> "...my statement is that the Taalas product is not a fair comparison..."
You actually hadn't stated this. You said it wasn't needed. Which is it?
> If you want to run a similar model at similar speeds...
You can't. Find me a single system that can run this, again, "old outdated model" at even similar speed. You're hung up on the model. The point is that if we all just stay in this wonderful world of inefficient large models we will all end up at the mercy of OAI, Anthropic, Google, etc. Meanwhile, other companies like Taalas are putting research dollars into making AI scalable, affordable and efficient. Do you really think commodity hardware is going to be attainable anytime in the near future on this trajectory? Do you need a laptop to cost $10k USD before it clicks? That is exactly how you end up kissing Altman's ass in this situation.