> They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip with an enormous die area for the model size. They’re making a lot of claims, but until they can deliver, it’s close to vaporware in my view.
I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.
Why can't they do it? Jim Keller's company is also taking a different approach [0].
What you are saying can't be done, "just chain a bunch of chips together to achieve the same performance on larger sizes", is basically the reason we consider what we have now scalable. How do you think current architectures work? And what is being used today is all proprietary to one company!
[0] https://tenstorrent.com/solutions/llm-inference