Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
This is something I've been thinking about for a while...the current state of things really does feel like the dial-up era, and it makes me wonder what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a JPEG load a few rows of pixels at a time, and of the various loading and connecting animations that applications implemented before things got fast enough to make them less relevant.
Some of the work in that direction, like what Cerebras and Taalas have been doing, is an interesting glimpse of where this could go. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.
Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, ie a dedicated, ASIC-like chip.
Wow - actually pretty astonishing how fast their inference is. So fast it feels fake?
Yeah, with inference that fast it almost feels like the answer arrives before you hit return. Now imagine it running locally with no server round-trip.
Groq was the preview of the broadband era of LLMs for me. I remember asking a question on the demo site and the answer text showed up near instantly. Far faster than I could read. This was ~1 year ago and pre-acquisition.
You're right about it being reminiscent of the dial-up era, but I don't believe it's 300 to 1200; it's more like 4800:
Modem vs Claude according to Claude (time to deliver 2368 characters):
300 baud - 1m 19s
1200 baud - 19.7s
2400 baud - 9.9s
14.4K - 1.6s
33.6K - 705 ms
56K - 447 ms
Claude - 7.9s
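Those modem figures line up with the usual back-of-the-envelope assumption of ~10 bits per character on an async serial link (8 data bits plus start and stop bits), i.e. chars/sec ≈ baud / 10. A quick sketch of that math, using the 2368-character count and the observed 7.9 s Claude time from the list above:

    # Rough reconstruction of the comparison above.
    # Assumption: ~10 bits per character over an async serial link
    # (8 data bits + start/stop bits), so chars/sec ~= baud / 10.
    text_chars = 2368

    for name, baud in [("300", 300), ("1200", 1200), ("2400", 2400),
                       ("14.4K", 14_400), ("33.6K", 33_600), ("56K", 56_000)]:
        seconds = text_chars / (baud / 10)
        print(f"{name:>6} baud: {seconds:7.2f} s")

    # Observed Claude time for the same 2368 characters (from the list above).
    print("Claude:  7.90 s")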
Check chatjimmy.ai
https://chatjimmy.ai is a demo of the "burn the model into an ASIC" approach being sold by Taalas[0], which they use to run Llama 3.1 8B at ~17,000 tokens per second (rough scale math below).
[0] - https://taalas.com/products/
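For a rough sense of how ~17,000 tokens per second compares with the modem scale upthread, assuming the common ~4 characters per token rule of thumb (an approximation, not a Taalas figure) and the same ~10 bits per character:

    # Back-of-the-envelope only; chars/token varies by tokenizer and text.
    tokens_per_second = 17_000
    chars_per_token = 4                                      # rough assumption
    chars_per_second = tokens_per_second * chars_per_token   # ~68,000 chars/s
    baud_equivalent = chars_per_second * 10                  # ~680,000 baud
    print(chars_per_second, baud_equivalent)
    print(2368 / chars_per_second)   # ~0.035 s for the 2368-char example above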
Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model at this point - it's really outdated. It's cool that they were able to accelerate a model in silicon, but it also feels wasteful when Llama 8B is such a weak model.
I guess their point was to demonstrate that it's possible to bake a decently sized model into silicon? As with anything hardware-related, the lead time will be considerably longer than for software, so in a 1-2 year timeframe we might see something like Gemma 4 baked onto a chip.
Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.
Heck, I'm still a fan of Gemma 2 9B.
There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.
Taalas. A sibling comment of yours posted the chat demo URL -
https://chatjimmy.ai/
Woah. How is this working? It's stupid fast.
The weights are mapped directly onto transistors. It's not a generic processor; it's literally a dedicated Llama 8B chip that can't be used for anything else. The more specialized the hardware, the faster it runs - Taalas is pushing that to the limit.
They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.
Cerebras
They built a wafer-scale ASIC - the entire wafer is one huge active chip. It takes a lot of clever engineering and cooling to make it work, and it's very cool.
Groq.
No, it was a custom ASIC with the weights baked in for a single model. I do envision a future where we return to cartridges: local AI becomes the default, and massively optimised chips are built to be plug-and-play, each running a single SoTA model.
Likely https://taalas.com