Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
This is something I've been thinking about for a while...the current state of things really does feel like the dial-up era, and it makes me wonder what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a JPEG load a few rows of pixels at a time, and of the various loading and connecting animations that applications implemented before things got fast enough to make them less relevant.
Some of the work in that direction, like what Cerebras and Taalas have been doing, is an interesting glimpse of where this could go. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.
Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, ie a dedicated, ASIC-like chip.
Wow - actually pretty astonishing how fast their inference is. So fast it feels fake?
Yeah, with inference that fast it almost feels like the answer arrives before you hit return. Now imagine it running locally with no server round-trip.
Groq was the preview of the broadband era of LLMs for me. I remember asking a question on the demo site and the answer text showed up near instantly. Far faster than I could read. This was ~1 year ago and pre-acquisition.
You're right about it being reminiscent of the dial-up era, but I don't believe it's 300 to 1200; it's more like 4800:
Modem vs Claude according to Claude (time to deliver 2368 characters):
300 baud - 1m 19s
1200 baud - 19.7s
2400 baud - 9.9s
14.4K - 1.6s
33.6K - 705 ms
56K - 447 ms
Claude - 7.9s
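Those modem figures line up with the usual back-of-the-envelope assumption of ~10 bits per character on an async serial link (8 data bits plus start and stop bits), i.e. chars/sec ≈ baud / 10. A quick sketch of that math, using the 2368-character count and the observed 7.9 s Claude time from the list above:

    # Rough reconstruction of the comparison above.
    # Assumption: ~10 bits per character over an async serial link
    # (8 data bits + start/stop bits), so chars/sec ~= baud / 10.
    text_chars = 2368

    for name, baud in [("300", 300), ("1200", 1200), ("2400", 2400),
                       ("14.4K", 14_400), ("33.6K", 33_600), ("56K", 56_000)]:
        seconds = text_chars / (baud / 10)
        print(f"{name:>6} baud: {seconds:7.2f} s")

    # Observed Claude time for the same 2368 characters (from the list above).
    print("Claude:  7.90 s")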
Check chatjimmy.ai
https://chatjimmy.ai is a demo of the "burn the model into an ASIC" approach being sold by Taalas[0], which they use to run Llama 3.1 8B at ~17,000 tokens per second (rough scale math below).
[0] - https://taalas.com/products/
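For a rough sense of how ~17,000 tokens per second compares with the modem scale upthread, assuming the common ~4 characters per token rule of thumb (an approximation, not a Taalas figure) and the same ~10 bits per character:

    # Back-of-the-envelope only; chars/token varies by tokenizer and text.
    tokens_per_second = 17_000
    chars_per_token = 4                                      # rough assumption
    chars_per_second = tokens_per_second * chars_per_token   # ~68,000 chars/s
    baud_equivalent = chars_per_second * 10                  # ~680,000 baud
    print(chars_per_second, baud_equivalent)
    print(2368 / chars_per_second)   # ~0.035 s for the 2368-char example above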
Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model at this point - it's really outdated. It's cool that they were able to accelerate a model in silicon, but it also feels wasteful when Llama 8B is such a weak model.
I guess their point was to demonstrate that it's possible to bake a decently sized model into silicon? As with anything hardware-related, the lead time will be considerably longer than for software, so in a 1-2 year timeframe we might see something like Gemma 4 baked onto a chip.
Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.
Heck, I'm still a fan of Gemma 2 9B.
There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.
Taalas. A sibling comment of yours posted the chat demo URL -
https://chatjimmy.ai/
Woah. How is this working? It's stupid fast.
The weights are mapped directly onto transistors. It's not a generic processor; it's literally a dedicated Llama 8B chip that can't be used for anything else. The more specialized the hardware, the faster it runs - Taalas is pushing that to the limit.
They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.
Cerebras
They built a wafer-scale ASIC - the entire wafer is one huge active chip. It takes a lot of clever engineering and cooling to make it work, and it's very cool.
Groq.
No, it was a custom ASIC with the weights baked in for a single model. I do envision a future where we return to cartridges: local AI becomes the default, and massively optimised chips are built to be plug-and-play, each running a single SoTA model.
Likely https://taalas.com