Hacker News

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102.

750 tokens/s for their largest model is going to be nuts

What about 15k tokens per second? [0] I remember looking at this earlier in the year and it being so fast that it feels fake. And, yes, this model is old - but still awesome for what it is.

[0] https://chatjimmy.ai/

Kirby64 a day ago [ - ]

It’s not just old, it’s also tiny and quantized. It’s llama 3.1 8b at 3/6-bit quant. This is the type of thing you can run on almost any device…

windexh8er a day ago [ - ]

I get that, but not at 15k tokens/s.

Kirby64 a day ago [ - ]

But it’s irrelevant. 750 tokens/s on a full frontier model is useful. 15000 poor quality tokens is much less useful no matter how much scaffolding you put around it.

Legend2440 a day ago [ - ]

You are missing the point. This is a technology demonstration on prototype hardware, and no one intends it to be seriously useful.

Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.

Kirby64 a day ago [ - ]

> They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.

I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver, it’s nearly vaporware in my view.

I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.

windexh8er 20 hours ago [ - ]

Why can't they do it? Jim Keller's company is also taking a different approach [0].

The simple fact that we think what we have now is scalable is basically what you are saying can't be done: "just chain a bunch of chips together to achieve the same performance on larger sizes". How do you think current architectures work? And what is being used today is all proprietary to one company!

[0] https://tenstorrent.com/solutions/llm-inference

__alexs 12 hours ago [ - ]

Actually it's the opposite. Per mm of silicon it's massively less efficient and making enough chips and powering them is a major bottleneck right now. Worse, scaling to larger models requires more of our absolute best quality silicon manufacturing, where e.g. an H200 mostly just needs more memory.

trollbridge 7 hours ago [ - ]

I’ve been using 1,000 t/s on a near frontier model for a month now. It’s very useful for agentic coding.

It does require new approaches for me personally since I get a lot less time to think or read its output.

windexh8er a day ago [ - ]

I think you missed the point and don't understand / aren't considerate of SLM utility.

Kirby64 a day ago [ - ]

But I’m not missing the point. If you can run one frontier model at 750t/s, then you can probably run many many instances of an SLM in parallel at a rate that exceeds 15k/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than llama3.1.

windexh8er a day ago [ - ]

Yes, you are missing the point. 1) It's a demo. [0] 2) It hasn't been updated for 4+ months.

You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Just because you can run a SotA model at "fast" speeds, again, severely misses the point.

And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options then we'll all follow you off the cliff.

[0] https://taalas.com/products/

Kirby64 21 hours ago [ - ]

Why are you representing this as such a binary here? For SLM we don’t need the Taalas stuff at all. Just run it locally on your own device if it’s truly a small model. And there’s plenty of larger models that can be run on-premise just fine.

I think it’s impressive that a frontier model can achieve 750t/s. That’s all. You can get similar insane token speeds from other open weight models too.

windexh8er 21 hours ago [ - ]

The irony here is, according to you, my take is the binary one. When your response is: well, we can all just run it on our devices - we don't need any other options!

You seem to be cool with a very small and gated ecosystem with whatever tech billionaires want you to have access to.

I grew up in the era where compute was diverse and open. You may think this is OK, but it's not. The more options we have and the more diversified they are the better tech will move back towards.

I'm not the one with the myopic view here. Enjoy your "on-device" models over in your utopia of a walled garden.

Kirby64 21 hours ago [ - ]

I think you’ve got things quite backwards if you think that the desire to run models on device or use any of the variety of open weight models (big or small) on premise is somehow bowing down to tech billionaires. Quite the opposite really.

Once again, my statement is that the Taalas product is not a fair comparison because it runs an old outdated model. If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.

windexh8er 20 hours ago [ - ]

> Once again, my statement is that the Taalas product is not a fair comparison because it runs an old outdated model.

Either you didn't look at the page I linked or you're having comprehension problems.

> If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.

Except, you can't. There's no commodity hardware out there today that can run even an "old outdated model" at this speed and power utilization. Again, maybe read first and try to understand my original point?

> "...my statement is that the Taalas product is not a fair comparison..."

You actually hadn't stated this. You said it wasn't needed. Which is it?

> If you want to run a similar model at similar speeds...

You can't. Find me a single system that can run this, again, "old outdated model" at even similar speed. You're hung up on the model. The point is that if we all just stay in this wonderful world of inefficient large models we will all end up at the mercy of OAI, Anthropic, Google, etc., while other companies, like Taalas, are putting research dollars into making AI scalable, affordable and efficient. Do you really think commodity hardware is going to be attainable anytime in the near future on this trajectory? Do you need a laptop to cost $10k USD before it clicks? That is exactly how you end up kissing Altman's ass in this situation.

ehsankia 14 hours ago [ - ]

I just tried it, and the answer is nonsense.

I asked it something simple, list some good indie puzzle games, and half the answers are games that don't exist. Imo quality > speed.

partsch a day ago [ - ]

They baked the LLM into a CPU

calvinmorrison 19 hours ago [ - ]

at 15K tokens/s... do you need code anymore

selcuka 18 hours ago [ - ]

Yeah, that's the point, right? With tool calling the LLM becomes code. So instead of asking it to write accounting software, you can hire the LLM to be your accountant.

recursive 18 hours ago [ - ]

But you'd still need code if you need something done in a consistent way.

block_dagger 16 hours ago [ - ]

Not necessarily. Consider a human assistant who performs repetitive tasks at an acceptable cost and accuracy while dealing with edge cases often autonomously.

rafaelmn 8 hours ago [ - ]

If we want reliability - we come up with processes to make it reliable and not rely on individuals getting it right. Code is a way to create a reliable process in the digital world.

recursive 7 hours ago [ - ]

For some things that's acceptable or even good. If I want to add up a list of a million numbers human assistants aren't bringing any advantages though.

saxenaabhi 11 hours ago [ - ]

Maybe acceptable in some cases but the original example in this thread was about accounting and they use software to do the counting not humans.

And even if humans/LLMs do it, there would still be a need for systems of record with things like audit logs, etc.

gandreani a day ago [ - ]

Using gpt-5.4-mini in off-peak hours already feels like super-speed to me. That's probably no more than 100-150 tk/s. I can't imagine 750!

I've always eyed Cerebras but never had a use for it that would justify paying for the API directly. Although now that I think about it, trying out the API would probably cost less than a subscription for a month...

jasonjmcghee a day ago [ - ]

Try gpt-5.3-codex-spark - it's 1000 TPS and from my experience more capable than 5.4 mini.

If you have a subscription it's a different pool of usage.

small_model a day ago [ - ]

Used it, very fast but tiny context window and doesn't have good reasoning. (good for quick simple code changes)

trollbridge 7 hours ago [ - ]

MIMO 2.5 Pro ultraspeed has a 1M window. 1,000 tok/sec is great for planning since you can have a rapid conversation with a lot of turns.

beering a day ago [ - ]

Agreed, 1000tok/s just fills up the context window (which is big by 2004 standards) super fast. But seems like 5.3-spark was just a taste of what’s to come.

taneq a day ago [ - ]

2004 standards? O.o

mlinsey 20 hours ago [ - ]

In 2004, I took a class where we trained "language models" that were bigram word models, on an archive of a couple years of the Wall Street Journal.

I remember someone who literally announced they were dropping the class to the whole room at the end of a lecture, saying "This isn't AI!!!"
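For anyone who hasn't seen one, a bigram word model of the kind described above can be sketched in a few lines. This is a minimal illustration, not what the class actually used: the function names and the toy corpus are made up, and there is no smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev); empty dict if prev was never seen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}


# Toy corpus (hypothetical), standing in for a real newspaper archive.
corpus = "the market rose and the market fell and the dollar rose".split()
model = train_bigram(corpus)

# Words that followed "the" in the corpus, with their estimated probabilities.
print(next_word_probs(model, "the"))
```

The whole "model" is a table of counts: predicting the next word is a lookup plus a division, and the only context it ever sees is the single previous word.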

partsch a day ago [ - ]

1904

bogeholm 12 hours ago [ - ]

Back when we were kids, we would get 0 tokens/sec _if we were lucky_

embedding-shape a day ago [ - ]

The ChatGPT subscription gives you access to the -spark model(s) in Codex which are blazing fast (but pretty dumb) which I think runs on Cerebras hardware too.

rrvsh 16 hours ago [ - ]

is this specifically in codex? have been trying to use the models for months on opencode then pi but it says chatgpt subscriptions don't have access to it - i was under the assumption that OpenAI doesn't lock down their models based on harness a la Claude Code

cactusplant7374 5 hours ago [ - ]

What plan are you on? It is only available to Pro users.

kegs_ a day ago [ - ]

I have a pretty good use case for gpt-oss. The amount of time savings has actually been wild. Definitely worth a try. Just to be clear, it gets like 2000tok/s

comboy a day ago [ - ]

But it seems that there is some queuing/load balancing on their side. I mean, when opus is actually outputting this 55t/s it feels fast, but apart from its internal reasoning I think there's sometimes just waiting.

fragmede a day ago [ - ]

Oh wait yeah good point. At 750 tokens a second and the same amount of human patience they can set it to think for the same amount of time but four or five times the amount of thinking tokens, which may improve the quality of the eventual output.
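Rough arithmetic for the fixed-wait idea, as a sketch: the thinking budget is just speed times how long you're willing to wait. The speeds are the ones quoted at the top of this thread (they are for different models, so this is only illustrative), and the 30 second wait is an assumption. The exact multiple depends on which baseline you compare against.

```python
def thinking_tokens(tokens_per_second, wait_seconds):
    """Tokens a model can emit within a fixed wall-clock wait."""
    return tokens_per_second * wait_seconds


WAIT_SECONDS = 30  # assumption: how long a user is willing to wait

# Speeds quoted earlier in the thread.
speeds = {
    "~55 tok/s": 55,
    "~102 tok/s (fast mode)": 102,
    "750 tok/s": 750,
}

baseline = thinking_tokens(speeds["~55 tok/s"], WAIT_SECONDS)
for label, tps in speeds.items():
    budget = thinking_tokens(tps, WAIT_SECONDS)
    print(f"{label}: {budget} tokens in {WAIT_SECONDS}s ({budget / baseline:.1f}x the slowest)")
```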

order-matters a day ago [ - ]

the more advanced models also utilize a lot more tokens, and a lot of these extra tokens may go towards safeguards at a higher rate than prior models as well.

not to say a speed boost isn't there, but if they didn't increase tokens/s at all you'd likely see things slow down a lot with the new model compared to current

beering a day ago [ - ]

I think regular users will still have the old speed, so should be easy to tell whether it is more thinkier than 5.5.