OpenAI made a huge mistake neglecting fast inference models. Their strategy was GPT-5 for everything, which hasn't worked out at all. I'm really not sure which model OpenAI wants me to use for applications that require lower latency. If I follow the advice in their API docs about which models to use for faster responses, I'm told to either use GPT-5 with low thinking, replace GPT-5 with GPT-4.1, or switch to the mini model. So as a developer I'm now running evals on all three of those combinations. I'm also running my evals on Gemini 3 Flash right now, and with thinking disabled it's outperforming GPT-5 with thinking. OpenAI should stop trying to come up with ads and make models that are useful.
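For what it's worth, my latency eval is basically just this. The model names and the reasoning_effort knob are my reading of the current API surface; treat them as assumptions and swap in whatever your account actually exposes:

```python
# Rough latency A/B over the three configs the docs point to. Model names
# and the reasoning_effort parameter are assumptions; adjust as needed.
import time
from openai import OpenAI

client = OpenAI()

CONFIGS = [
    {"model": "gpt-5", "reasoning_effort": "low"},  # GPT-5, low thinking
    {"model": "gpt-4.1"},                           # fall back to 4.1
    {"model": "gpt-5-mini"},                        # the mini model
]

PROMPT = "Classify the sentiment of: 'the checkout flow keeps timing out'"

for cfg in CONFIGS:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": PROMPT}],
        **cfg,
    )
    elapsed = time.perf_counter() - start
    print(f"{cfg}: {elapsed:.2f}s, {resp.usage.completion_tokens} completion tokens")
```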

Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors here, but latency specifically favours TPUs.

The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, eg:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.

I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book: https://jax-ml.github.io/scaling-book/ – TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. This would lead to lower latency as far as I understand it.

The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware. Like the other commenter here, I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.

Sorry I meant Groq custom hardware, not Grok!

I don't see any latency comparisons in the link

The link is just to the book; the details are scattered throughout. That said, the page on GPUs specifically goes into the hardware differences, how TPUs are more efficient for inference, and which of those differences would lead to lower latency.

https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...

Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.

I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference but I could be wrong. I agree that I don't see why TPUs would necessarily explain latency.

To be clear, I'm only suggesting that hardware is a factor here; it's far from the only reason. The parent commenter corrected themselves that it was actually Groq, not Grok, they were thinking of, and I believe they're right about that, since Groq is doing something similar to TPUs to accelerate inference.

Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.

And our LLMs still have latencies well into the human perceptible range. If there's any necessary, architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.

My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, whereas TPUs pipeline data through systolic arrays far more. From what I've heard this generally improves latency and also reduces the overhead of supporting large context windows.
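To make the shape of that argument concrete, here's a toy back-of-the-envelope model. The numbers are made up for illustration, not real chip specs; the point is just that extra HBM round trips per layer translate directly into per-token latency:

```python
# Toy model of why extra HBM round trips hurt latency.
# All numbers are illustrative placeholders, not real hardware specs.

HBM_BANDWIDTH = 3e12      # bytes/s, assumed accelerator HBM bandwidth
ACTIVATION_BYTES = 50e6   # bytes of activations per layer for one request
N_LAYERS = 80

def layer_latency(hbm_round_trips_per_layer: int) -> float:
    """Memory-bound latency for one layer: bytes moved / bandwidth."""
    return hbm_round_trips_per_layer * ACTIVATION_BYTES / HBM_BANDWIDTH

# "GPU-like": activations spill to HBM between more of the ops.
# "TPU-like": more of the layer stays pipelined through the systolic array.
for name, trips in [("more HBM traffic", 6), ("more on-chip pipelining", 2)]:
    total = N_LAYERS * layer_latency(trips)
    print(f"{name}: ~{total * 1e3:.1f} ms per token (toy numbers)")
```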

Hard to find info, but I think the -chat versions of 5.1 and 5.2 (gpt-5.2-chat) are what you're looking for. They might just be an alias for the same model with very low reasoning, though. I've seen other providers do the same thing, where they offer a reasoning and a non-reasoning endpoint. It seems to work well enough.

They’re not the same; there are (at least) two different tunes per 5.x.

For each, you can use it as “instant”, supposedly without thinking (though these are all exclusively reasoning models), or you can specify a reasoning amount (low, medium, high, and now xhigh; if you don't specify, it defaults to none). Or you can use the -chat version, which is also “no thinking” but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent, but a different style and answering method).

It's weird that they don't document this stuff. Understanding things like tool call latency and time to first token is extremely important in application development.
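In the absence of docs I've just been measuring it myself with streaming, something like this. The model names are taken from upthread and the exact strings may differ on your account:

```python
# Rough time-to-first-token measurement via streaming, since the docs
# don't publish these numbers. Model names follow the thread above and
# may not match what your account actually exposes.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Reply with a one-sentence acknowledgment."

for model in ["gpt-5.2", "gpt-5.2-chat"]:
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    ttft = ttft if ttft is not None else total  # fall back if nothing streamed
    print(f"{model}: first token {ttft:.2f}s, full response {total:.2f}s")
```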

Humans often answer with fluff like "That's a good question, thanks for asking that, [fluff, fluff, fluff]" to give themselves more breathing room until the first 'token' of their real answer. I wonder if any LLMs are doing something like that for latency hiding?

I don't think the models are doing this; time to first token is more of a hardware thing. But people writing agents are definitely doing this, particularly in voice, where it's worth using a smaller local LLM to handle the acknowledgment before handing off to the main model.
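A minimal sketch of that pattern, under some assumptions: speak() stands in for whatever TTS/audio pipeline you use, the acknowledgment here is canned but could come from a small local model, and the model name is just an example:

```python
# Latency-hiding sketch: say something immediately while the real answer
# is still being generated. speak() and quick_ack() are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def speak(text: str) -> None:
    # Stand-in for your TTS / audio output pipeline.
    print(f"[voice] {text}")

async def quick_ack(user_text: str) -> str:
    # Cheapest version is a canned phrase; a small local model also works.
    return "Sure, let me check that for you."

async def full_answer(user_text: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-5-mini",  # whichever low-latency model you settle on
        messages=[{"role": "user", "content": user_text}],
    )
    return resp.choices[0].message.content

async def handle_turn(user_text: str) -> None:
    # Kick off the slow call first, then fill the gap with the acknowledgment.
    answer = asyncio.create_task(full_answer(user_text))
    await speak(await quick_ack(user_text))
    await speak(await answer)

asyncio.run(handle_turn("What's the status of my order?"))
```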

Do humans really do that often?

Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.

People who professionally answer questions do that, yes. Eg politicians or press secretaries for companies, or even just your professor taking questions after a talk.

> Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.

It gets a lot easier with practice: your brain caches a few of the typical fluff routines.

Yeah, I'm surprised that they've been through GPT-5.1 and GPT-5.1-Codex and GPT-5.1-Codex-Max and now GPT-5.2, but their most recent mini model is still GPT-5-mini.

I cannot comprehend how they do not care about this segment of the market.

It's easy to comprehend, actually: they're putting everything on "having the best model". It doesn't look like they're going to win, but that's still their bet.

I mean they’re trying to outdo google. So they need to do that.

Until recently, Google was the underdog in the LLM race and OpenAI was the reigning champion. How quickly perceptions shift!

I just want a DeepSeek moment for an open-weights model fast enough to use in my app; I hate paying the big guys.

Isn't deepseek an open weights model?

yeah but not super fast like flash or grok fast

One can only hope OpenAI continues down the path they're on. Let them chase ads. Let them shoot themselves in the foot now. If they fail early, maybe we can move beyond this ridiculous charade of generally useless models. I get it: applied in specific scenarios they have tangible use cases. But ask your friend or family member who doesn't care about tech what frontier model was released this week, and they'll not only be confused by what "frontier" means, it's very likely they won't have any clue. Also ask them how AI is improving their lives on the daily. I'm not sure if we've hit 80% of the model improvement we're going to get yet, but given OpenAI's progress this year, it seems they're at a very weak inflection point. Start serving ads so the house of cards can get a nudge.

And now with RAM, GPUs, and boards being a PITA to get given supply and pricing - a double middle finger to all of big tech this holiday season!

> OpenAI made a huge mistake neglecting fast inference models.

It's a lost battle. It'll always be cheaper to use an open source model hosted by others like together/fireworks/deepinfra/etc.
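And since most of those hosts expose an OpenAI-compatible endpoint, switching is mostly a base_url and model-name change. The URL and model id below are just examples; check your provider's docs for the real values:

```python
# Pointing the standard OpenAI client at a hosted open-weights provider.
# base_url and model id are examples, not endorsements.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # or fireworks/deepinfra/etc.
    api_key="YOUR_PROVIDER_KEY",
)

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example open-weights model id
    messages=[{"role": "user", "content": "Give me a one-line status summary."}],
)
print(resp.choices[0].message.content)
```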

I've been maining Mistral lately for low latency stuff and the price-quality is hard to beat.

I'll try benchmarking Mistral against my eval. I've been impressed by Kimi's performance, but it's too slow to do anything useful in real time.

I had wondered if they run their inference at high batch sizes to get better throughput and keep their inference costs lower.

They do have a priority tier at double the cost, but I haven't seen any benchmarks on how much faster it actually is.

The flex tier was an underrated feature in GPT-5: batch pricing with a regular API call. GPT-5.1 using the flex tier is an amazing price/intelligence tradeoff for non-latency-sensitive applications, without needing the extra plumbing of most batch APIs.
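If I remember the API right, the knob is the service_tier parameter on a normal chat completions call; treat the exact values as an assumption and double-check the docs:

```python
# Sketch of a flex-tier request from a regular chat completions call.
# I believe the knob is service_tier ("flex" for the cheaper/slower tier,
# "priority" for the faster/pricier one), but verify against the docs.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5.1",
    service_tier="flex",
    messages=[{"role": "user", "content": "Tag this log line: 'disk 87% full'"}],
)
print(resp.choices[0].message.content)
```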

I’m sure they do something like that. I’ve noticed Azure has way faster GPT-4.1 than OpenAI does.

> OpenAI should stop trying to come up with ads and make models that are useful.

Turns out becoming a $4 trillion company first with ads (Google), then owning everybody on the AI front, could be the winning strategy.


GPT-5 Mini is supposed to be the equivalent of Gemini Flash.