This is gold for Anthropic's profitability. The Claude Code addicts can double their spend to plow through tokens because they need to finish something by a deadline. OpenAI will have a similar product within a week but will only charge 3x the normal rate.
This angle might also be Nvidia's reason for buying Groq. People will pay a premium for faster tokens.
Note that you can't use this mode to get the most out of a subscription - they say it's always charged as extra usage:
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
Although if you visit the Usage screen right now, there's a deal you can claim for $50 free extra usage this month.
At this point why don't we just CNAME HN to the Claude marketing blog?
Looking at the "Decide when to use fast mode" section, it seems the future they want is:
- Long running autonomous agents and background tasks use regular processing.
- "Human in the loop" scenarios use fast mode.
Which makes perfect sense, but the question is - does the billing also make sense?
I was thinking about in-house model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.
Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.
Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.
> Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
They said the 2.5X offering is what they've been using internally. Now they're offering it via the API: https://x.com/claudeai/status/2020207322124132504
LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
Where on earth are you getting these numbers? Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
Not magical thinking, not conspiratorial, just hypothetical.
Just because you can't afford to 10x all your customers' inference doesn't mean you can't afford to 10x your in-house inference.
And 2.5x is from Anthropic's latest offering. But it costs you 6x normal API pricing.
Also, from a comment in another thread, from roon, who works at OpenAI:
> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow
[0] https://nitter.net/tszzl/status/2016338961040548123
This makes no sense. It's not like they have a "slow it down" knob, they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.
That's also called slowing down the default experience so users have to pay more for fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.
Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?
These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.
They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come back for renewal it will already be too late, their code having become completely unreadable by humans. Individual devs can move quickly, but companies don't.
Are you at all familiar with the architecture of systems like theirs?
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
I am familiar with the business model. This is a clear indication of what their future plan is.
Also, I was just pointing out the business issue, raising a point which was not raised here. I just want people to be more cautious.
Slowing down with respect to what?
Slowing down with respect to the original speed of response. Basically what we used to get a few months back, and what is the best possible experience.
There is no "original speed of response". The more resources you pour in, the faster it goes.
Watch them decrease resources for the normal mode so people are penny-pinched into using fast mode.
Seriously, thinking about the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess that if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.
I’m curious what’s behind the speed improvements. It seems unlikely it’s just prioritization, so what else is changing? Is it new hardware (à la Groq or Cerebras)? That seems plausible, especially since it isn’t available on some cloud providers.
Also wondering whether we’ll soon see separate “speed” vs “cleverness” pricing on other LLM providers too.
It comes from batching and multiple streams on a GPU. More people sharing 1 GPU makes everyone run slower but increases overall token throughput.
Mathematically it comes from the fact that the transformer block is a parallel algorithm. If you batch less and parallelize each request harder, you can get higher tokens/s per stream, but less total throughput. There's also a dial where you can speculatively decode more aggressively when serving fewer users.
It's true for basically all hardware and most models. You can draw a Pareto curve of throughput per GPU vs tokens per second per stream: more tokens/s per stream means less total throughput.
See this graph for actual numbers ("Token Throughput per GPU vs. Interactivity", gpt-oss 120B, FP4, 1K/8K, source: SemiAnalysis InferenceMAX™): https://inferencemax.semianalysis.com/
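To make that curve concrete, here's a toy model with made-up numbers (not Anthropic's setup and not the InferenceMAX data, just an illustration of why per-stream speed and per-GPU throughput pull in opposite directions):

```python
# Toy model of the batching tradeoff. All numbers are made up for illustration;
# real curves come from measurements like the InferenceMAX graph linked above.

STEP_OVERHEAD_MS = 20.0  # hypothetical fixed cost per decode step (weight loads etc.)
PER_SEQ_COST_MS = 0.5    # hypothetical marginal cost of each extra sequence in the batch

def decode_step_ms(batch_size: int) -> float:
    """Wall-clock time of one decode step with `batch_size` concurrent requests."""
    return STEP_OVERHEAD_MS + PER_SEQ_COST_MS * batch_size

print(f"{'batch':>6} {'tok/s per stream':>17} {'total tok/s per GPU':>20}")
for batch in (1, 8, 32, 128, 512):
    step_ms = decode_step_ms(batch)
    per_stream = 1000.0 / step_ms   # each request emits one token per step
    total = per_stream * batch      # every request in the batch advances together
    print(f"{batch:>6} {per_stream:>17.1f} {total:>20.1f}")
```

The bigger the batch, the slower each individual stream but the more tokens the GPU produces overall, which is exactly the Pareto curve described above.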
There are a lot of knobs they could tweak. Newer hardware and traffic prioritisation would both make a lot of sense. But they could also lower batching windows to decrease queueing time at the cost of lower throughput, or keep the KV cache in GPU memory at the expense of reducing the number of users they can serve from each GPU node.
> It seems unlikely it’s just prioritization
Why does this seem unlikely? I have no doubt they are optimizing all the time, including inference speed, but why could this particular lever not entirely be driven by skipping the queue? It's an easy way to generate more money.
Yes, it's 100% prioritization. Through that it's also likely running on more GPUs at once, but that's an artifact of prioritization at the datacenter level. Any task coming into an AI datacenter at the moment is split into fairly fine-grained chunks of work and added to queues to be processed.
When you add a job with high priority all those chunks will be processed off the queue first by each and every GPU that frees up. It probably leads to more parallelism but... it's the prioritization that led to this happening. It's better to think of this as prioritization of your job leading to the perf improvement.
Here's a good blog for anyone interested which talks about prioritization and job scheduling. It's not quite at the datacenter level but the concepts are the same. Basically everything is thought of as a pipeline. All training jobs are low pri (they take months to complete in any case), customer requests are mid pri, and then there are options for high pri. Everything in an AI datacenter is thought of in terms of 'flow'. Are there any bottlenecks? Are the pipelines always full and the expensive hardware always 100% utilized? Are the queue backlogs big enough to ensure full utilization at every stage?
https://www.aleksagordic.com/blog/vllm
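For anyone who wants to see the shape of it, here's a minimal priority-queue sketch of the idea (hypothetical tiers and structure, not Anthropic's actual scheduler): work is split into chunks, and whenever a worker frees up it pulls the highest-priority chunk first.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority tiers; real deployments will have their own.
FAST, STANDARD, TRAINING = 0, 1, 2   # lower value = served first

@dataclass(order=True)
class Chunk:
    priority: int
    seq: int                          # tie-breaker keeps equal-priority chunks FIFO
    job_id: str = field(compare=False)

class Scheduler:
    def __init__(self) -> None:
        self._heap: list[Chunk] = []
        self._counter = itertools.count()

    def submit(self, job_id: str, priority: int, n_chunks: int) -> None:
        # Each incoming request is split into fine-grained chunks of work.
        for _ in range(n_chunks):
            heapq.heappush(self._heap, Chunk(priority, next(self._counter), job_id))

    def next_chunk(self) -> Chunk | None:
        # Called whenever a GPU frees up: highest priority first, then FIFO.
        return heapq.heappop(self._heap) if self._heap else None

sched = Scheduler()
sched.submit("training-run", TRAINING, 3)
sched.submit("normal-request", STANDARD, 3)
sched.submit("fast-mode-request", FAST, 3)

while (chunk := sched.next_chunk()) is not None:
    print(chunk.job_id)   # fast-mode chunks drain first, training chunks last
```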
Until everyone buys it. Like a fast pass at an amusement park where the fast line is still two hours long.
At 6x the cost, and requiring you to pay full API pricing, I don’t think this is going to be a concern.
It's a good way to squeeze extra out of a bunch of people without actually raising prices.
I wonder if they might have mostly implemented this for themselves to use internally, and it is just prioritization but they don't expect too many others to pay the high cost.
Roon said as much here [0]:
> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow
[0] https://nitter.net/tszzl/status/2016338961040548123
I see Anthropic says so here as well: https://x.com/claudeai/status/2020207322124132504
Nvidia GB300 i.e. Blackwell.
> so what else is changing?
Let me guess. Quantization?
I really like Anthropic's web design. This doc site looks like it's using gitbook (or a clone of gitbook) but they make it look so nice.
It's just https://www.mintlify.com/ with a barely customized theme.
Looks like mintlify to me. Especially the copy page button.
I’d love to hear from engineers who find that faster speed is a big unlock for them.
The deadline piece is really interesting. I suppose there’s a lot of people now who are basically limited by how fast their agents can run and on very aggressive timelines with funders breathing down their necks?
> I’d love to hear from engineers who find that faster speed is a big unlock for them.
How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.
Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.
If it could help avoid you needing to context switch between multiple agents, that could be a big mental load win.
The idea of development teams bottlenecked by agent speed rather than people, ideas, strategy, etc. gives me some strange vibes.
The one question I have that isn't answered by the page is how much faster?
Obviously they can't make promises but I'd still like a rough indication of how much this might improve the speed of responses.
Yeah, is this Cerebras/Groq speed, or do I just skip the queue?
2.5x faster or so (https://x.com/claudeai/status/2020207322124132504).
6x more expensive
It doesn’t say how much faster it is, but my experience with OpenAI’s “service_tier=priority” option on SQLAI.ai is that it’s twice as fast.
So fast mode uses more tokens, in direct opposition to Gemini where fast 'mode' means less. One more piece of useless knowledge to remember.
You're comparing two different things. It's not useless knowledge, it's something you need to understand.
Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.
The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
I don't think this is the case, according to the docs, right? The effort level will use fewer tokens, but the independent fast mode just somehow seems to use some higher priority infrastructure to serve your requests.
The pricing on this is absolutely nuts.
For us mere mortals, how fast does a normal developer go through an MTok? How about a good power user?
A developer can blast through millions of tokens in minutes. With a context size of 250k, a million tokens is just 4 queries. But with tool usage and subsequent calls etc. it can easily burn many millions in one request.
But if you just ask a question or something it’ll take a while to spend a million tokens…
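Rough back-of-the-envelope numbers (the usage figures are made up, just to show the scale gap between agentic and chat-style sessions):

```python
# Back-of-the-envelope token math; the session sizes below are hypothetical.
CONTEXT_TOKENS = 250_000                        # the ~250k context mentioned above

print(1_000_000 // CONTEXT_TOKENS)              # 4 full-context queries per MTok

# Agentic coding: each tool call / follow-up resends a large chunk of context.
agent_round_trips = 40                          # hypothetical
print(f"~{agent_round_trips * CONTEXT_TOKENS / 1e6:.0f}M tokens in one session")

# Chat-style usage: short prompts and answers take far longer to reach 1M tokens.
print(f"~{50 * 2_000 / 1e6:.1f}M tokens for 50 short Q&A turns")
```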
Seems like an opportunity to condense the context down to a 'documentation' level and only load the full text/code for files that are expected to be edited?
Yeah, that's what they try to do with the latest coding agents: sub-agents which only have the context they need, etc. But at the moment it's too much work to manage contexts at that level.
While it's an excellent way to make more money in the moment, I think this might become a standard no-extra-cost feature in several months (see Opus becoming way cheaper and a default model within months). Mental load management while using agents will become even more important it seems.
Why would they cut a money-making feature? In fact I'm already imagining them asking for a speed ransom every time you're in a pinch; some extra context space will also become buyable. Anthropic is in a penny-pinching phase right now and they will try to milk everything. Watch them add microtransactions too.
Yeah especially once they make an even faster fast mode.
Could be a use for the $50 extra usage credit. It requires extra usage to be enabled.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
After exceeding the ever-shrinking session limit with Opus 4.6, I continued with extra usage for only a few minutes and it consumed about $10 of the credit.
I can't imagine how quickly this Fast Mode goes through credit.
It has to be. The timing is just too close.
AFAIK, they don't have any deals or partnerships with Groq or Cerebras or any of those kinds of companies... so how did they do this?
Inference is run on shared hardware already, so they're not giving you the full bandwidth of the system by default. This most likely just allocates more resources to your request.
Could well be running on Google TPUs.
Whatever optimisation is going on is at the hardware level since the fast option persists in a session.
It's a good way to address the price insensitive segment. As long as they don't slow down the rest, good move.
Where is this perf gain coming from? Running on TPUs?
AI data centers are a whole lot of pipelines pumping data around using queues. They want those expensive, power-hungry cards near 100% utilized at all times. So they have a queue of jobs on each system ready to run, feeding into GPU memory as fast as completed jobs are read out of memory (and passed into the next stage), and they aim to have enough backlog in these queues to keep the pipeline full. You see responses in seconds, but at the data center your request was broken into jobs, passed around into queues, processed in an orderly manner, and pieced back together.
With fast mode you're literally skipping the queue. An outcome of all of this is that for the rest of us the responses will become slower the more people use this 'fast' option.
I do suspect they'll also soon have a slow option for those that have Claude doing things overnight with no real care for latency of the responses. The ultimate goal is pipelines of data hitting 100% hardware utilization at all times.
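To illustrate the worry about everyone else slowing down, here's a tiny queueing simulation with made-up arrival rates (not a claim about Anthropic's actual capacity or traffic): one toy server, one job served per time step, and fast-mode jobs always jump ahead of waiting normal jobs.

```python
import random
from collections import deque

def avg_normal_wait(fast_share: float, load: float = 0.9,
                    steps: int = 200_000, seed: int = 0) -> float:
    """Average queueing delay for normal jobs when `fast_share` of traffic gets
    strict priority. Slotted toy model: one arrival attempt and at most one job
    served per step; all numbers are illustrative."""
    random.seed(seed)
    fast_q: deque[int] = deque()
    normal_q: deque[int] = deque()
    waits = []
    for t in range(steps):
        if random.random() < load:                        # a request arrives
            (fast_q if random.random() < fast_share else normal_q).append(t)
        if fast_q:                                        # fast lane drains first
            fast_q.popleft()
        elif normal_q:
            waits.append(t - normal_q.popleft())          # time spent waiting
    return sum(waits) / len(waits) if waits else 0.0

for share in (0.0, 0.2, 0.5):
    print(f"fast-mode share {share:.0%}: avg normal-job wait ≈ {avg_normal_wait(share):.1f} steps")
```

The total work is the same either way; the more traffic that buys its way into the fast lane, the longer the remaining jobs sit in the queue.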
Will this mean that when cost is more important than latency, replies will now take longer?
I’m not in favor of the ad model ChatGPT proposes. But business models like these suffer from similar traps.
If it works for them, then the logical next step is to convert more users to fast mode. Which naturally means slowing things down for those who didn’t pick/pay for fast mode.
We’ve seen it with iPhones being slowed down to make the newer model seem faster.
Not saying it’ll happen. I love Claude. But these business models almost always invite dark patterns in order to move the bottom line.
Is this the beginning of the ‘Speedy boarding’ / ‘Fastest delivery’ enshittification?
Where everyone is forced to pay for a speed up because the ‘normal’ service just gets slower and slower.
I hope not. But I fear.
This is to test the room before real enshittification happens. Companies who bought from Anthropic are really in for a ride.
I pay $200 a month and don't get any included access to this? Ridiculous
Well, you can burn your $50 bonus on it
The API price is 6x that of normal Opus, so look forward to a new $1200/mo subscription that gives you the same amount of usage if you need the extra speed.
I always wondered this: is it true / does the math really come out that bad? 6x?
Is the writing on the wall for $100-$200/mo users that it's basically known to be subsidized for now, and $400/mo+ is coming sooner than we think?
Are they getting us all hooked and then going to raise it in the future, or will inference prices go down to offset?
..But it says "Available to all Claude Code users on subscription plans (Pro/Max/Team/Enterprise) and Claude Console."
Is this wrong?
It's explicitly called out as excluded in the blue info bubble they have there.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
https://code.claude.com/docs/en/fast-mode#requirements
I think this is just worded in a misleading way. It’s available to all users, but it’s not included as part of the plan.
Instead of better/cheaper/faster you just get the last one?
Back to Gemini.
But waiting for the agent to finish is my 2026 equivalent of "compiling!"
https://xkcd.com/303/
Give me a slow mode that’s cheaper instead lol
Interesting, the output price per MTok is insane.
> $30/150 MTok
Umm, no thank you.
LLM programming is very easy. First you have to prompt it to not make mistakes. Then you have to tell it to go fast. Software engineering is over bro, all humans will be replaced in 6 days bro
What is “$30/150MTok”? Claude Opus 4.6 is normally priced at “$25/MTok”. Am I just reading it wrong or is this a typo?
EDIT: I understand now. $30 for input, $150 for output. Very confusing wording. That’s insanely expensive!
Yeah I don't understand. Is it actually saying that fast mode is ten times more expensive than normal mode? I cannot be reading this right.
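For what it's worth, the arithmetic seems to work out to 6x rather than 10x if you read "$30/150" as input/output per MTok. A quick sanity check using the figures quoted in this thread, assuming standard Opus is $5 in / $25 out per MTok (the $5 input figure is an assumption inferred from the 6x claim above, not something quoted here):

```python
# Sanity check on the "$30 / $150 per MTok" reading. The standard-Opus input
# price below is an assumption inferred from the 6x multiplier mentioned in
# this thread, not a figure quoted from the pricing page.
standard = {"input": 5.0, "output": 25.0}   # $/MTok, assumed
fast = {"input": 30.0, "output": 150.0}     # $/MTok, as read above

for kind in ("input", "output"):
    print(f"{kind}: {fast[kind] / standard[kind]:.0f}x")    # 6x for both

# Example request: 200k input tokens, 20k output tokens.
in_tok, out_tok = 200_000, 20_000
cost_std = in_tok / 1e6 * standard["input"] + out_tok / 1e6 * standard["output"]
cost_fast = in_tok / 1e6 * fast["input"] + out_tok / 1e6 * fast["output"]
print(f"standard ≈ ${cost_std:.2f}, fast ≈ ${cost_fast:.2f}")   # $1.50 vs $9.00
```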