Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 Flash costs about the same as Gemini 2.5 Pro, which was $1.25/$10.

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
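
For anyone who wants to check the multiples, here is a quick sketch that recomputes them from the eval costs quoted above. The dollar figures are copied from the comment; the function name is just for illustration.

```python
# Eval costs (USD) as quoted above from artificialanalysis.ai.
eval_cost = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}


def multiple(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline` for the full eval."""
    return eval_cost[model] / eval_cost[baseline]


for name in eval_cost:
    print(f"{name}: {multiple(name, 'Gemini 2.5 Flash'):.1f}x vs 2.5 Flash")

print(f"3.5 Flash vs 2.5 Pro: {multiple('Gemini 3.5 Flash', 'Gemini 2.5 Pro'):.1f}x")
print(f"3.5 Flash vs 3.0 Flash: {multiple('Gemini 3.5 Flash', 'Gemini 3.0 Flash'):.1f}x")
```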

They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.
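
A minimal sketch of what such an abstraction layer might look like, in Python rather than Kotlin. Every name here (`ChatProvider`, `EchoProvider`, `Summarizer`) is hypothetical, and the backend is a stand-in that makes no network calls; this is not Koog's API or any vendor's SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatProvider(Protocol):
    """Anything that can turn a prompt into text. Hypothetical interface."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in backend for illustration; a real one would call a vendor SDK here."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class Summarizer:
    """Application code depends only on ChatProvider, never on a vendor SDK."""

    provider: ChatProvider

    def summarize(self, text: str) -> str:
        return self.provider.complete(f"Summarize: {text}")


app = Summarizer(EchoProvider("vendor-a"))
print(app.summarize("pricing thread"))

# Swapping vendors is a one-line change where the app is wired together.
app.provider = EchoProvider("vendor-b")
print(app.summarize("pricing thread"))
```

The point is only that the application never imports a vendor package directly. Prompts and evals still have to be re-checked per model.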

I think the big 3 are cartelizing and starting to ratchet up costs. GPT5.5 is not easily distinguishable from 5.1. I wouldn't be shocked if we hit the ceiling and everyone is quietly positioning for the exit.

switching models is insanely cheap compared to token cost on anything significant, this is a take so cynical it misses the reality

in any corporate or half compliance-relevant setting switching isn't trivial. new DPA, subprocessor notifications, TIA, procurement review, security questionnaires, plus re-running your evals because prompts don't transfer 1:1. token cost is just one of the line items.

no it's really not, even the soggiest bank has multiple api vendors atm.

I agree with parent. I'm not sure where your stance is coming from.

From what I hear, most enterprise AI deployments are seat-based subscriptions with annual commitments.

50K FTE global firm. We’re still piloting ChatGPT. AI is a four-letter word and there are ridiculous ceremonies and hundreds of hours of overhead for every trivial use case.

Amusingly, Enterprise credits are more expensive than just paying a zero-commitment on-demand API fee. Personal accounts are still the best value.

Yes, I work at a 50 person startup and even here switching from CC to codex or cursor would be non-trivial for multiple reasons - not just the annual commitment.

> now that they have people who built services on their API

People really can’t wait to be the next Zynga

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

I will definitely not be updating to this new model, and I think once 2.5 Flash is deprecated I'll have to re-architect so Gemini is only used for web grounding requests. This is an insane price increase.

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

This is not priced at inference cost.

My guess: it's the price at which they make more money than if they rent the TPUs to other companies.

The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?

The price at which they could rent out the TPUs, i.e. the market rate, is the inference cost.

Just because you are vertically integrated doesn't mean you get to discount one business unit's products to the other. Doing so discounts the opportunity cost you pay and is just bad accounting.

Basic business principle, you charge what people are willing to pay not what it costs.

Look up “double marginalisation”.

Depends on if you have spare capacity I think. They have minimal competition so they might be maximizing profit by charging prices higher than what clears all their supply.

Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.

Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.

https://www.together.ai/pricing

https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)

Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.

But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.

...my opinions here are of course, conjecture built on top of conjecture....

Most of the training cost is not in the final training run, it's in all of the R&D (including salaries, equity, etc.) that it takes to get to the final training run. The actual cost of all of the TPUs (or GPUs), power, networking, storage, etc. for the final training run is significant, but it's even more expensive to have this huge R&D team doing frontier model development and using a lot of those same resources during development.

I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.

Not to discredit you, because you are 100% correct, but a tangential note about together.ai: they seem fairly unreliable, with constant outages or higher-than-normal latency.

It's probable that in 1 or 2 years local (free) models will completely take the place of cheap models, so cheap models need to move up the quality chain.

You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.

Flash seems to be targeting the near-frontier category.

That might work if it wasn't for FOMO. Are you ok with only $20 of frontier usage a month?

Subjective, but if we compare to compute not everyone needs the most expensive laptops or super computers for their work.

I think frontier models will be invaluable for scientific research, defense, financial analysis and such. But the average person probably would be reasonably well-served with a local model.

If you're in sales, customer service, product management and such - the leading open models at the 30B mark are already good enough.

This is trouble if you're not Google/OpenAI/Anthropic: they're all shifting towards pricing for the economic value of the knowledge work they're aiding.

The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.

That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.

At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.

(and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)

Maybe the margins are just very large for Google because they predict so much demand for 3.5?

This combined with locally runnable models getting pretty good recently (e.g. Qwen 3.6) tells me that it's time to seriously consider local dev setup again

Besides the cost, you get control, transparency, and the ability to identify small language models or LoRAs that you can serve even more cost-effectively.

This should become Apple's new hardware and software play. I am hopeful about the new CEO.

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.

Gonna try it.

We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.

DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.

Any chance you'd be willing to talk further about your setup? I have 2 x 3090s in a local machine, and I'm still left with questions about how best to use stuff locally.

You can only run heavily quantized models on all 3/4/5 rtx gpus (with 32gb or less vram) - and you probably want moe versions like Qwen 35b for this to run at speed somewhat comparable to Claude. It’s still not there to be honest but getting there. Personally I mess around with llama.cpp on m5 max with 128gb - it’s a decent setup to try various medium sized things, and runs llms surprisingly well without quantization, at least the moe models.

How is that machine for local inference? It's a serious consideration for me, but getting to hear more from folks that already have it would be helpful.

Two 3090s is 48GB, so it's possible to run the 6-bit quantization comfortably, which is fine. It doesn't start to get notably dumber until lower than that. It won't be as fast as a hosted model, but dual 3090s will be comfortably fast for interactive use with the MoE version and not terrible to use with the dense model. I run the dense model at 8 bits on my dual Radeon V620 desktop machine, which I think would be slower than two 3090s, or at least not notably faster.
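
A back-of-the-envelope sketch of the sizing above, assuming a 35B-parameter model and counting only the weights. It ignores KV cache, activations, runtime overhead, and hybrid quantization schemes, so real usage will be higher; the function name is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# 35B is the parameter count mentioned in the thread; the bit widths are illustrative.
for bits in (4, 6, 8, 16):
    print(f"35B at {bits}-bit: ~{weight_gib(35, bits):.1f} GiB")
```

By this estimate the 6-bit weights leave a fair amount of a 48GB budget free for context, while unquantized 16-bit weights would not fit at all.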

Have you done comparisons with 4 bit and seen a noticeable difference for coding tasks?

No, I've just seen benchmarks showing most models start degrading around 4-5 bits. That's not to say they become useless, just that down to about 6-bits (with careful hybrid quantizations like unsloth where some of the layers aren't quantized or are quantized at higher bit depths) the quality isn't measurably degraded, but below that there are measurable differences in performance.

People report good results from DeepSeek V4 Flash at 2 bits (the DwarfStar 4 folks are doing it, and I've tried it on my Strix Halo, but it's too slow to be usable, so I haven't bothered to figure out if it's actually smart enough to use for anything).

Anyway, it's obvious models have to degrade in terms of knowledge, at any quantization, even though it may not show up clearly on benchmarks until lower. If you halve the size of the data available, it necessarily loses information about the world.

The data I've seen is stuff like the KL Divergence comparisons that Unsloth does which show something but not clearly whether there's an observable or significant difference in task performance.

One of the things I'm wondering about is what I'm missing for $LLM to create files on the local FS like Claude and Codex do. What I see instead is stuff just printing to stdout, rather than files on the filesystem.

What am I missing?

You're missing an agent. The model uses tool calls to interact with the filesystem, commands on the system, optionally search (you need a search MCP server, like Brave or Exa, and API key), etc.

I usually use the Zed Agent built into Zed editor for self-hosted models, but you could use Pi, OpenCode, Hermes, Claude Code, etc. there are many, many, agents.

The model just predicts text, Claude Code etc parse the output and do the actual file creation (or run shell commands that do it). If you have Claude Code installed look in ~/.claude/projects/... and you can see the transcripts of your actual sessions, or install Mini-SWE-Agent and play with that to get a feel for what's going on.
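
To make that concrete, here is a toy sketch of one agent step. `fake_model` stands in for a real LLM call and the JSON tool-call shape is invented for illustration; real agents such as Claude Code or Codex use their own formats and many more tools.

```python
import json
import tempfile
from pathlib import Path


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only emits text, here a JSON tool call."""
    return json.dumps({
        "tool": "write_file",
        "args": {"path": "hello.txt", "content": "hello from the agent\n"},
    })


def write_file(root: Path, path: str, content: str) -> str:
    """The harness, not the model, touches the filesystem."""
    target = (root / path).resolve()
    if root.resolve() not in target.parents:
        raise ValueError("refusing to write outside the workspace")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {path}"


def run_step(root: Path, prompt: str) -> str:
    """One agent step: ask the model, parse its output, execute the requested tool."""
    call = json.loads(fake_model(prompt))
    if call["tool"] == "write_file":
        return write_file(root, **call["args"])
    raise ValueError(f"unknown tool: {call['tool']}")


workspace = Path(tempfile.mkdtemp())
print(run_step(workspace, "create hello.txt"))
print((workspace / "hello.txt").read_text(), end="")
```

A real agent runs this in a loop, feeding each tool result back to the model as the next message until the model stops asking for tools.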

Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

Deepseek V4 (not flash) tripled in price too by the way (from Deepseek). Get used to this pattern.

This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint, you won't be. There's nothing special about being able to use an LLM.

Unlike other providers, Deepseek does promise that they will lower the price when their Huawei cards arrive in a few more months.

Give me a link. Cannot wait. One PSA is that they have a 75% discount right now, so it is already cheaper than the full price.

Weird, last time I checked it was right on the pricing page.

But even when it happens I doubt it would be as cheap as it is right now. Enjoy it while it lasts!

Actually, deepseek v4 was 1/3 promotional price for the first month or so. This was pretty clearly communicated. The promotions window just ended is all.

Anyone can host Deepseek V4 on rented GPUs and sell inference on it. Price will very quickly converge to the marginal cost of inference. This is as close to a pure commodity as it gets in the AI space so competitive market economics will put in work. Same is true for any open-weights model.

You don't understand the costs involved to run inference at scale.

Please go run some numbers. The hardware needed to run Deepseek v4 flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.

Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1M tokens. You make less than $1k per month, and the hardware required to run that costs $10k a month to rent, with hardly any concurrent session capability.
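
A rough sketch of that arithmetic, assuming a single stream that generates tokens nonstop for a 30-day month and is billed on output tokens only. The 150 tps, 30 cents, and $10k figures come from the comment; everything else is an assumption.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month, an assumption


def monthly_revenue(tokens_per_second: float, usd_per_million: float) -> float:
    """Revenue from one stream generating tokens nonstop for a month."""
    tokens = tokens_per_second * SECONDS_PER_MONTH
    return tokens / 1_000_000 * usd_per_million


single_stream = monthly_revenue(150, 0.30)
print(f"one stream: ${single_stream:,.2f}/month")

# How many such streams must run concurrently to cover the $10k/month rental figure.
print(f"streams needed to break even: {10_000 / single_stream:.0f}")
```

Under these assumptions one stream earns well under $1k a month, so the economics depend entirely on how many concurrent sessions the same hardware can batch.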

> Please go run some numbers.

- DeepSeek serves DeepSeek V4 Pro at 27 tps: https://openrouter.ai/deepseek/deepseek-v4-pro

- At 27 tps per user, a B300 GPU will give you around 800 tokens per second (serving 30 users): https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...

- That's 800 * 60 * 60 generated tokens per hour, at a cost of $0.87 per 1M tokens, or $2.50 per hour.

- For input and output tokens, the math is a bit more complicated because we have to make assumptions about their ratio. Using the published values from OpenCode, we get another $2.50 for cached tokens (which are almost free for DeepSeek) and another $3.40 for input tokens (which are a lot cheaper to compute than output tokens), which gives us a total of $8.50 per hour per B300 GPU.

- B300 GPUs can be rented for as low as $3.40 per hour, which is less than $8.50, so hosting DeepSeek V4 Pro is profitable.
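
The steps above can be sketched in a few lines. All inputs are the figures quoted in the bullets; the variable names are illustrative. Summed as stated, the components come to about $8.40 per hour, slightly under the rounded $8.50 quoted, which does not change the conclusion.

```python
TOKENS_PER_SECOND = 800        # aggregate output throughput per GPU, from the comment
OUTPUT_USD_PER_MILLION = 0.87  # output price used in the comment
CACHED_USD_PER_HOUR = 2.50     # comment's estimate for cached tokens
INPUT_USD_PER_HOUR = 3.40      # comment's estimate for input tokens
RENTAL_USD_PER_HOUR = 3.40     # lowest B300 rental price quoted in the comment

output_tokens_per_hour = TOKENS_PER_SECOND * 60 * 60
output_usd_per_hour = output_tokens_per_hour / 1_000_000 * OUTPUT_USD_PER_MILLION
revenue_per_hour = output_usd_per_hour + CACHED_USD_PER_HOUR + INPUT_USD_PER_HOUR
margin_per_hour = revenue_per_hour - RENTAL_USD_PER_HOUR

print(f"output revenue: ${output_usd_per_hour:.2f}/h")
print(f"total revenue:  ${revenue_per_hour:.2f}/h")
print(f"margin:         ${margin_per_hour:.2f}/h")
```

This assumes the GPU is fully utilized every hour it is rented, which is the optimistic case.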

You could also host it at fewer tps per user to raise the efficiency and therefore the profit even higher.

Even not assuming Blackwell inference the $3.50/hr price is likely close to the marginal cost. The Deepseek R1 model is a little more than a third of the size of V4 and cost around $1/Mtok to serve at scale based on deepseek's blogs last year and Hopper rental prices.

Yes it is more efficient in $/tok to run at scale than to run just for yourself. Everyone selling Deepseek V4 inference is selling an undifferentiated good. They have run the numbers on how much it costs and are competing against a dozen other outfits also selling undifferentiated open weights tokens. Whatever the dollar cost they face to rent those GPUs will be what they are able to charge in the competitive market. That is great for you and me because we can buy tokens at pretty much exactly what it costs to produce them.

Mate why are you so mad at people upset the price tripled? It's a fair complaint that people built services using the cheaper ones with the expectation future models would be similarly priced. You can avoid 'offloading thinking' while still building on top of these models

V4-Pro is about 2.4× total params and 1.3× active params of V3.2.

You're typing as your handwriting and letter sending abilities deteriorate to dust. Writing down information as your memory capacity decays. Remembering instead of living at the pure leading edge of perception dulling your reactions.

Smh, it's all downhill from the first unadulterated neuron.

I think demand is too great and compute is not enough. Nothing to do with billionaires colluding to increase prices by 3x.

Actually, why should Google collude on pricing? They have deep pockets and could starve out the competition while keeping prices low, if they really wanted.

I think it is priced high because it's basically their smartest model as well as their fastest, so why shouldn't they?

You can still use earlier generations of Flash at a lower cost if you want "fast and cheap and just OK," which often makes sense. (Just checked)

I would predict they will lower this price when 3.5 High appears, but perhaps not all the way.

What we need is a DeepSeek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.

To be honest, China not having access to the latest hardware is exactly what has driven LLM technology forward the last 2 years.

Why?

Because it forced them to focus on efficiency, instead of throwing more compute at the problem.

Just like in software, some of the most beautiful solutions come from constraints. Think, the optimisations that game developers implemented because of the frame budget.

On top of that, China is also facing hardware constraints, which is pushing companies to develop better domestic chips for AI training. It'll be interesting to see how things perform once Huawei's newer hardware is fully deployed at DeepSeek.

Open Source ASML EUV. But will wipe off trillions from US stocks so 401k may not like that.

You can use lots of open weight models today.

That's one solution to the problem. But it still needs some good computational capabilities. Either we optimize the hell out of those models, or we wait for the hardware to become good enough for them.

The real problem is the hardware to run them is still very expensive.

Maybe we can figure out better ways to use the models that can run on cheap hardware.

gemini isn't even that good. just tested 3.5 on usual complex prompts to opus/chat 5.5. meh

Are you really comparing flash to opus? Shouldn't you be comparing pro?

The benchmark tables in the Google announcement include Opus 4.7, and the numbers are very impressive. Caveat emptor, but it's not unreasonable to compare a new Flash to a current-gen Opus, even if some of the results confirm expectations

Who would have guessed that something costing roughly a third as much wouldn't do as well at certain tasks.

Well, the first impression is that Gemini still goes off the instruction rails easier than other models, but I noticed that it tends to go back to the initial goal without holding a hand, which is a real improvement. It's really interesting that these models behave so differently.

3.5 flash is listed as stable rather than preview, or am I misreading?

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

ah I mistakenly wrote preview

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric, though. You're comparing apples to oranges. Gemini 3.1 Flash is somewhere in the neighborhood between current Haiku and Sonnet, I think? Still a better value than the Anthropic models, I guess, which are quite pricey.

Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.

Definitely apples to oranges, sorry I wasn’t clear. I only included opus pricing for comparison—it is vastly superior. But even 3.1 flash lite is really useful.

Of course, if I manage to reach my limits every week on my Claude $200 sub, opus 4.7 is probably priced closer to flash!

>Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric,

Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.

Not in my fields of science: Genetics and neuroscience. The combination of Opus 4.7 Adaptive used with well-structured project folders is amazingly useful.

And even on coding, they are mostly good at generating new code.

They sure are not at thorough analysis or debugging, etc.

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models diverge more price-wise. The pro models will become more and more expensive.

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

> Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.

https://x.com/Steve_Yegge/status/2046260541912707471

A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.

https://x.com/demishassabis/status/2043867486320222333

This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:

https://x.com/mihaimaruseac/status/2046272726881693960

> and because the ban applied outside of Google work as well

I think false (or hasn't filtered to everyone lol)

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

No, 2.5 had both flash and flash lite.

It is Google, after all ....

In general, Gemini Flash is still relatively cheap compared to the "mini" versions from the other big 2. However, I agree that newer versions seem to come with multi-x price increases (similar to the new ChatGPT), and we certainly need competition from the open source models to keep these guys in check on pricing.

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.

Inference alone is certainly profitable. I'm running models at home that are comparable to performance of paid models a year or so ago for free. Even for much larger models the cost around inference serving are clearly manageable.

Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.

This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).

It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).

And if you can run those strong models at home for free, why would hosting them be a successful business for any of these providers?

Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?

For free == with a huge upfront cost of getting a good enough box and running costs of maintaining it and just keeping it powered. By the time it pays off the frontier labs are three generations ahead at least.

Compare with on-demand billing per token and it just doesn’t make sense to own the hardware if you aren’t using it productively or renting it out for 95% of the time.

If you can run your server at home for free why would hosting it be a successful business for any of these providers?

Arguably nothing even has to change with training for this to be sustainable. Dario has claimed that Anthropic is profitable on a per training run basis. They aren't profitable because they choose to keep investing in increasingly large training runs.

Cut the crap.

Free cash flow to the firm = EBIT(1-t) - Reinvestment, and the value of the firm's operating assets is the present value of those cash flows.

You (Anthropic) want that sky-high valuation? Accept reinvestment is part of the equation.

If they decide to stop reinvesting, then they are as good as dead.

Moreover, they clearly are not re-investing cash flows from operations. Why do you think they are continually raising money? Lmao.

If it's profitable, why haven't they reported any profits? People like Ed Zitron have done the math and it just doesn't add up. I mean he just published this piece today: https://www.wheresyoured.at/ai-is-too-expensive/

Amazon was unprofitable for over a decade, and they were public. There's no incentive to be profitable as a private company if you can continue to raise money.

Ed Zitron and Gary Marcus are... confused.

> Amazon was unprofitable for over a decade, and they were public.

Amazon was unprofitable because they poured their revenue into growth. On paper, they were in the red, but everyone - especially investors - saw what was going to happen, given their trajectory.

Is it the case that any of these AI companies are actually making a ton of money and growing accordingly? AFAICT, we've just got [a] big players like Google that can subsidize AI in the hopes of waiting everyone else out and [b] private companies raising capital in the hopes that when the market returns to rationality, they may be solvent.

Yes that is exactly what is happening. OpenAI and Anthropic are the fastest growing companies by revenue ever and their gross profit margins are healthy.

According to this article[0]:

> HSBC Global Investment Research projects that OpenAI still won’t be profitable by 2030, even though its consumer base will grow by that point to comprise some 44% of the world’s adult population (up from 10% in 2025). Beyond that, it will need at least another $207 billion of compute to keep up with its growth plans.

This article is from six months ago. Was HSBC wrong? Did something dramatically change in the last six months? Is OpenAI not, in fact, profitable? Or are they in fact doing well but making a huge investment (as was the case with Amazon 25ish years ago)?

I genuinely do not know, but my impression is that they're burning investment capital trying to compete with others' investment capital and Google's bottomless pockets.

[0] https://fortune.com/2025/11/26/is-openai-profitable-forecast...

Also OpenAI somehow having 44% of the world’s population as its customer base is a plainly absurd goal and will never happen, not in 5 years

and to make matters worse, they are massively over-valued.

Whoever buys the stock at a richly priced 1tn at ipo is a bozo lmao. I know I know, index funds will be forced to hold it bypassing the 1 year rule. Disaster already.

Then why do they constantly need more and more funding from VC and Google and MS and NVIDIA? Why is it all circular dealing? Why aren’t there smaller AI startups running these smaller, “profitable” models?

But I've been told here -- over and over again -- that the cost of inference was going to go down as the technology matured.

The trend lines are going in the opposite direction.

His entire brand is that the AI bubble will burst. By his account it was supposed to have burst several times by now. Like the doomers, it's not if, it's when, and they have to keep pushing back their predictions. Funny how both camps can be so confident. Alas, that's how they get eyes, ears and dollars.

That's not to say they will be or are wrong, it's just that they aren't exactly unbiased, or humble, sources.

Yeah, at this point I think the worst-case scenario for OpenAI/Anthropic/etc is to slow down frontier model development and focus on tooling and services, as opposed to imploding completely and bursting the economic bubble. I hope?

If you don't need SOTA or near SOTA there are plenty of dirt cheap models, just look at Gemma 4 31B on Openrouter.

For all of the use cases being hyped you really do, and you actually need something much better than the SOTA models to do what we are being told can be done.

The small models are useful for small things like summarizing text or search but not much else.

Yeah a lot of AI hype is look at the amazing new thing our new model can do! Like Google at this event. But when pressed about its pricing reality the answer is “use a worse cheaper model”?? Real convincing argument there

You mean Kimi or qwen

It is insanely profitable though, if you cut out r&d cost, plus the marketing and loss leaders. Don't let them gaslight you.

Even anthropic who does not own any hardware still have a big margin providing claude models.

Then why haven't they reported any profits using GAAP (generally accepted accounting principles)? They all use ARR which is easily gamed.

They aren't profitable on a GAAP basis and no one claims this. This obsession over profits is misguided. These are hyper growth companies growing at a scale never seen before. It is both deliberate and uncontroversial to invest in growth rather than slowing down to produce profits.

If my retirement money is going to end up invested in these companies, either directly when they IPO or indirectly through compute providers, then I would like to see some proof that they are capable of producing profits. "Trust me bro" just ain't gonna cut it.

I'm not really sure, but it might be that they count hardware purchases as losses, too.

Google has just recently upgraded their TPUs.

Everything is insanely profitable if you ignore the costs.

The premise is that if they stop training new models, it becomes pure profit after 2 years, once the hardware has finished paying for itself.

It's pretty funny that everyone says this business is unsustainable, but I have yet to see anyone go bankrupt, even the pure hardware providers who are renting out A100s and B200s.

And AI investors and stock market boosters are just going to accept OpenAI not having anything "new" to show for all their investments? What about replacing hardware once it's been burned out from constant high usage? Is it not odd to you that so many big AI deals get announced and are never heard from again? What's the business reason for neoclouds buying GPUs from NVIDIA only for NVIDIA to then pay them to rent them back? How does this make any sense?

They immediately undercut their argument to the point that I'm not sure if they were being sarcastic.

[dead]

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like: the Air gets better and pricier and becomes a Pro-level model, the Neo in the same way becomes an Air-level model, etc. But Apple's too design-oriented to do something like that. Google, well...

This change has made me decide to move to a multi-provider setup, such as through OpenRouter, for the consumer-facing LLM API in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.
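For anyone wanting the "others in the mix ready to go" part without committing to a specific router, here is a minimal sketch of ordered fallback across providers. Everything in it is an assumption for illustration: `Provider`, `complete_with_fallback`, `flaky_primary`, and `steady_backup` are hypothetical names, and the stand-in functions do not call any real vendor API.

```python
from typing import Callable, Sequence

# Hypothetical type: a provider is any function that takes a prompt and returns text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an exception."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each (name, provider) pair in order.

    Returns (name, text) from the first provider that succeeds.
    """
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for illustration only; real ones would wrap
# each vendor's SDK or HTTP API behind the same signature.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary unavailable")


def steady_backup(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    used, text = complete_with_fallback(
        "hi", [("primary", flaky_primary), ("backup", steady_backup)]
    )
    print(used, text)
```

The point of the shape is that the calling code only sees "prompt in, text out", so reordering or swapping providers is a config change. It does not solve the harder part, which is that prompts and evals don't transfer 1:1 between models.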

But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.

And if we think this way, it's possible that prices are actually falling?

Demis is on record saying they need small models on edge devices and if it’s on the edge the weights may as well be public officially.

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release a Flash Lite or something with a somewhat more affordable price point.
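To put a number on the jump, here is a rough sketch that prices one workload at the per-million-token rates quoted at the top of this thread. The dictionary keys are just labels (not real API model IDs), `workload_cost` is a hypothetical helper, and the 10M input / 2M output workload is an assumption. It also holds token counts fixed, so it ignores the point made upthread that 3.5 Flash uses more tokens for the same task.

```python
# Per-million-token prices (input, output) in USD, as quoted at the top of this thread.
# Keys are informal labels, not actual API model identifiers.
PRICES = {
    "gemini-2.5-flash": (0.30, 2.50),
    "gemini-3.0-flash-preview": (0.50, 3.00),
    "gemini-3.5-flash": (1.50, 9.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 10M input tokens and 2M output tokens.
    for model in PRICES:
        cost = workload_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
```

For that assumed workload this comes out to $8.00, $11.00, and $33.00, so about 4x from 2.5 Flash to 3.5 Flash before accounting for any change in token usage.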

There has been a Flash Lite tier since 2.5. The latest is currently 3.1.

And they are using this to power search answers?

I bet the API pricing helps pay for search users

It might be temporary pricing, given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards. That leaves them in a bit of a bind: 3.1 Pro no longer really makes sense, and 3.5 Pro has been delayed a bit.

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices they have to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

They have said AI will be priced like a utility, meaning $100-300 per month or so.

I use Gemini models in Junie daily. When I need accuracy I switch to Gemini 3.1 Pro Preview (why is it still in preview?), but it burns through credits, leaving me topping up $5 every day. 3.1 Flash Lite is just not accurate enough. 3 Flash is the sweet spot, just as JetBrains suggests.

Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

And far cheaper than comparable models. Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium).

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

The 09-2025 preview was awesome.

Just subscribe to the plan; it's cheaper.