I wonder if the providers are doing everyone, themselves included, a huge disservice by offering free versions of their models that are so incompetent compared to the SOTA models that these kinds of Q&A go viral, because the AI hype doesn't match the reality for unpaid users.

And it's not just the viral questions that are an issue. Over the last year or so I've seen people getting suboptimal results for $1000+ PC comparisons from the free reasoning version while the paid versions get it right; a senior scientist at a national lab concluding AI isn't really useful because the free reasoning version couldn't generate working code from a scientific paper, then being surprised when the paid version 1-shotted working code; and other similar examples.

How many policy and other quality-of-life choices are going to go wrong because people used the free versions of these models that got the answers subtly wrong, and the users couldn't tell the difference? What will be the collective damage to the world because of this?

Which department or person within the provider orgs made the decision to put thinking/reasoning in the name when clearly the paid versions have far better performance? Thinking about the scope of the damage they are doing makes me shudder.

I used a paid model to try this. Same deal.

I think the real misleading thing is marketing propping up the idea that paid models are somehow infinitely better when most of the time it's the same exact shit.

I copied/pasted a comment with faulty (self-defeating) logic directly from an HN comment and asked a bunch of models available to me (Gemini and Claude) if they could spot the issue. I figured it would be a nice test of reasoning, since an actual human missed it. The only one that found the logic error without help was Claude 4.6 Opus Extended Thinking. The others at best raised relevant counterpoints in the supporting argument but couldn't identify the central issue. Claude's answer seemed miles ahead. I wonder if SOTA advancements will continue to distinguish themselves.

Care to share the comment in question with the rest of us so we can check for ourselves? :-)

And midwits here are saying "yeah bro they have some MUCH better model internally that they just don't release to the public". Imagine being that dense. Those people probably went all in on NFTs too and told others "you just don't get it bro".

I have paid versions of ChatGPT and Claude, set them both to the best model, and they both told me to walk.

Claude told me: "Walk! At 25 meters, you'd barely get the car started before you arrived. It's faster and easier on foot — plus you avoid the awkwardness of driving a dirty car just a few seconds down the road."

I opened up a Claude Code session using Opus 4.6 with medium thinking.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Drive — you need the car at the car wash.

but it's so close

It is close, but you still need the car there to wash it! Drive it over, and enjoy the short 50-meter walk back if you want to stretch your legs while it's being cleaned.

I tried the "upside-down" cup question brought up in another comment in this thread, and it also nailed it:

Flip it upside down. The sealed top becomes the bottom (holding your drink), and the open bottom becomes the top you drink from.

IDK, maybe the web versions are not as good at logical reasoning as whatever they're using to power Claude Code, or you were unlucky and I was lucky?

Same. Claude nailed both questions, with the slightest hint of "... u serious?"

I pay for the $100 Opus 4.6 plan... maybe that makes a difference?

At this point there are enough reports of people getting these problematic responses with the paid models that it is concerning. Any chance you could post screenshots?

How much is the real (non-subsidized) cost of the "paid" plans? Does anyone in the world have an answer for this?

Also interested in this: the kWh figures people talk about do not match the price of the subscriptions.

Nor do they have to. Inference from different users is batched together.

OK? Even if they're batched? Grid energy is batched too.
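
To make the batching arithmetic concrete, here's a rough back-of-envelope sketch in Python. Every figure in it is a made-up placeholder, not a real provider number; the point is only the shape of the calculation: batching divides the per-token cost (and the per-token kWh) by roughly the batch size, which is one reason naive kWh-per-query math doesn't map onto subscription prices.

    # Back-of-envelope: how batching amortizes inference cost.
    # ALL numbers below are hypothetical placeholders, not real provider figures.

    gpu_cost_per_hour = 3.00       # assumed hourly rental cost of one inference GPU (USD)
    tokens_per_sec_single = 50     # assumed decode throughput serving a single user
    batch_size = 32                # assumed number of user requests batched together
    batching_efficiency = 0.6      # assumed fraction of linear scaling actually realized

    # Decoding is largely memory-bandwidth-bound, so serving a batch costs
    # far less than batch_size times the single-user cost.
    tokens_per_sec_batched = tokens_per_sec_single * batch_size * batching_efficiency
    tokens_per_hour = tokens_per_sec_batched * 3600
    cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000

    print(f"~${cost_per_million_tokens:.2f} per million output tokens")
    # With these invented numbers: about $0.87 per million output tokens.

None of which answers the parent's question about the true, unsubsidized cost; it just shows why per-user energy math and batched serving economics are different calculations.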

At work, paid GitLab Duo (which is supposed to be a blend of various top models) gets our more complex codebase hilariously wrong every time. Maybe our codebase is obscure to it (but it shouldn't be: standard Java stuff with the usual open-source libs), but it just can't actually add value for anything but small snippets here and there.

For me, the litmus test for any LLM is flawless creation of complex regexes from a well-formed prompt. I don't mean trivial stuff like email validation, but rather expressions at the limits of the regex spec. Not almost-there; just-there.
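
To illustrate what "limits of the regex spec" can mean, here's one sketch: matching balanced parentheses, which requires recursion that a plain regular language (and Python's stdlib re) can't express. This assumes the third-party regex module; it's just an example of the kind of pattern meant, not a claim about what any particular LLM can produce.

    # Balanced-parentheses matcher using PCRE-style recursion.
    # Requires the third-party 'regex' module (pip install regex);
    # the stdlib 're' module does not support (?R).
    import regex

    # (?R) recurses into the whole pattern: a '(' followed by any mix of
    # non-paren characters or nested balanced groups, then a ')'.
    BALANCED = regex.compile(r'\((?:[^()]|(?R))*\)')

    print(bool(BALANCED.fullmatch('(a(b)(c(d)))')))  # True: properly nested
    print(bool(BALANCED.fullmatch('(a(b)(c(d))')))   # False: one ')' missing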

I don't think 100% adoption is necessarily the ideal strategy anyway. Maybe 50% of the population seeing AI as all-powerful and buying the subscription, versus 50% of the population remaining skeptics, is a reasonably stable configuration. The first 50% get the advantage of the AI, whereas if everybody is super intelligent, no one is super intelligent.

Their loss

Yes, but the 'unwashed' 50% have pitchforks.

Lots of "unwashed" scientists too.

> a senior scientist at a national lab thinking ai isn't really useful because the free reasoning version couldn't generate working code

I would question whether such a scientist should be doing science; it seems they have serious cognitive biases.

My bad; I should have been more precise: "ai" in this case is "LLMs for coding".

If all one uses is the free thinking model, their conclusion about its capability is perfectly valid, because nowhere is it clearly specified that the 'free, thinking' model is not as capable as the 'paid, thinking' model. Even the model numbers are the same. And given that the highest-capability LLMs are closed source and locked behind paywalls, there is no means to arrive at a contrary, verifiable conclusion. They are a scientist, after all.

And that's a real problem. Why pay when you think you're getting the same thing for free? No one wants yet another subscription. This unclear marking is going to lead to so many things going wrong over time; what will the cumulative impact be?

> nowhere is it clearly specified that the 'free, thinking' model is not as capable as the 'paid, thinking'

Nowhere is it clearly specified that the free model IS as capable as the paid one either. So if you're uncertain whether it IS or IS NOT as capable, what sort of scientist assumes the answer IS?

> Nowhere is it clearly specified that the free model IS as capable as the paid one either. So if you're uncertain whether it IS or IS NOT as capable, what sort of scientist assumes the answer IS?

Putting the same model name/number on both the free and paid versions is the specification that performance will be the same. If a scientist has to bring their science background to bear just to interpret and evaluate product markings, then society has a problem. Any reasonable person expects products with the same labels to perform similarly.

Perhaps this is why Divisions/Bureaus of Weights and Measures are widespread at the state and county levels. I wonder if a person who brought a complaint about this situation to one of those agencies, or to a consumer protection agency, wouldn't be doing society a huge service.

They don't have the same labels though. On the free ChatGPT you can't select thinking mode.

> On the free ChatGPT you can't select thinking mode.

This is true, but thinking mode shows up based on the questions asked and some other unknown criteria. In the cases I cited, the responses were in thinking mode.