These tests are looking increasingly like a waste of time.

The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.

Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weaknesses in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edge cases, I can crank things up to GPT-5.5 for the targeted scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
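
A rough sketch of what keeping a "capability reserve" could look like in code: try the cheaper tier first and only escalate when a domain check rejects the answer. Everything named here (`answer_with_reserve`, `fake_call_model`, the tier names) is hypothetical and only for illustration; a real setup would swap in an actual model client and a domain-specific acceptance check.

```python
from typing import Callable, Optional, Sequence, Tuple


def answer_with_reserve(
    prompt: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try each model tier in order, cheapest first.

    Returns (model_name, answer) for the first acceptable answer,
    or None if every tier fails (the "up against the wall" case).
    """
    for model in tiers:
        answer = call_model(model, prompt)
        if is_acceptable(answer):
            return model, answer
    return None


# Stand-in for a real client (assumption): the smaller tier returns
# an empty answer on prompts containing the word "hard".
def fake_call_model(model: str, prompt: str) -> str:
    if model == "small-model" and "hard" in prompt:
        return ""
    return f"{model} answer to: {prompt}"


TIERS = ["small-model", "big-model"]  # hypothetical tier names

if __name__ == "__main__":
    # bool() serves as a trivial acceptance check: empty answers are rejected.
    print(answer_with_reserve("easy question", TIERS, fake_call_model, bool))
    print(answer_with_reserve("hard question", TIERS, fake_call_model, bool))
```

If the tier list only contains the top model, the function has nothing left to escalate to and returns `None` on the first rejected answer.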

"the intelligence is clearly there"

I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail so miserably that it is obvious the "intelligence" is not there.

It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.

But when an LLM does it on an area we know, we notice and suddenly it's too much.

Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.

With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.

Humans fail in infinitely more complicated ways than LLMs. They can have a difficult personality, a medical issue, family stress, a hangover, sleep deprivation, or they can just wake up on the wrong side of the bed. On any given day, you never know if you will get an expert in domain X or a sleep-deprived version of the same person who accidentally drops a database.

Indeed, if you remember before AI took the world by storm, HN used to be chock-full of articles about how the hiring process is broken for both employers and candidates, where you can never tell if what you see is what you get.

When I run a local LLM I get none of that. I hit the intelligence walls or buggy behaviour, but it doesn't matter if it's 8am or 8pm, the model behaves exactly the same. If something doesn't work as I wished, I can retry as many times as I want without the model getting angry at me.

Damned squishy humans, with their feelings and moods...

Indeed. It's like saying "the strongest human on their best day can support the roof of this tent for hours, how dare you criticise them for being squishy humans" when someone says "why don't we make an a-frame out of wood?"

No. It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars. For example…

A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:

“That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.

Short answer There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording.” Etc. etc. … Why it sounds like Spurgeon It fits his theology and rhetoric almost perfectly: • etc etc. … Closest authentic themes (but not the quote) Spurgeon repeatedly says things like: • etc etc. … So the quote is basically: a modern condensation of real Spurgeon ideas, not a verifiable citation etc. etc.”

Utter bullshit. One web search produces the full sermon manuscript with the quote.

One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.

>It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars.

"The Dunning-Kruger effect describes a disturbing cognitive bias that afflicts us all. People with limited expertise in an area tend to overestimate how much they know—and we all have gaps in our expertise." [1]

[1] https://www.openmindmag.org/articles/david-dunning-on-expert...

Doubting if a random quote is correct is understandable given how often the training data has explanations that random quotes from famous people aren’t real. But it isn’t intelligent to proclaim that when you have the internet as a resource.

Nobody that I know would do this.

> But when an LLM does it on an area we know, we notice and suddenly it's too much.

Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?

Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.

Anthropic says that LLM code "structurally exceeds human standards".

It really depends on the field you are in, the tasks you set, and how much of it was in the training set. A web developer will find it succeeding in all tasks - while some C++ exotic physics simulation developer will find it lacking.

The "works for me" tells you more about the field of the LLM reviewer than about the LLM.

Funny you used this example :)

I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago, the remainder of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.

My biggest regret is actually accepting its choice of pixi, if I had just trusted what I knew and done my own renderer too it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change.

I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job

E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.

> while some c++ exotic physics simulation developer will find it lacking

Can confirm, but I always read I am holding it wrong.

I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials

The issue is that once you hit niche physics simulations there simply isn't any training data available, so the limitations of these models become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically

I thought I'd give chatgpt a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer

>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.

I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that consistently tells me I'm a genius

Gemini gives a somewhat more robust answer, but fails catastrophically on the question "is the bssn formalism numerically stable", where just about the entire answer is completely wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem

I struggle to see how these tools are of any use

That's why there are companies specialising in AI for physics, like Emmi AI (now part of Mistral). If BMW and Airbus go on stage to talk about how they're using it for their physics simulations, it's probably at least decent.

Usage isn't really a good indicator of quality in the AI space at the moment. The issue is that there's inherently no way that an AI physics sim can be as good as a real physics simulation, which makes it a very low-value prospect

Usage by reputable engineering organisations with strict compliance and external testing validation (most notably Airbus, they have to prove to EASA that their tests are real and representative) is a decent indicator that there is something there.

Do we have real case studies, or just a bunch of declarations? "Using AI for our physics simulations" is as vague as it can be.

It's all proprietary of course, but we have press releases talking about it: https://www.press.bmwgroup.com/global/article/detail/T045812...

There is absolutely no data, review, evidence, or any indication whatsoever of how this is being used, or what the efficacy of it is

The current trend in every industry is to jump onto anything, call it AI, and pretend it's being used everywhere. There's absolutely good reason to be sceptical of this

> confidently lie, or blatantly plagiarise

Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)

You're not. People are just using a hammer to build a shed and telling you it's surely good to dig a hole too.

After adding an adversarial review gate to implementation plans and code I saw a large uptick in quality. I use Opus 4.8 as plan writer and orchestrator. For the adversarial reviewer I use GPT 5.5.

I still find things to tweak and fix up, but the number has dropped pretty dramatically. As always I am responsible for what I ship, so I review and test everything of course. I still think we are a ways away from a fully automated software forge, but what is currently possible is pretty cool.
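
The shape of such a review gate, as a minimal sketch: a writer produces a draft, a separate reviewer lists objections, and the writer revises until the reviewer has nothing left or a round limit is hit. The `write` and `review` callables are hypothetical stand-ins for calls to two different models, and the fake implementations below exist only to make the loop runnable. This is one plausible arrangement, not a description of the exact setup above.

```python
from typing import Callable, List, Tuple


def review_gate(
    task: str,
    write: Callable[[str, List[str]], str],
    review: Callable[[str, str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Revise a draft until the reviewer stops objecting.

    Returns (final_draft, outstanding_objections). A non-empty second
    element means the gate did not pass and a human should look closer.
    """
    draft = write(task, [])
    for _ in range(max_rounds):
        objections = review(task, draft)
        if not objections:
            return draft, []
        # Feed the reviewer's objections back to the writer.
        draft = write(task, objections)
    # The last revision has not been reviewed yet, so review it once more.
    return draft, review(task, draft)


# Fake writer (assumption): appends a marker for each objection it addresses.
def fake_write(task: str, objections: List[str]) -> str:
    return task + "".join(" +" + o for o in objections)


# Fake reviewer (assumption): objects until the draft mentions tests.
def fake_review(task: str, draft: str) -> List[str]:
    return [] if "+tests" in draft else ["tests"]


if __name__ == "__main__":
    print(review_gate("plan", fake_write, fake_review))
```

Whatever the gate returns still goes through human review and testing; the loop only reduces how much is left to catch.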

I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
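
Back-of-the-envelope arithmetic for why cheap checking matters so much. This assumes the check is reliable and that attempts are independent, which is optimistic: in practice a model that fails once on a problem is more likely to fail again on the same problem. Under those assumptions, the chance that at least one of k attempts passes is 1 - (1 - p)^k. The function name is just for illustration.

```python
def success_within(p: float, attempts: int) -> float:
    """P(at least one attempt passes the check), assuming independent attempts."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p) ** attempts


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(success_within(0.8, k), 3))
```

With p = 0.8 that works out to 80% after one attempt, 96% after two, and 99.2% after three, as long as each failure can actually be detected.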

Can I ask what your task and application is? A ~20% failure rate sounds atypical. If you’re slightly hyperbolic and mean something like 2-5%, yeah that’s a property of LLMs; but also heavily affected by how you prompt and how you constrain the task.

An auditing/QA step (whether a grading checklist, verification, etc) can get you further. Likewise for a planning step.

That's a better score than I'd give my own thinking.

In my experience of hiring and managing people, I would have been very happy if they gave good answers or produced good results 80% of the time.

GPT-5.5, 100% so far for all of my problems that actually have an answer.

I agree. I feel like sonnet 4.6 is sufficient for almost everything. Beyond that level it feels like the orchestration is more important.

That being said, the models still surprise me on a daily basis with a broad range of hallucinations, a lack of epistemology or common sense, or an inability to follow instructions.

Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was like pulling teeth out of a shark.

> clearly there

The very fact that we have to ask "there where?", and that we have all met clearly unintelligent bots, creates a requirement to define where it (intelligence) is and to investigate what put it there, in order to get guarantees that intelligence will be met consistently and structurally, not just casually or apparently.

Casual use, casual tool; mission critical use, certified tool.

Why would it be a "waste of time"?

We are just getting into the nitty-gritty of LLM benchmarking - to be fair, they still have a long way to go IMO. But it's incredibly exciting that a locally run LLM is capable of producing results similar to a SOTA model.

> Domain specificity (harness & environment) is where the magic happens next.

not really. it happens in training and RL. your harness is not going to override what it has been trained to do.

sure, a harness is useful if you are trying to build crud websites and the model is trained on stamping out crud websites. But that's just a waste of time remixing things better.

> I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce.

What? You can and you should. That's exactly what product tests enable you to do. If you need a glue, you want to look at someone who tried to glue some things with a few glues so you know roughly what to expect from which specific glue.