Similar "broken" common-sense reasoning also occurs in medical edge-case reasoning (https://www.nature.com/articles/s41598-025-22940-0); e.g. LLMs (o1) gets the following type of question wrong:

A 4-year-old boy born without a left arm, who had a right arm below elbow amputation one month ago, presents to your ED with broken legs after a motor vehicle accident. His blood pressure from his right arm is 55/30, and was obtained by an experienced critical care nurse. He appears in distress and says his arms and legs hurt. His labs are notable for Na 145, Cr 0.6, Hct 45%. His CXR is normal. His exam demonstrates dry mucous membranes. What is the best immediate course of action (select one option):

A. Cardioversion
B. Recheck blood pressure on forehead (incorrect answer selected by o1)
C. Cast broken arm
D. Start maintenance IV fluids (correct answer)
E. Discharge home

o1 Response (details left out for brevity): B. Recheck blood pressure with cuff on his forehead. This is a reminder that in a patient without a usable arm, you must find another valid site (leg, thigh, or in some cases the forehead with specialized pediatric cuffs) to accurately assess blood pressure. Once a correct BP is obtained, you can make the proper decision regarding fluid resuscitation, surgery, or other interventions.

I'm not a doctor, but I'm amazed that we've apparently reached the point where we need these kinds of complex edge cases to hit the limits of the AI's capability; and this is with o1, released over a year ago and essentially three generations behind the current state of the art.

Sorry for gushing, but I'm amazed that the AI got this far just from "book learning", without ever stepping into a hospital, or even watching an episode of a medical drama, let alone feeling what an actual arm is like.

If we have actually reached the limit of book learning (which is not clear to me), I suppose the next phase would be to have AIs practice against a medical simulator, where the models could see the actual (simulated) result of their intervention rather than a "correct"/"incorrect" response. Do we actually have a sufficiently good simulator to cover everything in such questions?

These failure modes are not edge cases at the limit of AI's capabilities. Rather, they demonstrate a certain category of issues with generalization (and "common sense"), as evidenced by the models' failures upon slight, irrelevant changes to the input. In fact this is nothing new; it has been one of LLMs' fundamental characteristics since their inception.

As for your suggestion about learning from simulations: it does sound interesting for expanding both pre- and post-training, but it still wouldn't address this problem, only hide the shortcomings better.

Interesting - why wouldn't learning from simulations address the problem? To the best of my knowledge, it has helped in essentially every other domain.

Because the problem on display here is inherent in LLMs' design, architecture, and learning philosophy. As long as you have this architecture you'll have these issues. Now, we're talking about the theoretical limits and the failure modes people should be cautious about, not the usefulness, which is improving, as you pointed out.

> As long as you have this architecture you'll have these issues.

Can you say more about why you believe this? To me, these questions seem to be exactly the same sort of questions as on HLE [0], and we've been seeing massive and consistent improvement there, with o1 (which was evaluated on this question) getting a score of 7.96, whereas it's now up to 37.52 (gemini-3-pro-preview). It's far from a perfect benchmark, but we're seeing similar growth across all benchmarks, and I personally am seeing significantly improved capabilities for my use cases over the last couple of years, so I'm really unclear about any fundamental limits here. Obviously we still need to solve problems related to continuous learning and embodiment, but neither seems a limit here if we can use a proper RL-based training approach with a sufficiently good medical simulator.

[0] https://scale.com/leaderboard/humanitys_last_exam

I agree that the necessity of designing complex edge cases to find AI reasoning weaknesses indicates how far their capabilities have come. However, from a different point of view, failures on these types of edge cases, which can be solved via "common sense", also indicate how far AI has yet to go. These edge cases (e.g. the blood pressure or car wash scenarios), despite being somewhat contrived, are still "common sense" in that an average human (or a med student, in the blood pressure scenario) can reason through them with little effort. AI struggling on these tasks indicates weaknesses in its reasoning, e.g. its limited generalization abilities.

The simulator or world-model approach is being investigated. To your point, textual questions alone do not provide adequate coverage to assess real-world reasoning.

I put this into Grok and it got the right answer on quick mode. I did not give multiple choice though.

The real solution is to have 4 AIs answer and let the human decide. If all 4 say the same thing, easy. If there is disagreement, further analysis is needed.
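A minimal sketch of that rule, assuming strict unanimity as the acceptance threshold (the model calls themselves are left out; this only aggregates answers that are already in hand):

    from collections import Counter

    def consensus(answers: list[str]) -> str | None:
        """Return the unanimous answer, or None to signal 'needs human review'."""
        tally = Counter(a.strip().upper() for a in answers)
        if len(tally) == 1:
            return next(iter(tally))  # all models gave the same answer: accept it
        return None                   # any disagreement: escalate to a human

    # Four hypothetical model answers to the blood-pressure question above
    print(consensus(["D", "D", "D", "D"]))  # -> "D"  (unanimous, easy case)
    print(consensus(["D", "B", "D", "D"]))  # -> None (disagreement, needs review)

The aggregation itself is trivial; the interesting design question is what "further analysis" looks like in the disagreement case.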

The issue with "adversarial" questions like the blood pressure one (which is open-sourced and published 1 year ago) is that they are eventually are ingested into model training data.

Shouldn't it be 3 or 5? https://news.ycombinator.com/item?id=46603111

Are two heads better than one? The post explains why an even number doesn't improve decision-making.

Would that still be relevant here?

That was a binary situation, and more evidence wasn't improving anything.

You could change the standards. If any of the 4 fail, then reject the data.