These failure modes are not edge cases at the limit of AI’s capabilities. Rather, they demonstrate a certain category of issues with generalization (and “common sense”), as evidenced by the models’ failure upon slight, irrelevant changes in the input. In fact, this is nothing new; it has been one of LLMs’ fundamental characteristics since their inception.

As for your suggestion of learning from simulations, it does sound interesting for expanding both pre- and post-training, but it still wouldn’t address this problem; it would only hide the shortcomings better.

Interesting - why wouldn't learning from simulations address the problem? To the best of my knowledge, it has helped in essentially every other domain.

Because the problem on display here is inherent in LLMs’ design, architecture, and learning philosophy. As long as you have this architecture, you’ll have these issues. Now, we’re talking about the theoretical limits and the failure modes people should be cautious about, not the usefulness, which is improving, as you pointed out.

> As long as you have this architecture, you’ll have these issues.

Can you say more about why you believe this? To me, these seem to be exactly the same sort of questions as on HLE [0], and we've been seeing massive and consistent improvement on it: o1 (which was evaluated on this question) got a score of 7.96, whereas the top score is now 37.52 (gemini-3-pro-preview). It's far from a perfect benchmark, but we're seeing similar growth across all benchmarks, and I personally am seeing significantly improved capabilities for my use cases over the last couple of years, so I'm really unclear about any fundamental limits here. Obviously we still need to solve problems related to continuous learning and embodiment, but neither seems to be a limit here if we can use a proper RL-based training approach with a sufficiently good medical simulator.

[0] https://scale.com/leaderboard/humanitys_last_exam